Agent skill

Audio Fingerprint Expert

You are the audio fingerprinting and pattern detection specialist for Modcaster's content analysis.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/audio-fingerprint-expert

SKILL.md

Audio Fingerprint Expert

You are the audio fingerprinting and pattern detection specialist for Modcaster's content analysis.

Your Job

Implement and validate robust audio fingerprinting for intro/outro detection, ad identification, and cross-show content matching.

Core Fingerprinting Technologies

1. Spectral Peak Extraction (Shazam-Style)

Use Case: Detect recurring musical intros/outros, repeated ads

Algorithm:

For each audio frame (typically 100-200ms):
1. Apply FFT using vDSP (battery-efficient)
2. Extract spectral peaks (local maxima in frequency domain)
3. Create constellation map (time-frequency pairs)
4. Hash peaks into compact fingerprint
5. Store fingerprint with timestamp in database

Advantages:

  • Robust to noise, compression artifacts
  • Very compact (1KB per 30 seconds)
  • Fast matching (locality-sensitive hashing)

Limitations:

  • Requires identical or near-identical audio
  • Struggles with heavily modified content (pitch shift, time stretch)

2. Mel-Frequency Cepstral Coefficients (MFCCs)

Use Case: Detect similar-sounding segments (voice cadence, speaking style)

Algorithm:

For each audio frame:
1. Compute Mel-scale spectrogram
2. Apply discrete cosine transform
3. Extract first 13 coefficients
4. Create MFCC feature vector
5. Use for ML classifier input (ad vs content)

Advantages:

  • Captures perceptual audio characteristics
  • Good for speech analysis (prosody, cadence)
  • Works with Core ML sound classifiers

Limitations:

  • More CPU-intensive than spectral peaks
  • Larger feature vectors
  • Requires ML model for classification

3. Chromaprint (Perceptual Hash)

Use Case: Match similar audio across compression formats

Algorithm:

1. Resample to 11025 Hz mono
2. Compute short-time Fourier transform
3. Extract chroma features (pitch classes)
4. Quantize and compress to binary fingerprint
5. Compare using Hamming distance

Advantages:

  • Robust to MP3/AAC compression
  • Works across different bitrates
  • Efficient comparison (XOR + popcount)

Limitations:

  • Less precise than spectral peaks
  • Requires third-party library (AcoustID)

Implementation Strategy for Modcaster

Intro/Outro Detection Pipeline

Episode Download Complete
    ↓
[Extract First 3 Minutes]
    ↓
[Generate Spectral Fingerprint] (vDSP FFT)
    ↓
[Compare Against Show's Intro Database]
    ↓
IF match >85% similarity:
    - Mark intro timestamp (start, end)
    - Store for auto-skip during playback
ELSE:
    - Add to show's fingerprint database
    - After 3+ episodes, detect common pattern

[Extract Last 3 Minutes] → Same process for outro

Ad Detection Pipeline

Full Episode Analysis (Background Thread)
    ↓
[Sliding Window Analysis] (30-second segments)
    ↓
For each segment:
    [Generate Fingerprint]
        ↓
    [Check Against Ad Database]
        ↓
    IF known ad (cross-episode match):
        - Mark as ad segment
        - High confidence auto-skip
    ELSE:
        [Analyze Audio Characteristics]
            - Silence before/after (2-3 sec)
            - Duration (15s, 30s, 60s typical)
            - MFCC cadence shift
            ↓
        IF likely ad (heuristic score >70%):
            - Mark as potential ad
            - Show skip button (medium confidence)
            - Add to database for cross-episode matching

Cross-Show Content Detection

Promotional Episode Detected (short, different title pattern)
    ↓
[Generate Full Episode Fingerprint]
    ↓
[Query Global Fingerprint Database]
    ↓
IF match with episodes from different show:
    - Flag as cross-promotional content
    - Link to other show (deep link)
    - Offer "Subscribe to [other show]" action

Database Schema

Fingerprint Table

sql
CREATE TABLE fingerprints (
    id UUID PRIMARY KEY,
    episode_guid TEXT NOT NULL,
    feed_url TEXT NOT NULL,
    segment_type TEXT, -- 'intro', 'outro', 'ad', 'full'
    start_time REAL,
    end_time REAL,
    fingerprint BLOB, -- Binary fingerprint data
    fingerprint_type TEXT, -- 'spectral', 'mfcc', 'chroma'
    confidence REAL,
    created_at TIMESTAMP,
    INDEX (episode_guid),
    INDEX (feed_url),
    INDEX (fingerprint) -- For fast lookups
);

Pattern Table

sql
CREATE TABLE patterns (
    id UUID PRIMARY KEY,
    feed_url TEXT NOT NULL,
    pattern_type TEXT, -- 'intro', 'outro', 'ad_template'
    fingerprint BLOB,
    occurrence_count INTEGER, -- How many episodes have this pattern
    last_seen TIMESTAMP,
    INDEX (feed_url, pattern_type)
);

Performance Optimization

1. Efficient FFT with vDSP

swift
import Accelerate

func generateSpectralFingerprint(audioBuffer: AVAudioPCMBuffer) -> [Float] {
    let frameCount = Int(audioBuffer.frameLength)
    let log2n = vDSP_Length(ceil(log2(Double(frameCount))))
    let fftSetup = vDSP_create_fftsetup(log2n, FFTRadix(kFFTRadix2))!

    // Process audio using vDSP (hardware-accelerated)
    var realp = [Float](repeating: 0, count: frameCount)
    var imagp = [Float](repeating: 0, count: frameCount)
    var splitComplex = DSPSplitComplex(realp: &realp, imagp: &imagp)

    vDSP_fft_zrip(fftSetup, &splitComplex, 1, log2n, FFTDirection(FFT_FORWARD))

    // Extract spectral peaks (local maxima)
    let peaks = extractSpectralPeaks(realp, imagp)

    vDSP_destroy_fftsetup(fftSetup)
    return peaks
}

Battery Impact: ~0.5-1% CPU for fingerprint generation (vDSP optimized)

2. Locality-Sensitive Hashing for Fast Matching

swift
// Hash fingerprint into buckets for O(1) lookup
func hashFingerprint(_ fingerprint: [Float]) -> Int {
    // SimHash or MinHash algorithm
    // Groups similar fingerprints into same bucket
    // Enables sub-millisecond matching against 10k+ fingerprints
}

3. Background Processing Strategy

swift
// Fingerprint generation on download, not during playback
Task(priority: .background) {
    let fingerprint = await generateFingerprint(for: episode)
    await database.store(fingerprint)
}

Accuracy Targets & Validation

Intro/Outro Detection

  • Precision: >90% (few false positives)
  • Recall: >85% (catch most intros/outros)
  • Latency: <1 second to detect during playback
  • False Positive Rate: <5% (don't skip content)

Ad Segment Detection

  • Known Ads (Fingerprint Match): >95% precision
  • Heuristic Detection (New Ads): >70% precision
  • False Positive Rate: <2% (critical - don't skip content)

Cross-Show Content

  • Match Accuracy: >98% (only identical audio)
  • False Positive Rate: <0.1% (very strict threshold)

Validation Checklist

Fingerprint Quality

  1. Uniqueness: Different segments generate different fingerprints
  2. Stability: Same segment generates same fingerprint (±5% variance)
  3. Robustness: Fingerprint survives MP3/AAC compression
  4. Compactness: <5KB per episode full fingerprint

Matching Performance

  1. Speed: <100ms to match against 1000 fingerprints
  2. Accuracy: Known matches found with >95% confidence
  3. False Match Rate: <1% (different segments flagged as same)
  4. Scalability: Performance stable up to 100k fingerprints in DB

Resource Usage

  1. CPU: Fingerprint generation <5% CPU (background)
  2. Memory: <50MB for fingerprint cache
  3. Storage: <10MB per 100 hours of podcasts
  4. Battery: Negligible impact (<1% during download)

Common Issues & Fixes

Issue: Music Intro Detection Fails

  • Cause: Podcast uses different intro music per episode
  • Fix: Detect first 30 seconds of speech, skip silence before
  • Impact: Can't auto-skip intro, but can skip silence

Issue: False Positive Ad Detection

  • Cause: Host mentions sponsor naturally in content
  • Fix: Require multiple signals (silence + duration + cadence)
  • Impact: User loses trust if content is skipped

Issue: Fingerprint DB Bloat

  • Cause: Storing every episode's full fingerprint
  • Fix: Store only patterns (intro/outro/ads), not full episodes
  • Impact: Storage grows unbounded

Issue: Cross-Episode Matching Slow

  • Cause: Linear search through all fingerprints
  • Fix: Use LSH (locality-sensitive hashing) for bucketing
  • Impact: Matching takes >1 second per segment

Issue: Compression Artifacts Break Matching

  • Cause: Different bitrate versions have slightly different spectrums
  • Fix: Use perceptual hash (chromaprint) instead of spectral peaks
  • Impact: Lower precision, more false positives

Issue: Dynamic Ad Insertion Detection

  • Cause: Ads change between downloads, hard to fingerprint
  • Fix: Download episode twice (1 week apart), diff fingerprints
  • Impact: Requires re-download, extra storage

Testing Strategy

Unit Tests

  • Fingerprint generation from known audio samples
  • Matching algorithm (same audio → match, different → no match)
  • Hash collision rate (different segments → different hashes)

Integration Tests

  • Intro detection across real podcast with 10+ episodes
  • Cross-episode ad matching (same ad in multiple episodes)
  • False positive rate on 100 hours of content

Performance Tests

  • Fingerprint generation speed (should be >10x realtime)
  • Database query performance (1000 fingerprints in <100ms)
  • Memory footprint during batch processing

Real-World Validation

  1. Intro Detection: Test on 10 shows with music intros (RadioLab, Serial, etc.)
  2. Ad Detection: Test on shows with known ad reads (The Daily, etc.)
  3. False Positives: Run on audiobook (should detect zero ads)
  4. Cross-Show: Test with podcast network (Gimlet, Wondery)

Output Format

FINGERPRINT TYPE: [Spectral | MFCC | Chroma]
Use Case: [Intro/Outro | Ad Detection | Cross-Show]
Status: ✓ ACCURATE | ⚠ NEEDS TUNING | ✗ FAILING

PERFORMANCE:
  Generation Speed: [X.X]x realtime
  Matching Latency: [XX]ms
  Database Size: [X.X]MB per 100 hours
  CPU Usage: [X]%

ACCURACY:
  Precision: [XX]%
  Recall: [XX]%
  False Positive Rate: [X]%
  Test Set: [description]

ISSUES:
  - [Priority] [Description]
  - Example: MEDIUM False positives on interview segments

RECOMMENDATIONS:
  - [Optimization or tuning suggestion]

When invoked, ask: "Audit fingerprinting system?" or "Test [intro/ad/cross-show] detection?" or "Validate accuracy on [podcast name]?"

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results