Agent skill

debug-pdf

Automated PDF failure analysis and fixture generation. Takes failed PDF URLs, identifies breaking patterns, and generates minimal fixtures via fixture-tricky for regression testing. Supports batch mode and combined stress test generation.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/debug-pdf

Metadata

Additional technical details for this skill

short description
Failure-to-fixture automation for PDF extractors

SKILL.md

Debug PDF Skill

Automate the lifecycle of an extraction failure: Failure -> Analysis -> Fixture -> Test

Why This Exists

Extractors (Marker, Surya, Camelot) break on specific PDF patterns (scanned pages, TOC dots, cursed fonts, watermarks). Manually reproducing these bugs is slow. /debug-pdf fast-tracks this by:

  1. Downloading the failed artifact
  2. Identifying structural "traps" (TOC dots, watermarks, ligatures, etc.)
  3. Generating minimal reproduction fixtures using fixture-tricky
  4. Combining multiple failures into a single stress test PDF

Quick Start

bash
# Analyze a single failed URL
./run.sh analyze "https://example.com/broken.pdf"

# Process multiple failures in batch
./run.sh batch failed_urls.txt --output report.json

# Combine all fixtures into one stress test PDF
./run.sh combine stress_test.pdf --max-pages 20

# List known failure patterns
./run.sh list-patterns

# Check session status
./run.sh status

Commands

analyze

Analyze a single PDF URL and optionally generate a reproduction fixture.

bash
./run.sh analyze "https://example.com/broken.pdf"
./run.sh analyze "https://example.com/broken.pdf" --no-repro
./run.sh analyze "https://example.com/broken.pdf" --send-inbox

batch

Process multiple URLs from a file (one URL per line).

bash
# Create URL file
echo "https://example.com/doc1.pdf" > failed.txt
echo "https://example.com/doc2.pdf" >> failed.txt

# Run batch analysis
./run.sh batch failed.txt --output analysis.json --send-inbox

combine [output.pdf]

Merge all generated fixtures into a single stress test PDF.

bash
./run.sh combine stress_test.pdf --max-pages 15

list-patterns

Display all known failure patterns and their descriptions.

status

Show current debug session status and fixture count.

Detected Patterns (14/17 = 82%)

Structural (4/4 detected):

  • scanned_no_ocr - Scanned image PDF without text layer
  • sparse_content_slides - Slide deck with minimal text per page
  • multi_column - Complex multi-column layouts (via text block analysis)
  • watermarks - Text obscured by watermark overlays

Encoding (5/5 detected):

  • toc_noise - Table of contents with dotted leaders
  • metadata_artifacts - Print metadata (Jkt/PO/Frm) in content
  • invisible_chars - Zero-width spaces, direction markers
  • curly_quotes - Windows-1252 encoded smart quotes
  • ligatures - fi/fl/ff ligature characters

Layout (4/4 detected):

  • footnotes_inline - Footnotes merged into body text (via font size/position heuristics)
  • split_tables - Tables spanning multiple pages (flag only, no merging)
  • header_footer_bleed - Headers/footers mixed into content (via PyMuPDF4LLM Layout)
  • diagram_heavy - Many embedded diagrams/charts

Network (1/3 detected locally):

  • archive_org_wrap - Wayback Machine URL wrapper (detected via URL pattern)
  • auth_required - Marketing platform cookie gates (network-level, not detectable locally)
  • access_restricted - Government/defense access controls (network-level, not detectable locally)

Workflow Integration

When memory or extractor agent reports failures:

  1. Collect failed URLs in a text file
  2. Run batch analysis: ./run.sh batch failed_urls.txt
  3. Review pattern distribution in output
  4. Generate combined stress test: ./run.sh combine stress_test.pdf
  5. Add stress test to extractor's regression suite
  6. New patterns get added to fixture-tricky for future testing

Data Storage

All data is stored in ~/.pi/debug-pdf/:

  • sessions/ - Individual analysis session JSON files
  • fixtures/ - Generated reproduction PDFs
  • last_analysis.json - Quick reference to most recent analysis

Dependencies

  • pymupdf (fitz) - PDF structure analysis
  • pymupdf4llm - ML-based layout detection for header/footer bleed
  • httpx - HTTP downloads with redirect handling
  • typer - CLI interface
  • loguru - Logging

Sibling skills used:

  • fetcher - Robust URL downloading with Playwright support
  • fixture-tricky - Adversarial PDF generation
  • extractor - Verification of generated fixtures
  • agent-inbox - Cross-agent notifications

Testing

bash
# Run test suite (24 tests)
python -m pytest tests/test_debug_pdf.py -v

# Generate test fixtures only
python tests/test_debug_pdf.py

Test coverage includes:

  • URL validation (security hardening)
  • Wayback URL detection and extraction
  • Multi-column layout detection
  • Header/footer bleed detection
  • Split table detection
  • Footnote detection
  • Full PDF analysis integration

Sanity Check

bash
./sanity.sh

Verifies:

  • Python dependencies installed
  • Sibling skills available
  • Data directory accessible
  • CLI commands functional

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results