Agent skill

debug-pdf

Automated PDF failure analysis and fixture generation. Takes failed PDF URLs, identifies breaking patterns, and generates minimal fixtures via fixture-tricky for regression testing. Supports batch mode and combined stress test generation.


Install this agent skill in your project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/debug-pdf

Metadata

Additional technical details for this skill

short description: Failure-to-fixture automation for PDF extractors

SKILL.md

Debug PDF Skill

Automate the lifecycle of an extraction failure: Failure -> Analysis -> Fixture -> Test

Why This Exists

Extractors (Marker, Surya, Camelot) break on specific PDF patterns (scanned pages, TOC dots, cursed fonts, watermarks). Manually reproducing these bugs is slow. /debug-pdf fast-tracks this by:

Downloading the failed artifact
Identifying structural "traps" (TOC dots, watermarks, ligatures, etc.)
Generating minimal reproduction fixtures using fixture-tricky
Combining multiple failures into a single stress test PDF

Quick Start

bash

# Analyze a single failed URL
./run.sh analyze "https://example.com/broken.pdf"

# Process multiple failures in batch
./run.sh batch failed_urls.txt --output report.json

# Combine all fixtures into one stress test PDF
./run.sh combine stress_test.pdf --max-pages 20

# List known failure patterns
./run.sh list-patterns

# Check session status
./run.sh status

Commands

analyze

Analyze a single PDF URL and optionally generate a reproduction fixture.

bash

./run.sh analyze "https://example.com/broken.pdf"
./run.sh analyze "https://example.com/broken.pdf" --no-repro
./run.sh analyze "https://example.com/broken.pdf" --send-inbox

batch

Process multiple URLs from a file (one URL per line).

bash

# Create URL file
echo "https://example.com/doc1.pdf" > failed.txt
echo "https://example.com/doc2.pdf" >> failed.txt

# Run batch analysis
./run.sh batch failed.txt --output analysis.json --send-inbox
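
The one-URL-per-line input format could be read along these lines. This is a minimal sketch, not the skill's actual parser: `read_url_file` is a hypothetical helper, and skipping blank lines, skipping `#` comment lines, and dropping duplicates are all assumptions.

```python
from pathlib import Path


def read_url_file(path: str) -> list[str]:
    """Return URLs from a one-URL-per-line file, preserving first-seen order.

    Assumptions (not confirmed by the skill): blank lines and lines that
    start with '#' are ignored, and duplicate URLs are dropped.
    """
    urls: list[str] = []
    seen: set[str] = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls
```

Calling `read_url_file("failed.txt")` on the file created above would return the two URLs in the order they were written.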

combine [output.pdf]

Merge all generated fixtures into a single stress test PDF.

bash

./run.sh combine stress_test.pdf --max-pages 15

list-patterns

Display all known failure patterns and their descriptions.

status

Show current debug session status and fixture count.

Detected Patterns (14/16 = 88%)

Structural (4/4 detected):

scanned_no_ocr - Scanned image PDF without text layer
sparse_content_slides - Slide deck with minimal text per page
multi_column - Complex multi-column layouts (via text block analysis)
watermarks - Text obscured by watermark overlays
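
To illustrate how structural checks like these might work, here is a rough sketch of three of them on plain data. Everything in it is an assumption: the names `PageStats`, `classify_structure`, and `looks_multi_column` are hypothetical, the thresholds are arbitrary, and the skill's real heuristics may differ. In practice the inputs could come from PyMuPDF, for example the length of `page.get_text()`, the number of entries in `page.get_images()`, and the bounding boxes from `page.get_text("blocks")`.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    char_count: int   # characters found in the page's text layer
    image_count: int  # embedded images on the page


def classify_structure(pages: list[PageStats],
                       sparse_chars: int = 200,
                       majority: float = 0.8) -> set[str]:
    """Flag scanned_no_ocr / sparse_content_slides from per-page stats.

    Thresholds are illustrative assumptions, not the skill's values.
    """
    found: set[str] = set()
    if not pages:
        return found
    n = len(pages)
    # A page with images but no text layer at all looks like a raw scan.
    scanned = sum(1 for p in pages if p.char_count == 0 and p.image_count > 0)
    # A page with some text, but very little, looks like a slide.
    sparse = sum(1 for p in pages if 0 < p.char_count < sparse_chars)
    if scanned / n >= majority:
        found.add("scanned_no_ocr")
    if sparse / n >= majority:
        found.add("sparse_content_slides")
    return found


def looks_multi_column(blocks: list[tuple[float, float, float, float]],
                       min_pairs: int = 2) -> bool:
    """Return True if enough text blocks sit side by side.

    Each block is (x0, y0, x1, y1). Two blocks count as side by side when
    they do not overlap horizontally but do overlap vertically.
    """
    pairs = 0
    for i, a in enumerate(blocks):
        for b in blocks[i + 1:]:
            horizontally_apart = a[2] <= b[0] or b[2] <= a[0]
            vertically_overlapping = min(a[3], b[3]) - max(a[1], b[1]) > 0
            if horizontally_apart and vertically_overlapping:
                pairs += 1
    return pairs >= min_pairs
```

Working on plain numbers rather than on a PDF object keeps the heuristics easy to unit test without any fixture files.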

Encoding (5/5 detected):

toc_noise - Table of contents with dotted leaders
metadata_artifacts - Print metadata (Jkt/PO/Frm) in content
invisible_chars - Zero-width spaces, direction markers
curly_quotes - Windows-1252 encoded smart quotes
ligatures - fi/fl/ff ligature characters
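
Most of the encoding patterns above can be spotted in extracted text with character sets and regular expressions. The following is a sketch under assumptions: `detect_encoding_traps` is a hypothetical function, the regexes are illustrative guesses at what each pattern looks like, and curly quotes are matched as Unicode code points only.

```python
import re

# Dotted leader: a run of 4+ dots (optionally space-separated) ending in a page number.
TOC_LEADER_RE = re.compile(r"(?:\.\s?){4,}\s*\d+\s*$", re.MULTILINE)
# Print-shop metadata such as "Jkt 000000 PO 00000 Frm 00001".
PRINT_METADATA_RE = re.compile(r"\bJkt\s+\d+\s+PO\s+\d+\s+Frm\s+\d+")

# Zero-width characters, direction markers, and the byte-order mark.
INVISIBLE_CHARS = {"\u200b", "\u200c", "\u200d", "\u200e", "\u200f", "\ufeff"}
# Precomposed ff, fi, fl, ffi, ffl ligatures.
LIGATURES = {"\ufb00", "\ufb01", "\ufb02", "\ufb03", "\ufb04"}
# Left/right single and double curly quotes.
CURLY_QUOTES = {"\u2018", "\u2019", "\u201c", "\u201d"}


def detect_encoding_traps(text: str) -> set[str]:
    """Return the names of encoding patterns present in the extracted text."""
    found: set[str] = set()
    if TOC_LEADER_RE.search(text):
        found.add("toc_noise")
    if PRINT_METADATA_RE.search(text):
        found.add("metadata_artifacts")
    chars = set(text)
    if chars & INVISIBLE_CHARS:
        found.add("invisible_chars")
    if chars & CURLY_QUOTES:
        found.add("curly_quotes")
    if chars & LIGATURES:
        found.add("ligatures")
    return found
```

For example, a line such as `Chapter 1 ........ 3` would be reported as `toc_noise`, while ordinary sentences ending in a single period would not.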

Layout (4/4 detected):

footnotes_inline - Footnotes merged into body text (via font size/position heuristics)
split_tables - Tables spanning multiple pages (flag only, no merging)
header_footer_bleed - Headers/footers mixed into content (via PyMuPDF4LLM Layout)
diagram_heavy - Many embedded diagrams/charts
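
A font size and position heuristic for footnotes, as mentioned for footnotes_inline, might look roughly like this. It is a sketch under assumptions: `Span` and `find_footnote_spans` are hypothetical names, and the ratio and page-fraction thresholds are illustrative rather than the skill's values.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    text: str
    font_size: float  # in points
    y_top: float      # distance from the top of the page, in points


def find_footnote_spans(spans: list[Span],
                        page_height: float,
                        size_ratio: float = 0.85,
                        bottom_fraction: float = 0.25) -> list[Span]:
    """Return spans that look like footnotes.

    A span is a candidate when its font is clearly smaller than the
    page's median (body) font size AND it sits in the bottom part of
    the page. Both thresholds are assumptions.
    """
    if not spans:
        return []
    body_size = median(s.font_size for s in spans)
    cutoff_y = page_height * (1 - bottom_fraction)
    return [
        s for s in spans
        if s.font_size <= body_size * size_ratio and s.y_top >= cutoff_y
    ]
```

Requiring both conditions avoids flagging small superscript markers in the body, which are small but not near the bottom of the page.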

Network (1/3 detected locally):

archive_org_wrap - Wayback Machine URL wrapper (detected via URL pattern)
auth_required - Marketing platform cookie gates (network-level, not detectable locally)
access_restricted - Government/defense access controls (network-level, not detectable locally)
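
Detecting archive_org_wrap from the URL alone can be done with a pattern match on the Wayback Machine capture format, `https://web.archive.org/web/<timestamp>/<original URL>`. The sketch below assumes that format; `unwrap_wayback` is a hypothetical helper, not the skill's function.

```python
import re
from typing import Optional

# Capture URLs carry a numeric timestamp (up to 14 digits), optionally
# followed by a two-letter modifier such as "if_" or "id_".
_WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/(?P<orig>.+)$"
)


def unwrap_wayback(url: str) -> Optional[str]:
    """Return the original URL inside a Wayback wrapper, or None if not wrapped."""
    match = _WAYBACK_RE.match(url)
    return match.group("orig") if match else None
```

A non-`None` result both flags the pattern and yields the original URL, which can then be analyzed or fetched directly.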

Workflow Integration

When the memory or extractor agent reports failures:

Collect failed URLs in a text file
Run batch analysis: ./run.sh batch failed_urls.txt
Review pattern distribution in output
Generate combined stress test: ./run.sh combine stress_test.pdf
Add stress test to extractor's regression suite
Add any newly discovered patterns to fixture-tricky for future testing
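
Reviewing the pattern distribution could be as simple as counting pattern names across the batch report. The report schema is not documented here, so this sketch assumes a JSON list of objects that each carry a `patterns` list. That shape and the `pattern_distribution` helper are both hypothetical.

```python
import json
from collections import Counter


def pattern_distribution(report_json: str) -> Counter:
    """Count how many analyzed PDFs exhibit each pattern.

    Assumed (hypothetical) report shape:
        [{"url": "...", "patterns": ["toc_noise", ...]}, ...]
    """
    counts: Counter = Counter()
    for entry in json.loads(report_json):
        # Use a set so a pattern listed twice for one PDF counts once.
        counts.update(set(entry.get("patterns", [])))
    return counts
```

`pattern_distribution(...).most_common()` then lists the patterns that break the most documents first, which helps decide what to include in the combined stress test.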

Data Storage

All data is stored in ~/.pi/debug-pdf/:

sessions/ - Individual analysis session JSON files
fixtures/ - Generated reproduction PDFs
last_analysis.json - Quick reference to most recent analysis
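
The layout above could be written with a few lines of standard-library code. This is a sketch only: `save_session` is a hypothetical helper, and the timestamp-based session file name is an assumption, since the real naming scheme is not stated.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_session(analysis: dict, base_dir: Path) -> Path:
    """Write one session file and refresh last_analysis.json under base_dir."""
    sessions = base_dir / "sessions"
    sessions.mkdir(parents=True, exist_ok=True)
    (base_dir / "fixtures").mkdir(exist_ok=True)

    # Assumed naming scheme: a UTC timestamp with microseconds.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    session_path = sessions / f"{stamp}.json"

    payload = json.dumps(analysis, indent=2)
    session_path.write_text(payload, encoding="utf-8")
    # Keep a copy of the most recent analysis for quick lookup.
    (base_dir / "last_analysis.json").write_text(payload, encoding="utf-8")
    return session_path
```

Passing `Path.home() / ".pi" / "debug-pdf"` as `base_dir` would target the directory described above, while a temporary directory is convenient in tests.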

Dependencies

pymupdf (fitz) - PDF structure analysis
pymupdf4llm - ML-based layout detection for header/footer bleed
httpx - HTTP downloads with redirect handling
typer - CLI interface
loguru - Logging

Sibling skills used:

fetcher - Robust URL downloading with Playwright support
fixture-tricky - Adversarial PDF generation
extractor - Verification of generated fixtures
agent-inbox - Cross-agent notifications

Testing

bash

# Run test suite (24 tests)
python -m pytest tests/test_debug_pdf.py -v

# Generate test fixtures only
python tests/test_debug_pdf.py

Test coverage includes:

URL validation (security hardening)
Wayback URL detection and extraction
Multi-column layout detection
Header/footer bleed detection
Split table detection
Footnote detection
Full PDF analysis integration
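
As an illustration of what the URL validation tests might exercise, here is a sketch of a hardening check for download targets. The rules are assumptions rather than the skill's actual policy, and `is_safe_pdf_url` is a hypothetical name: it allows only http and https, and rejects localhost and literal IP addresses that are not publicly routable.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_pdf_url(url: str) -> bool:
    """Return True if the URL is an http(s) URL that does not point at a local host.

    Note: host names are accepted without DNS resolution, so a name that
    resolves to a private address is NOT caught by this sketch.
    """
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False  # malformed URL, e.g. an unbalanced IPv6 bracket
    if parts.scheme not in ("http", "https"):
        return False
    if not host or host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal, so treat it as a DNS name
    # Rejects loopback, private, link-local, and other non-global ranges.
    return ip.is_global
```

Checking literal IPs blocks common server-side request forgery targets such as `127.0.0.1` and the `169.254.169.254` metadata address. A stricter version would also resolve host names before downloading.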

Sanity Check

bash

./sanity.sh

Verifies:

Python dependencies installed
Sibling skills available
Data directory accessible
CLI commands functional
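
The dependency check could be approximated in Python as follows. This is a sketch, not the contents of sanity.sh. The module names are inferred from the Dependencies list (with `fitz` as the import name for pymupdf), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names assumed from the Dependencies section.
REQUIRED_MODULES = ["fitz", "pymupdf4llm", "httpx", "typer", "loguru"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from `missing_modules(REQUIRED_MODULES)` would mean all Python dependencies are importable. `find_spec` locates a module without importing it, so the check stays fast and free of side effects.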

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/debug-pdf
License: MIT License
