Agent skills
pdf-text-extractor-readability...

Agent skill

pdf-text-extractor-readability-classification

Sub-skill of pdf-text-extractor: Readability Classification.

Stars 4

Forks 4

Install this agent skill to your Project

npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/data/documents/pdf/text-extractor/readability-classification

SKILL.md

Readability Classification

Before extracting text from a large PDF collection, classify each PDF's readability using enrich-readability.py. This determines which extraction strategy to use:

Classification	Meaning	Extraction strategy
`machine`	Text layer present, directly extractable	pdfplumber / PyMuPDF
`ocr-needed`	Scanned image, no text layer	tesseract / doctr / azure-doc-intelligence
`mixed`	Some pages machine-readable, some scanned	Hybrid — extract text pages, OCR image pages
`error`	Corrupted or unreadable	Skip; log for manual review

Key finding: 27-30% of project PDFs are scanned with no text layer. Attempting direct text extraction on these returns empty strings — always classify first.

Final Corpus State (WRK-1277, 2026-03-17)

Classification	Count	Percentage
native	623,455	60.3%
machine	278,899	27.0%
ocr-needed	92,042	8.9%
missing	27,476	2.7%
error	6,221	0.6%
mixed	5,246	0.5%
Total classified	1,033,933	96.7%

Error reduction: 296,626 → 6,221 (97.9% recovery). Remaining errors are genuine edge cases (corrupt PDFs, missing files, extremely complex documents).

Classification Method

Use pdftotext (poppler) for batch classification — not pdfplumber:

bash

# Classify all PDFs with parallel workers (resume-safe)
uv run --no-project python scripts/data/document-index/enrich-readability.py \
    --workers 10 --resume

Use --workers 10 for bulk enrichment to parallelize across CPU cores. The --resume flag skips already-classified entries, making it safe to restart after interruption.

WARNING (WRK-1277): The original enrich-readability.py used pdfplumber in ProcessPoolExecutor — this hung in D-state on NTFS/NFS mounts. The proven pattern is pdftotext via subprocess.run(timeout=30) with 8 workers (see pdf/pdftotext-poppler sub-skill for code). Throughput: ~49 files/sec vs ~1.3 with pdfplumber.

Maintainer

vamseeachanta Core maintainer

Source details

Full Name: vamseeachanta/workspace-hub
Branch: main
Path in repo: .claude/skills/data/documents/pdf/text-extractor/readability-classification

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

vamseeachanta/workspace-hub

gsd-complete-milestone

Archive completed milestone and prepare for next version

4 4

Explore

vamseeachanta/workspace-hub

gsd-reapply-patches

Reapply local modifications after a GSD update

4 4

Explore

vamseeachanta/workspace-hub

gsd-verify-work

Validate built features through conversational UAT

4 4

Explore

vamseeachanta/workspace-hub

gsd-thread

Manage persistent context threads for cross-session work

4 4

Explore

vamseeachanta/workspace-hub

clinical-trial-protocol

Generate clinical trial protocols for medical devices or drugs through a modular, waypoint-based architecture with research-only and full protocol modes.

4 4

Explore

vamseeachanta/workspace-hub

single-cell-rna-qc

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations.

4 4

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Readability Classification

Readability Classification

Final Corpus State (WRK-1277, 2026-03-17)

Classification Method

Recommended Agent Skills

gsd-complete-milestone

gsd-reapply-patches

gsd-verify-work

gsd-thread

clinical-trial-protocol

single-cell-rna-qc