Agent skill
Comprehensive PDF manipulation toolkit. For batch or bulk extraction (1K+ PDFs), use pdftotext (poppler) via subprocess; it is the fastest and most reliable option at scale. For single-document understanding, OpenAI Codex PDF-to-Markdown gives the best results. Also supports text and table extraction, PDF creation, merging and splitting, and forms.
Install this agent skill to your Project
npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/data/documents/pdf
SKILL.md
Overview
This skill enables comprehensive PDF operations through Python libraries and command-line tools. Use it for reading, creating, modifying, and analyzing PDF documents.
Quick Start
from pypdf import PdfReader

# Open the document and print the extracted text of every page.
reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
Tool Selection (WRK-1277 + WRK-1302 + WRK-1303 Learnings)
Scenario → Tool Mapping
| Scenario | Tool | Why |
|---|---|---|
| Batch extraction (1K+ PDFs) | pdftotext (poppler) via subprocess | Proven at 297K scale; reliable timeout via SIGTERM; subprocess isolation |
| Single-doc understanding | OpenAI Codex PDF→Markdown | Best quality; too expensive for bulk |
| Single-doc text extraction | PyMuPDF (fitz) | Fast, good API, in-process |
| Readability classification | pypdfium2 | Replaces pdfplumber for page sampling; no D-state hangs; Apache-2.0 license |
| Table extraction | pdfplumber (single doc only) | Best table detection; DO NOT use in multiprocessing pools |
| Structured markdown (tables+equations) | Docling (targeted use only) | MIT license; 1731 table rows from 6 docs; ~310s/doc on CPU |
| LLM/RAG markdown | pymupdf4llm (monitor only) | 0.12s/doc, good markdown; AGPL license blocks adoption |
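
The mapping above can be read as a small decision procedure. The sketch below is only an illustration of that reading: the function name `choose_tool`, its parameters, and the returned labels are hypothetical, and the table does not say how to treat collections between 2 and 999 documents, so this sketch assumes they are handled as sequential single-document work.

```python
def choose_tool(n_docs: int, need_tables: bool = False,
                need_markdown: bool = False, batch_threshold: int = 1000) -> str:
    """Pick a tool label for a scenario, following the mapping table.

    Hypothetical helper: the 1000-document threshold comes from the
    "1K+ PDFs" row; everything else about the signature is an assumption.
    """
    if n_docs >= batch_threshold:
        # Batch scale: subprocess isolation and a reliable timeout win,
        # even if tables or markdown would be nice to have.
        return "pdftotext"
    if need_tables:
        # Best table detection, but single-document use only.
        return "pdfplumber"
    if need_markdown:
        # Best quality, too expensive for bulk work.
        return "Codex PDF->Markdown"
    # Default for plain single-document text extraction.
    return "PyMuPDF"


if __name__ == "__main__":
    print(choose_tool(297_000))              # batch scenario
    print(choose_tool(1, need_tables=True))  # single-document table extraction
```

Batch size is checked first because the table treats batch safety as the overriding constraint: pdfplumber is marked as unsuitable for multiprocessing pools regardless of its table quality.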
Quality & Completeness Index (measured on dev-primary)
Scores cover text completeness (characters extracted relative to the pdftotext baseline, so values above 100% are possible), structure preservation, and batch viability. Based on WRK-1302 (243 PDFs) and WRK-1303 (6 PDFs).
| Tool | Text Completeness | Structure | Tables | Equations | Speed | Batch Safe | License |
|---|---|---|---|---|---|---|---|
| pdftotext (baseline) | 100% | none | none | unicode only | 0.02s | yes | GPL-2 |
| pypdfium2 | 86% | none | none | unicode only | 0.02s | no (thread-unsafe) | Apache-2.0 |
| pdfplumber | ~95% | partial | good (69-93%) | none | 0.10s | no (D-state) | MIT |
| Docling | 117% | full (md) | good (md rows) | unicode+context | 310s | no (CPU bound) | MIT |
| PyMuPDF (fitz) | ~98% | partial | basic | unicode only | 0.01s | yes | AGPL |
| pymupdf4llm | ~100% | full (md) | good (md) | unicode+context | 0.12s | untested | AGPL |
| Codex API | ~100% | full (md) | excellent | LaTeX | ~2s | no (API cost) | proprietary |
Column definitions:
- Text Completeness: chars extracted vs pdftotext baseline (WRK-1302: 243 PDFs, WRK-1303: 6 PDFs)
- Structure: none = raw text | partial = some layout | full = headings, lists, sections
- Tables: none | basic = cell text only | good = rows+cols preserved | excellent = multi-span
- Equations: unicode only = captures symbols | unicode+context = in structured output | LaTeX = formula markup
- Batch Safe: can run in ProcessPoolExecutor on NFS/NTFS without hangs or crashes
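
As a rough illustration of the Text Completeness column, the ratio can be computed from two extracted strings. This is a minimal sketch, not the benchmark's actual code: the function name `text_completeness` is hypothetical, and whether the benchmark counts whitespace or normalizes text before counting is not stated here, so raw character counts are assumed.

```python
def text_completeness(extracted: str, baseline: str) -> float:
    """Characters extracted by a tool as a percentage of the baseline text.

    Assumption: plain len() on the raw strings, with no normalization.
    """
    if not baseline:
        raise ValueError("baseline text is empty; the ratio is undefined")
    return 100.0 * len(extracted) / len(baseline)


if __name__ == "__main__":
    baseline = "x" * 100   # stand-in for pdftotext output
    richer = "x" * 117     # stand-in for a tool that emits extra markup
    print(f"{text_completeness(richer, baseline):.0f}%")
```

A tool that emits markdown structure in addition to the text can exceed 100% under this definition, which is one plausible reading of the Docling figure in the table.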
WARNING: pdfplumber hangs in kernel D-state (disk sleep) on NTFS and NFS mounts. SIGALRM cannot interrupt kernel I/O. Use pdftotext via `subprocess.run(timeout=N)` for any batch/parallel work; the subprocess can be killed reliably on timeout.
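
The pattern recommended in the warning above might look like the following sketch. The helper names `pdftotext_cmd` and `run_extractor` are hypothetical, as is the choice of the `-layout` flag and of returning `None` on failure; the sketch assumes the poppler `pdftotext` binary is on `PATH`.

```python
import subprocess
from typing import List, Optional, Sequence


def pdftotext_cmd(pdf_path: str) -> List[str]:
    """Build a pdftotext command line that writes the text to stdout ("-")."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def run_extractor(cmd: Sequence[str], timeout: float) -> Optional[str]:
    """Run an extractor in a child process and return its stdout as text.

    Returns None if the child exceeds the timeout or exits non-zero.
    Raises FileNotFoundError if the executable is not installed.
    """
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired,
        # so a stuck document cannot stall the parent process.
        result = subprocess.run(cmd, capture_output=True, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    text = run_extractor(pdftotext_cmd("document.pdf"), timeout=30)
    print("failed or timed out" if text is None else text[:200])
```

Keeping the command builder separate from the runner means the timeout handling can be exercised with any short-lived command, without needing a PDF or poppler installed.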
Benchmarks: `scripts/data/doc_intelligence/benchmark_pdf_tools.py` (WRK-1302), `scripts/data/doc_intelligence/benchmark_docling.py` (WRK-1303)
When to Use
- Batch PDF processing - Use pdftotext (poppler) via subprocess for bulk extraction
- Converting PDFs to Markdown - Use OpenAI Codex for intelligent conversion (single docs)
- Extracting text and metadata from PDF files
- Merging multiple PDFs into a single document
- Splitting large PDFs into individual pages
- Adding watermarks or annotations to PDFs
- Password-protecting or decrypting PDFs
- Extracting images from PDF documents
- OCR processing for scanned documents
- Creating new PDFs with reportlab
- Extracting tables from structured PDFs
Version History
- 1.2.2 (2026-01-04): Fixed P2 issue - added `parents=True` to all `mkdir()` calls to handle nested output paths; prevents FileNotFoundError when creating directories with non-existent parent paths
- 1.2.1 (2026-01-04): Fixed CLI tool missing imports - added complete standalone script with all required imports (openai, pypdf, logging) and function definitions; resolved P1 issue from Codex review
- 1.2.0 (2026-01-04): MAJOR UPDATE - Added OpenAI Codex integration for PDF-to-Markdown conversion as recommended first step for all PDF processing; includes batch conversion, chunking for large files, cost-effective options, and complete CLI tool
- 1.1.0 (2026-01-02): Added Quick Start, When to Use, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
- 1.0.0 (2024-10-15): Initial release with pypdf, pdfplumber, reportlab, CLI tools
Sub-Skills
- Why Convert to Markdown First?
- OpenAI Codex Conversion
- pypdf - Core PDF Operations (+2)
- pdftotext (Poppler) (+2)
- Why Use PDF-Large-Reader? (+8)
- OCR for Scanned Documents (+3)
- Execution Checklist
- Common Errors
- Metrics
- Quick Reference
- Dependencies