Agent skill

pdf-text-extractor

Extract text from PDF files with intelligent chunking and metadata preservation. For batch extraction (1K+ PDFs), use pdftotext (poppler) via subprocess — see pdf skill Tool Selection table. For single-doc quality, use Codex or PyMuPDF. Supports technical documents, standards libraries, research papers, or any PDF collection.

Stars 4
Forks 4

Install this agent skill to your Project

npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/data/documents/pdf/text-extractor

SKILL.md

Pdf Text Extractor

Overview

This skill extracts text from PDF files using PyMuPDF (fitz), with intelligent chunking, page tracking, and metadata preservation. Handles large PDF collections with batch processing and error recovery.

Tool selection (see pdf skill Tool Selection table for full guidance):

  • Batch (1K+ PDFs): pdftotext (poppler) via subprocess.run(timeout=30) — 37x faster, reliable timeouts
  • Single doc quality: OpenAI Codex PDF→Markdown (best understanding)
  • Single doc text: PyMuPDF (fitz) — fast, good API

WARNING (WRK-1277): Do NOT use pdfplumber in multiprocessing pools. It hangs in kernel D-state on NTFS/NFS mounts — uninterruptible by SIGALRM. Use pdftotext via subprocess for all parallel/batch work.

Note: The doc-intelligence pipeline uses pdfplumber for single-document extraction. For bulk extraction across the 1M+ corpus, use pdftotext via subprocess (see pdf/pdftotext-poppler sub-skill for the proven batch pattern).

Quick Start

Recommended Approach (with Codex conversion):

python
# 1. Convert PDF to markdown first (see pdf skill)
from pdf_skill import pdf_to_markdown_codex

md_path = pdf_to_markdown_codex("document.pdf")

# 2. Process the markdown
with open(md_path) as f:
    markdown = f.read()
    # Work with structured markdown

Direct Extraction (when Codex not needed):

python
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page in doc:
    text = page.get_text()
    print(text)
doc.close()

When to Use

  • Processing PDF document collections for search indexing
  • Extracting text from technical standards and specifications
  • Converting PDF libraries to searchable text databases
  • Preparing documents for AI/ML processing
  • Building knowledge bases from PDF archives

Related Skills

  • knowledge-base-builder - Build searchable database from extracted text
  • semantic-search-setup - Add vector embeddings for AI search
  • document-inventory - Catalog documents before extraction

Version History

  • 1.3.0 (2026-03-17): WRK-1277 learnings — pdftotext preferred for batch; D-state/NFS/NTFS warnings; fixed duplicate Sub-Skills sections; updated tool selection guidance
  • 1.2.0 (2026-01-04): Added OpenAI Codex workflow recommendation as preferred approach; updated Quick Start to show Codex-first workflow; added reference to pdf skill for markdown conversion
  • 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
  • 1.0.0 (2024-10-15): Initial release with PyMuPDF, batch processing, OCR support, metadata extraction

Sub-Skills

  • Best Practices
  • Execution Checklist
  • Error Handling
  • Metrics
  • Dependencies (+5)
  • Features
  • Readability Classification
  • Encrypted PDFs (+2)
  • Example Usage

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results