pdf-text-extractor

Extract text from PDF files with intelligent chunking and metadata preservation. For batch extraction (1K+ PDFs), use pdftotext (Poppler) via subprocess; see the Tool Selection table in the pdf skill. For single-document quality, use Codex or PyMuPDF. Supports technical documents, standards libraries, research papers, and any other PDF collection.

Install this agent skill in your project

npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/data/documents/pdf/text-extractor

SKILL.md

PDF Text Extractor

Overview

This skill extracts text from PDF files using PyMuPDF (fitz), with intelligent chunking, page tracking, and metadata preservation. It handles large PDF collections with batch processing and error recovery.
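The chunking strategy itself is not spelled out in this section, so the following is only a minimal sketch of one way to chunk text while tracking pages. It assumes fixed-size character windows with overlap; the names `Chunk` and `chunk_pages` and the default sizes are hypothetical, not part of the skill.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Chunk:
    page: int    # 1-based page number the chunk came from
    index: int   # position of the chunk within that page
    text: str


def chunk_pages(pages: List[str], chunk_size: int = 1000, overlap: int = 100) -> Iterator[Chunk]:
    """Split per-page text into overlapping character windows, keeping the page number.

    Hypothetical helper: the window size and overlap are illustrative assumptions.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        for index, start in enumerate(range(0, len(text), step)):
            yield Chunk(page=page_number, index=index, text=text[start:start + chunk_size])
            # Stop once the current window has reached the end of the page text.
            if start + chunk_size >= len(text):
                break
```

With PyMuPDF, the `pages` argument could be built as `[page.get_text() for page in doc]`. Chunks never cross a page boundary here, which keeps the page reference exact at the cost of splitting sentences that span two pages.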

Tool selection (see pdf skill Tool Selection table for full guidance):

Batch (1K+ PDFs): pdftotext (poppler) via subprocess.run(timeout=30) — 37x faster, reliable timeouts
Single doc quality: OpenAI Codex PDF→Markdown (best understanding)
Single doc text: PyMuPDF (fitz) — fast, good API

WARNING (WRK-1277): Do NOT use pdfplumber in multiprocessing pools. It hangs in kernel D-state on NTFS/NFS mounts — uninterruptible by SIGALRM. Use pdftotext via subprocess for all parallel/batch work.

Note: The doc-intelligence pipeline uses pdfplumber for single-document extraction. For bulk extraction across the 1M+ corpus, use pdftotext via subprocess (see pdf/pdftotext-poppler sub-skill for the proven batch pattern).
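The batch guidance above can be sketched as follows. This is not the proven pattern from the pdf/pdftotext-poppler sub-skill, which is not reproduced here; it is a rough illustration that assumes the Poppler `pdftotext` binary is on `PATH`. The function names, the worker count, and the choice of a thread pool are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Optional


def pdftotext_extract(pdf_path: str, timeout: int = 30, binary: str = "pdftotext") -> Optional[str]:
    """Return the extracted text, or None on timeout, missing binary, or non-zero exit."""
    try:
        result = subprocess.run(
            [binary, "-layout", str(pdf_path), "-"],  # "-" sends the text to stdout
            capture_output=True,
            timeout=timeout,  # subprocess.run kills the child when this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


def extract_batch(pdf_paths: Iterable[str], workers: int = 8, **kwargs) -> Dict[str, Optional[str]]:
    """Run one pdftotext subprocess per PDF; threads only wait on the child processes."""
    paths = [str(p) for p in pdf_paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(lambda p: pdftotext_extract(p, **kwargs), paths)
        return dict(zip(paths, texts))
```

Because parsing happens in separate child processes, a PDF that stalls costs one worker at most `timeout` seconds, and failures surface as `None` entries that can be retried or logged afterwards.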

Quick Start

Recommended Approach (with Codex conversion):

python

# 1. Convert PDF to markdown first (see pdf skill)
from pdf_skill import pdf_to_markdown_codex

md_path = pdf_to_markdown_codex("document.pdf")

# 2. Process the markdown
with open(md_path, encoding="utf-8") as f:
    markdown = f.read()
# Work with the structured markdown from here

Direct Extraction (when Codex is not needed):

python

import fitz  # PyMuPDF

# The context manager closes the document even if extraction raises
with fitz.open("document.pdf") as doc:
    for page in doc:
        text = page.get_text()  # plain text of the current page
        print(text)

When to Use

Processing PDF document collections for search indexing
Extracting text from technical standards and specifications
Converting PDF libraries to searchable text databases
Preparing documents for AI/ML processing
Building knowledge bases from PDF archives

Related Skills

knowledge-base-builder - Build searchable database from extracted text
semantic-search-setup - Add vector embeddings for AI search
document-inventory - Catalog documents before extraction

Version History

1.3.0 (2026-03-17): WRK-1277 learnings — pdftotext preferred for batch; D-state/NFS/NTFS warnings; fixed duplicate Sub-Skills sections; updated tool selection guidance
1.2.0 (2026-01-04): Added OpenAI Codex workflow recommendation as preferred approach; updated Quick Start to show Codex-first workflow; added reference to pdf skill for markdown conversion
1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
1.0.0 (2024-10-15): Initial release with PyMuPDF, batch processing, OCR support, metadata extraction

Sub-Skills

Best Practices
Execution Checklist
Error Handling
Metrics
Dependencies (+5)
Features
Readability Classification
Encrypted PDFs (+2)
Example Usage

Maintainer

vamseeachanta Core maintainer

Source details

Full Name: vamseeachanta/workspace-hub
Branch: main
Path in repo: .claude/skills/data/documents/pdf/text-extractor
