Agent skill
pdf-why-use-pdf-large-reader
Sub-skill of pdf: Why Use PDF-Large-Reader? (+8).
Why Use PDF-Large-Reader?
- Memory-efficient - Handles 100MB+ PDFs without memory issues
- Robust table extraction - Handles irregular tables with column count normalization
- Multiple output formats - Generator (streaming), List, or Plain Text
- Automatic strategy selection - Intelligent chunk size calculation
- Complete extraction - Text, images, tables, and metadata in one pass
- High test coverage - 93.58% coverage with 215 tests
Installation
# From the pdf-large-reader repository
cd /mnt/github/workspace-hub/pdf-large-reader
pip install -e .
# Or with extras
pip install -e ".[dev,progress]"
Quick Start
from pdf_large_reader import process_large_pdf, extract_text_only, extract_everything
# Simple text extraction
text = extract_text_only("large_document.pdf")
print(text)
# Process with automatic strategy selection
pages = process_large_pdf(
    "large_document.pdf",
    output_format="list",
    extract_images=True,
    extract_tables=True
)
# Memory-efficient streaming for very large files
for page in process_large_pdf("huge_file.pdf", output_format="generator"):
    print(f"Page {page.page_number}: {len(page.text)} characters")
Robust Table Extraction
NEW: Column Count Normalization (v1.3.0+)
Table extraction now handles irregular tables whose rows have different column counts:
from pdf_large_reader import extract_everything
# Extract everything including tables with robust error handling
pages = extract_everything("technical_standard.pdf")
for page in pages:
    if 'tables' in page.metadata:
        tables = page.metadata['tables']
        print(f"Page {page.page_number}: Found {len(tables)} tables")
        for i, table_df in enumerate(tables):
            print(f" Table {i+1}: {table_df.shape[0]} rows x {table_df.shape[1]} cols")
            print(table_df.head())
How It Works:
- Detects table-like structures from text positioning
- Normalizes column counts across all rows
- Pads short rows with empty strings
- Gracefully handles malformed tables with try-except
- Logs warnings instead of crashing
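The normalization steps listed above can be sketched in plain Python. This is a minimal illustration of the idea, not the library's actual implementation: `normalize_rows` and `safe_normalize` are hypothetical helper names, and the sample rows are made up for the example.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("table_sketch")


def normalize_rows(rows: List[list]) -> List[List[str]]:
    """Pad every row with empty strings up to the width of the widest row."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    return [
        ["" if cell is None else str(cell) for cell in row] + [""] * (width - len(row))
        for row in rows
    ]


def safe_normalize(rows) -> Optional[List[List[str]]]:
    """Return the normalized table, or None (after logging a warning) if it is malformed."""
    try:
        return normalize_rows(rows)
    except (TypeError, ValueError) as exc:
        logger.warning("Skipping malformed table: %s", exc)
        return None


# Illustrative input: the second row is one cell short.
raw_table = [
    ["Item", "Qty", "Unit"],
    ["bolt", "4"],
    ["plate", "2", "ea"],
]
table = safe_normalize(raw_table)

# A "row" that is not a sequence raises TypeError inside normalize_rows,
# which is logged as a warning instead of propagating.
malformed = safe_normalize([["ok"], 42])
```

Once every row has the same length, the rows can be handed to a tabular container such as a DataFrame without a shape mismatch.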
Typical Performance:
- API Std 650 (28 MB, 461 pages): 14,648 chars/sec, 5.18 pages/sec
- API RP 579 (41 MB, 966 pages): 2,090 chars/sec, 8.48 pages/sec
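As a rough reading aid, the page counts and pages/sec rates above imply an approximate wall-clock time per document (pages divided by pages per second). The sketch below is only arithmetic on the quoted figures, not a measured result, and `estimated_seconds` is a hypothetical helper.

```python
def estimated_seconds(pages: int, pages_per_sec: float) -> float:
    """Elapsed time implied by a page count and a pages/sec rate."""
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")
    return pages / pages_per_sec


# (pages, pages/sec) copied from the two entries above.
runs = {
    "API Std 650": (461, 5.18),
    "API RP 579": (966, 8.48),
}
estimates = {name: estimated_seconds(p, rate) for name, (p, rate) in runs.items()}
for name, seconds in estimates.items():
    print(f"{name}: about {seconds:.0f} s")
```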
Command Line Usage
# Extract text from PDF
pdf-large-reader document.pdf
# Save to file
pdf-large-reader document.pdf --output result.txt
# Extract with images and tables
pdf-large-reader document.pdf --extract-images --extract-tables
# Use generator format for large files
pdf-large-reader huge.pdf --output-format generator
# Verbose output
pdf-large-reader document.pdf --verbose
API Reference
# Main entry point with automatic strategy
process_large_pdf(
    pdf_path,
    output_format="generator",  # "generator" (default), "list", or "text"
    extract_images=False,       # Extract images
    extract_tables=False,       # Extract tables with normalization
    chunk_size=None,            # Auto-calculated if None
    fallback_api_key=None,      # OpenAI API key for complex pages
    fallback_model="gpt-4.1",   # Model for fallback extraction
    progress_callback=None,     # Progress tracking function
    auto_strategy=True          # Enable automatic strategy selection
)
# Quick text extraction
extract_text_only(pdf_path) -> str
# Extract with images
extract_pages_with_images(pdf_path) -> List[PDFPage]
# Extract with tables
extract_pages_with_tables(pdf_path) -> List[PDFPage]
# Extract everything
extract_everything(pdf_path) -> List[PDFPage]
PDFPage Data Structure
from dataclasses import dataclass
from typing import List

@dataclass
class PDFPage:
    page_number: int    # Page number (1-indexed)
    text: str           # Extracted text from page
    images: List[dict]  # Extracted images with metadata
    metadata: dict      # Page metadata including tables
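To show how these fields might be consumed, here is a small sketch that builds a per-page summary. It uses a local stand-in for `PDFPage` so it runs without the library installed. The default values on `images` and `metadata`, the `summarize` helper, and the sample pages are assumptions made for the example. The `'tables'` key follows the table-extraction example earlier in this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PDFPage:
    """Local stand-in mirroring the fields listed above (not imported from the library)."""
    page_number: int
    text: str
    images: List[dict] = field(default_factory=list)   # default is an assumption
    metadata: dict = field(default_factory=dict)        # default is an assumption


def summarize(page: PDFPage) -> dict:
    """Collect per-page counts; a missing 'tables' key counts as zero tables."""
    return {
        "page": page.page_number,
        "chars": len(page.text),
        "images": len(page.images),
        "tables": len(page.metadata.get("tables", [])),
    }


# Made-up pages for illustration only.
pages = [
    PDFPage(page_number=1, text="Scope and definitions"),
    PDFPage(
        page_number=2,
        text="Table 1",
        images=[{"index": 0}],
        metadata={"tables": [[["a", "b"]]]},
    ),
]
summaries = [summarize(p) for p in pages]
```

Using `metadata.get("tables", [])` avoids a `KeyError` on pages where no tables were found, which matches the `'tables' in page.metadata` check used earlier.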
Performance Benchmarks
Tested on Ubuntu 22.04, Python 3.11, 16 GB RAM:
| File Size | Pages | Time | Memory | Strategy |
|---|---|---|---|---|
| 5 MB | 10 | < 5s | ~50 MB | batch_all |
| 50 MB | 100 | < 30s | ~150 MB | chunked |
| 100 MB | 500 | < 60s | ~200 MB | stream_pages |
| 200 MB | 1000 | < 2min | ~250 MB | stream_pages |
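
The Strategy column suggests that the strategy is chosen from the file size. The sketch below shows one way such a selector could look. The two cut-offs (10 MB and 100 MB) are assumptions picked only so that the four benchmark rows map to the strategies shown; the library's real thresholds and selection logic are not stated here and may also depend on page count or available memory. `choose_strategy` is a hypothetical name.

```python
MB = 1024 * 1024

# Hypothetical cut-offs, chosen only to be consistent with the benchmark table.
BATCH_ALL_MAX_BYTES = 10 * MB
CHUNKED_MAX_BYTES = 100 * MB


def choose_strategy(file_size_bytes: int) -> str:
    """Map a file size to one of the strategy names from the benchmark table."""
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if file_size_bytes < BATCH_ALL_MAX_BYTES:
        return "batch_all"      # small file: read everything at once
    if file_size_bytes < CHUNKED_MAX_BYTES:
        return "chunked"        # medium file: process groups of pages
    return "stream_pages"       # large file: yield one page at a time


# The four file sizes from the benchmark table.
selected = {size: choose_strategy(size * MB) for size in (5, 50, 100, 200)}
```

In practice the size would come from something like `os.path.getsize(pdf_path)` before any pages are read.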
Real-World Validation
Tested with actual API standards:
- ✅ API RP 579 (2000) - 41 MB, 966 pages
- ✅ API Std 650 (2001) - 28 MB, 461 pages
- ✅ All extraction methods working (text, auto strategy, generator, complete)
- ✅ Table extraction with column normalization
- ✅ Image extraction (461-966 images per document)