Agent skill
large-document-processing
Process large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Use when working with complex formatted documents, multi-level hierarchies, or when you need to extract structured data from large files like PDFs, DOCX, or text files.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/large-document-processing
SKILL.md
Large Document Processing
Overview
This skill processes large documents (200+ pages) while preserving their structure, parsing content intelligently, and keeping memory use low through chunked processing. It is designed for documents with complex formatting, hierarchical structures, and multi-level indentation.
Capabilities
- Multi-format Support: DOCX, PDF, and text files
- Structure Preservation: Maintains document hierarchy, indentation, and formatting
- Memory Efficiency: Chunked processing to handle very large documents
- Intelligent Parsing: Recognizes headings, lists, dictionary entries, and semantic boundaries
- Progress Tracking: Real-time processing status and error recovery
- Metadata Extraction: Comprehensive document analysis and statistics
Core Components
1. Advanced Document Parser
Parse complex document structures while preserving formatting and hierarchy.
Key Features:
- Hierarchical structure detection (levels 1-10)
- Formatting preservation (bold, italic, fonts, sizes)
- Page-by-page processing for memory efficiency
- Intelligent content classification
- Multi-language support with accent character handling
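To make the hierarchy detection idea concrete, here is a minimal sketch for plain-text input. It is not the skill's parser: the function name `detect_level`, the `indent_width` parameter, and the two heuristics (outline numbering first, indentation as a fallback) are assumptions chosen for illustration.

```python
import re

MAX_LEVEL = 10  # the skill advertises hierarchy levels 1-10

# Matches outline numbers such as "1.", "2.3" or "4.1.2." at the start of a line.
_NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def detect_level(line: str, indent_width: int = 4) -> int:
    """Guess a hierarchy level (1..MAX_LEVEL) for one line of plain text.

    Hypothetical helper: numbered headings win, indentation is the fallback.
    """
    expanded = line.expandtabs(indent_width)
    stripped = expanded.lstrip(" ")
    match = _NUMBERED.match(stripped)
    if match:
        # "2.3.1" has three numeric parts, so it sits at level 3.
        level = match.group(1).count(".") + 1
    else:
        # Each full indent step is treated as one level deeper.
        indent = len(expanded) - len(stripped)
        level = indent // indent_width + 1
    return max(1, min(level, MAX_LEVEL))
```

A heuristic like this will misread lines that merely start with a number, which is one reason the Best Practices section recommends verifying hierarchy detection accuracy.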
2. Implementation Pattern
```python
# Relative import: this assumes the code runs inside the skill's own package.
from .large_document_processor import LargeDocumentProcessor, ProcessingConfig

# Configure processing
config = ProcessingConfig(
    chunk_size_pages=50,
    parallel_workers=4,
    preserve_formatting=True,
)

# Initialize processor
processor = LargeDocumentProcessor(config)

# Process document
results = processor.process_large_document(
    input_file="large_document.docx",
    output_dir="output/processed",
)
```
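The `chunk_size_pages` setting suggests that the document is split into page ranges that are processed one at a time. The skill's internals are not shown here, so the following is only a sketch of that idea; `page_ranges` is a hypothetical helper, and zero-based, half-open ranges are an assumption.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, chunk_size_pages: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page ranges that cover total_pages.

    Hypothetical helper: pages are zero-based and `end` is exclusive.
    """
    if chunk_size_pages <= 0:
        raise ValueError("chunk_size_pages must be positive")
    for start in range(0, total_pages, chunk_size_pages):
        yield start, min(start + chunk_size_pages, total_pages)


# A 230-page document with the default chunk size yields five ranges,
# the last one shorter than the rest.
for start, end in page_ranges(230):
    print(f"pages {start}..{end - 1}")
```

Because the ranges come from a generator, only one range's worth of pages needs to be held in memory at a time.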
3. Intelligent Text Chunking
```python
from .intelligent_chunker import IntelligentTextChunker, ChunkType

chunker = IntelligentTextChunker(
    max_chunk_size=1024,
    overlap_ratio=0.15,
    preserve_sentences=True,
)

# `text` is the document text extracted earlier, as a single string
chunks = chunker.chunk_document(text, ChunkType.SEMANTIC)
```
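The parameters above (`max_chunk_size`, `overlap_ratio`, `preserve_sentences`) can be illustrated with a small standalone chunker. This is a sketch, not the `IntelligentTextChunker` implementation: it assumes sizes are measured in characters, splits sentences with a simple regular expression, and the name `chunk_sentences` is hypothetical.

```python
import re
from typing import List


def chunk_sentences(text: str, max_chunk_size: int = 1024,
                    overlap_ratio: float = 0.15) -> List[str]:
    """Split text into chunks of whole sentences with trailing overlap.

    Assumption: sizes are in characters. A single sentence longer than
    max_chunk_size is kept whole rather than cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    overlap_budget = int(max_chunk_size * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences into the next chunk, up to the budget.
            carried: List[str] = []
            size = 0
            for prev in reversed(current):
                if size + len(prev) > overlap_budget:
                    break
                carried.insert(0, prev)
                size += len(prev) + 1
            # Drop the overlap if it would push the new chunk over the limit.
            if len(" ".join(carried + [sentence])) > max_chunk_size:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlap is carried in whole sentences, so the actual overlap is at most `overlap_ratio` of the chunk size and may be smaller, or zero when the last sentence alone exceeds the budget.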
Output Formats
- Structured JSON: Complete document hierarchy and metadata
- Plain text: Clean extracted text with optional formatting markers
- Chunked data: AI-ready text segments with overlap and metadata
- Statistics report: Processing metrics and quality analysis
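The exact schema of these outputs is not documented here. As an illustration of what "chunked data with overlap and metadata" could look like when serialized, the sketch below uses a hypothetical `ChunkRecord` dataclass; every field name is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ChunkRecord:
    """Hypothetical record for one text chunk; not the skill's actual schema."""
    chunk_id: int
    text: str
    start_page: int
    end_page: int
    overlap_chars: int = 0
    metadata: Dict[str, str] = field(default_factory=dict)


def chunks_to_json(source: str, records: List[ChunkRecord]) -> str:
    """Serialize chunk records to a JSON document."""
    payload = {
        "source": source,
        "chunk_count": len(records),
        "chunks": [asdict(record) for record in records],
    }
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(payload, ensure_ascii=False, indent=2)
```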
Best Practices
- Memory Management: Use chunked processing for documents larger than 100 MB
- Parallel Processing: Leverage multiple workers for batch operations
- Structure Validation: Verify hierarchy detection accuracy
- Progress Tracking: Provide user feedback for long-running operations
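Several of these practices can be combined in one batch loop. The following sketch uses only the standard library and is independent of the skill's own processor; `process_batch`, `worker`, and `on_progress` are hypothetical names. It runs a caller-supplied function over many files in parallel, reports progress after each file, and records failures instead of aborting the batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def process_batch(
    paths: Iterable[str],
    worker: Callable[[str], Any],
    parallel_workers: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run worker(path) for each path in parallel.

    Returns (results, errors), both keyed by path. One failing file does
    not stop the others.
    """
    path_list = list(paths)
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        futures = {pool.submit(worker, path): path for path in path_list}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going and report the failure later
                errors[path] = exc
            if on_progress is not None:
                on_progress(done, len(path_list))
    return results, errors
```

Threads suit I/O-bound work such as reading files. For CPU-bound parsing, `concurrent.futures.ProcessPoolExecutor` exposes the same interface and may scale better.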
Dependencies
- python-docx: DOCX file processing
- PyMuPDF: Advanced PDF processing
- Pillow: Image processing for embedded content
- pathlib: Cross-platform path handling (part of the Python standard library)
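Before running the skill, it can help to check that the third-party packages are importable. The sketch below is an optional pre-flight check and not part of the skill; `missing_dependencies` is a hypothetical helper. It relies on the fact that each package installs under a different import name: `docx` for python-docx, `fitz` for PyMuPDF, and `PIL` for Pillow.

```python
from importlib.util import find_spec
from typing import Dict, List

# Maps the pip package name to the module name it installs.
REQUIRED_MODULES: Dict[str, str] = {
    "python-docx": "docx",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
}


def missing_dependencies(required: Dict[str, str] = REQUIRED_MODULES) -> List[str]:
    """Return the pip names of packages whose modules cannot be found."""
    return [package for package, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All dependencies are available.")
```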