Agent skill

large-document-processing

Process large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Use when working with complex formatted documents, multi-level hierarchies, or when you need to extract structured data from large files like PDFs, DOCX, or text files.

Install this agent skill in your project:

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/large-document-processing

SKILL.md

Large Document Processing

Overview

A comprehensive skill for processing large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Designed for documents with complex formatting, hierarchical structures, and multi-level indentation.

Capabilities

  • Multi-format Support: DOCX, PDF, and text files
  • Structure Preservation: Maintains document hierarchy, indentation, and formatting
  • Memory Efficiency: Chunked processing to handle very large documents
  • Intelligent Parsing: Recognizes headings, lists, dictionary entries, and semantic boundaries
  • Progress Tracking: Real-time processing status and error recovery
  • Metadata Extraction: Comprehensive document analysis and statistics

Core Components

1. Advanced Document Parser

Parse complex document structures while preserving formatting and hierarchy.

Key Features:

  • Hierarchical structure detection (levels 1-10)
  • Formatting preservation (bold, italic, fonts, sizes)
  • Page-by-page processing for memory efficiency
  • Intelligent content classification
  • Multi-language support, including correct handling of accented characters
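
The parser's internals are not shown in this skill file. As a rough illustration of hierarchical structure detection on plain text, the sketch below infers a level from leading indentation and caps it at 10 to match the range listed above. The names `detect_levels`, `Line`, and `indent_width` are hypothetical and are not part of the skill's API.

```python
from dataclasses import dataclass

MAX_LEVEL = 10  # mirrors the "levels 1-10" range from the feature list


@dataclass
class Line:
    level: int  # 1 = top level, larger = more deeply nested
    text: str   # line content with surrounding whitespace removed


def detect_levels(text: str, indent_width: int = 2) -> list[Line]:
    """Assign a hierarchy level to each non-blank line from its indentation.

    Hypothetical helper: assumes every nesting step adds `indent_width` spaces.
    """
    result = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines carry no structure
        expanded = raw.expandtabs(indent_width)
        indent = len(expanded) - len(expanded.lstrip(" "))
        level = min(indent // indent_width + 1, MAX_LEVEL)
        result.append(Line(level=level, text=expanded.strip()))
    return result
```

Real DOCX or PDF input would more likely rely on style names or font sizes than on indentation; this only shows the general shape of the idea.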

2. Implementation Pattern

python
# Relative import: works only when this snippet runs inside the skill's package
from .large_document_processor import LargeDocumentProcessor, ProcessingConfig

# Configure processing
config = ProcessingConfig(
    chunk_size_pages=50,
    parallel_workers=4,
    preserve_formatting=True
)

# Initialize processor
processor = LargeDocumentProcessor(config)

# Process document
results = processor.process_large_document(
    input_file="large_document.docx",
    output_dir="output/processed"
)
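
How `LargeDocumentProcessor` splits its work is not documented here. As a minimal sketch of what a setting such as `chunk_size_pages=50` could imply, the generator below yields half-open page ranges that cover a document without loading every page at once. `page_ranges` is a hypothetical helper, not a function provided by the skill.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, chunk_size_pages: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) page index pairs, with end exclusive.

    Hypothetical helper: pages are assumed to be numbered from 0.
    """
    if chunk_size_pages <= 0:
        raise ValueError("chunk_size_pages must be positive")
    for start in range(0, total_pages, chunk_size_pages):
        yield start, min(start + chunk_size_pages, total_pages)


# A 120-page document is covered by three ranges; the last one is shorter.
for start, end in page_ranges(120, chunk_size_pages=50):
    print(f"pages {start}..{end - 1}")
```

Each range can then be parsed and written out before the next one is read, which keeps peak memory proportional to the chunk size rather than the document size.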

3. Intelligent Text Chunking

python
# Relative import: works only when this snippet runs inside the skill's package
from .intelligent_chunker import IntelligentTextChunker, ChunkType

chunker = IntelligentTextChunker(
    max_chunk_size=1024,
    overlap_ratio=0.15,
    preserve_sentences=True
)

chunks = chunker.chunk_document(text, ChunkType.SEMANTIC)
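
The behaviour of `IntelligentTextChunker` is not specified in this file. To illustrate what sentence-preserving chunking with overlap can look like, here is a small standalone sketch. It assumes `max_chunk_size` is measured in characters and that overlap means repeating trailing sentences of one chunk at the start of the next; both are assumptions, and `chunk_sentences` is a hypothetical name.

```python
import re


def _joined_len(parts):
    """Length of ' '.join(parts) without building the string."""
    return sum(len(p) for p in parts) + max(len(parts) - 1, 0)


def chunk_sentences(text, max_chunk_size=1024, overlap_ratio=0.15):
    """Group whole sentences into chunks of at most max_chunk_size characters.

    A single sentence longer than max_chunk_size is kept whole in its own
    chunk instead of being cut mid-sentence.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    overlap_budget = int(max_chunk_size * overlap_ratio)
    chunks, current = [], []
    for sentence in sentences:
        if current and _joined_len(current + [sentence]) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward, staying within the budget.
            carried = []
            for prev in reversed(current):
                if _joined_len([prev] + carried) > overlap_budget:
                    break
                carried.insert(0, prev)
            # Drop the overlap if it would push the new chunk over the limit.
            if _joined_len(carried + [sentence]) > max_chunk_size:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The regular-expression splitter will mis-handle abbreviations such as "e.g."; a production chunker would need a more careful sentence boundary detector.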

Output Formats

  • Structured JSON: Complete document hierarchy and metadata
  • Plain text: Clean extracted text with optional formatting markers
  • Chunked data: AI-ready text segments with overlap and metadata
  • Statistics report: Processing metrics and quality analysis
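
The exact JSON schema the skill emits is not given here. As one possible shape for the structured output, the sketch below nests a flat sequence of `(level, text)` pairs into a tree and serialises it with the standard `json` module. The field names `level`, `text`, and `children`, and the function `build_tree`, are hypothetical.

```python
import json


def build_tree(items):
    """Nest (level, text) pairs into a tree; levels are assumed to start at 1."""
    root = {"level": 0, "text": None, "children": []}
    stack = [root]  # path from the root to the most recently added node
    for level, text in items:
        node = {"level": level, "text": text, "children": []}
        # Pop until the top of the stack is shallower, i.e. the parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


outline = [(1, "Chapter 1"), (2, "Section 1.1"), (2, "Section 1.2"), (1, "Chapter 2")]
# ensure_ascii=False keeps accented characters readable in the output.
print(json.dumps(build_tree(outline), indent=2, ensure_ascii=False))
```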

Best Practices

  1. Memory Management: Use chunked processing for documents larger than 100 MB
  2. Parallel Processing: Leverage multiple workers for batch operations
  3. Structure Validation: Verify hierarchy detection accuracy
  4. Progress Tracking: Provide user feedback for long-running operations
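
Practices 2 and 4 can be combined in a few lines with the standard library. The sketch below runs a per-file function across a batch, reports progress through a callback, and records failures instead of aborting the batch. `process_batch`, `process_one`, and `on_progress` are hypothetical names, and how the skill's own `parallel_workers` setting is implemented is not stated in this file.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, process_one, max_workers=4, on_progress=None):
    """Apply process_one to each path in parallel.

    Returns (results, errors): two dicts keyed by path. A failing file is
    recorded in errors and does not stop the remaining files.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; report the failure later
                errors[path] = exc
            if on_progress:
                on_progress(done, len(futures))
    return results, errors
```

Threads suit I/O-bound work such as reading files; for CPU-heavy parsing, `ProcessPoolExecutor` from the same module may be a better fit, provided `process_one` can be pickled.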

Dependencies

  • python-docx: DOCX file processing
  • PyMuPDF: Advanced PDF processing
  • Pillow: Image processing for embedded content
  • pathlib: Cross-platform path handling (part of the Python standard library; no installation needed)
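
Package names and import names differ for the third-party dependencies: python-docx is imported as `docx`, PyMuPDF as `fitz`, and Pillow as `PIL`. A small check like the one below can report what is missing before a long run starts. `missing_dependencies` is a hypothetical helper, not something the skill ships.

```python
from importlib.util import find_spec

# Maps the pip package name to the top-level module it installs.
REQUIRED = {"python-docx": "docx", "PyMuPDF": "fitz", "Pillow": "PIL"}


def missing_dependencies(required=REQUIRED):
    """Return the package names whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


missing = missing_dependencies()
if missing:
    print("Install first: pip install " + " ".join(missing))
```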
