Agent skill

large-document-processing

Process large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Use when working with complex formatted documents, multi-level hierarchies, or when you need to extract structured data from large files like PDFs, DOCX, or text files.

Stars 163

Forks 31

Install this agent skill in your project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/large-document-processing

SKILL.md

Large Document Processing

Overview

A comprehensive skill for processing large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Designed for documents with complex formatting, hierarchical structures, and multi-level indentation.

Capabilities

Multi-format Support: DOCX, PDF, and text files
Structure Preservation: Maintains document hierarchy, indentation, and formatting
Memory Efficiency: Chunked processing to handle very large documents
Intelligent Parsing: Recognizes headings, lists, dictionary entries, and semantic boundaries
Progress Tracking: Real-time processing status and error recovery
Metadata Extraction: Comprehensive document analysis and statistics

Core Components

1. Advanced Document Parser

Parse complex document structures while preserving formatting and hierarchy.

Key Features:

Hierarchical structure detection (levels 1-10)
Formatting preservation (bold, italic, fonts, sizes)
Page-by-page processing for memory efficiency
Intelligent content classification
Multi-language support with accent character handling
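The hierarchy detection described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `heading_level` and `outline` are hypothetical helpers, not part of the skill, and the sketch assumes headings carry style names of the form "Heading N" (the naming python-docx reports via `paragraph.style.name` for built-in heading styles).

```python
import re
from typing import Iterable, List, Optional, Tuple

# Hypothetical helper: matches style names such as "Heading 3".
_HEADING_RE = re.compile(r"^Heading\s+(\d+)$", re.IGNORECASE)


def heading_level(style_name: str, max_level: int = 10) -> Optional[int]:
    """Return the heading level (1..max_level), or None for non-heading text."""
    match = _HEADING_RE.match(style_name.strip())
    if not match:
        return None
    level = int(match.group(1))
    return level if 1 <= level <= max_level else None


def outline(paragraphs: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (style_name, text) pairs into outline lines indented by level."""
    lines = []
    for style_name, text in paragraphs:
        level = heading_level(style_name)
        if level is not None:
            lines.append("  " * (level - 1) + text)
    return lines


# Sample input standing in for paragraphs read from a document.
sample = [
    ("Heading 1", "Introduction"),
    ("Normal", "Body text."),
    ("Heading 2", "Scope"),
    ("Heading 11", "Too deep"),  # outside the assumed 1-10 range, so skipped
]
print("\n".join(outline(sample)))
```

Keeping the level logic in a pure function makes it easy to check detection accuracy on a handful of known style names before running a 200+ page document.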

2. Implementation Pattern

```python
# Relative import: assumes this snippet runs inside the skill's package.
from .large_document_processor import LargeDocumentProcessor, ProcessingConfig

# Configure processing
config = ProcessingConfig(
    chunk_size_pages=50,       # pages handled per chunk
    parallel_workers=4,        # number of concurrent workers
    preserve_formatting=True   # keep bold, italic, fonts, and sizes
)

# Initialize processor
processor = LargeDocumentProcessor(config)

# Process document
results = processor.process_large_document(
    input_file="large_document.docx",
    output_dir="output/processed"
)
```
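To make the `chunk_size_pages` setting more concrete, here is a minimal sketch of how a page count might be split into fixed-size page ranges. `page_ranges` is a hypothetical helper written for illustration; the source does not show how `LargeDocumentProcessor` divides pages internally.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, chunk_size_pages: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page ranges covering the whole document."""
    if chunk_size_pages <= 0:
        raise ValueError("chunk_size_pages must be positive")
    for start in range(0, total_pages, chunk_size_pages):
        # The last range is shortened so it never runs past the final page.
        yield start, min(start + chunk_size_pages, total_pages)


# A 230-page document with 50 pages per chunk.
print(list(page_ranges(230, 50)))
# → [(0, 50), (50, 100), (100, 150), (150, 200), (200, 230)]
```

Because the function is a generator, only one range is materialized at a time, which fits the memory-efficient, chunk-by-chunk style the skill describes.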

3. Intelligent Text Chunking

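```python
# Relative import: assumes this snippet runs inside the skill's package.
from .intelligent_chunker import IntelligentTextChunker, ChunkType

chunker = IntelligentTextChunker(
    max_chunk_size=1024,      # upper bound on chunk size
    overlap_ratio=0.15,       # fraction of a chunk shared with the next one
    preserve_sentences=True   # avoid splitting in the middle of a sentence
)

# `text` is assumed to hold the document's extracted plain text.
chunks = chunker.chunk_document(text, ChunkType.SEMANTIC)
```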
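The parameters above suggest sentence-aware chunking with overlap. The sketch below shows one simple way that idea can work; it is not the skill's `IntelligentTextChunker`. `chunk_sentences` is a hypothetical function, sizes are measured in characters (the source does not say whether `max_chunk_size` counts characters or tokens), and sentences are split with a naive regular expression.

```python
import re
from typing import List


def chunk_sentences(text: str, max_chunk_size: int = 1024,
                    overlap_ratio: float = 0.15) -> List[str]:
    """Greedily pack whole sentences into chunks that overlap at their edges."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    overlap_budget = int(max_chunk_size * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []

    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            tail: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > overlap_budget:
                    break
                tail.insert(0, prev)
            # Drop the overlap if it would push the new chunk past the limit.
            if tail and len(" ".join(tail + [sentence])) > max_chunk_size:
                tail = []
            current = tail
        # A single sentence longer than max_chunk_size is kept whole.
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_sentences(demo, max_chunk_size=35, overlap_ratio=0.5):
    print(repr(chunk))
```

In the demo the second chunk starts with the last sentence of the first, which is the overlap; the final chunk has no overlap because carrying a sentence forward would exceed the size limit.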

Output Formats

Structured JSON: Complete document hierarchy and metadata
Plain text: Clean extracted text with optional formatting markers
Chunked data: AI-ready text segments with overlap and metadata
Statistics report: Processing metrics and quality analysis
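As an illustration of what a structured JSON hierarchy could look like, the sketch below nests a flat list of detected headings into a tree and serializes it. The field names (`level`, `title`, `children`) and the `build_hierarchy` helper are assumptions for this example; the skill's actual JSON schema is not shown in the source.

```python
import json
from typing import Any, Dict, List, Tuple


def build_hierarchy(headings: List[Tuple[int, str]]) -> List[Dict[str, Any]]:
    """Nest (level, title) pairs into a tree using a stack of open sections."""
    root: List[Dict[str, Any]] = []
    stack: List[Tuple[int, Dict[str, Any]]] = []  # (level, node) of open ancestors

    for level, title in headings:
        node: Dict[str, Any] = {"level": level, "title": title, "children": []}
        # Close any open sections at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        siblings = stack[-1][1]["children"] if stack else root
        siblings.append(node)
        stack.append((level, node))
    return root


headings = [(1, "Introduction"), (2, "Scope"), (2, "Terms"), (1, "Methods")]
print(json.dumps(build_hierarchy(headings), indent=2))
```

The stack holds only the currently open ancestors, so the tree is built in a single pass over the headings regardless of document length.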

Best Practices

Memory Management: Use chunked processing for documents larger than 100 MB
Parallel Processing: Leverage multiple workers for batch operations
Structure Validation: Verify hierarchy detection accuracy
Progress Tracking: Provide user feedback for long-running operations
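The parallel-processing and progress-tracking practices can be combined in a small batch runner. This is a sketch using only the standard library, not the skill's own implementation: `process_batch`, `count_words`, and the `on_progress` callback are hypothetical names. It uses threads for simplicity; CPU-bound parsing would likely be better served by `ProcessPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional, Tuple


def process_batch(
    items: Iterable[str],
    worker: Callable[[str], int],
    max_workers: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Run worker over items in parallel; return (results, errors) keyed by item."""
    results: Dict[str, int] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for done, future in enumerate(as_completed(futures), start=1):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[item] = str(exc)
            if on_progress:
                on_progress(done, len(futures))
    return results, errors


def count_words(text: str) -> int:
    """Stand-in worker: counts words, failing on empty input."""
    if not text:
        raise ValueError("empty document")
    return len(text.split())


progress_log = []
results, errors = process_batch(
    ["a b c", "", "d e"],
    count_words,
    on_progress=lambda done, total: progress_log.append((done, total)),
)
print(results, errors, progress_log[-1])
```

Collecting errors per item instead of raising lets one bad document fail without aborting the rest of the batch, and the callback gives the caller a hook for user-facing progress output.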

Dependencies

python-docx: DOCX file processing
PyMuPDF: Advanced PDF processing
Pillow: Image processing for embedded content
pathlib: Cross-platform path handling (part of the Python standard library, no separate install needed)

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/large-document-processing
License: MIT License
