Agent skill
document-inventory
Scan and catalog document collections with metadata extraction, categorization, and statistics. Use for auditing document libraries, preparing for knowledge base creation, or understanding large file collections.
Install this agent skill to your Project
npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/data/documents/document-inventory
SKILL.md
Document Inventory
Overview
This skill scans document collections (PDFs, Word docs, text files) and creates a structured inventory with metadata, automatic categorization, and collection statistics. Essential first step before building knowledge bases.
Quick Start
from pathlib import Path
import sqlite3
# Scan directory
documents = []
for filepath in Path("/path/to/docs").rglob("*.pdf"):
documents.append({
'filename': filepath.name,
'size': filepath.stat().st_size,
'path': str(filepath)
})
# Store in database
conn = sqlite3.connect("inventory.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT, size INTEGER, path TEXT)")
for doc in documents:
cursor.execute("INSERT INTO docs VALUES (?, ?, ?)",
(doc['filename'], doc['size'], doc['path']))
conn.commit()
print(f"Inventoried {len(documents)} documents")
When to Use
- Auditing large document libraries before processing
- Understanding the scope of a document collection
- Categorizing documents by type, source, or content
- Preparing inventories for knowledge base creation
- Generating reports on document collections
- Identifying duplicates or organizing files
Features
- Recursive scanning - Process nested directories
- Metadata extraction - Size, dates, page counts
- Auto-categorization - Pattern-based classification
- Statistics generation - Collection summaries
- SQLite storage - Queryable inventory database
- Multiple formats - PDF, DOCX, TXT, and more
Implementation
Core Inventory Builder
#!/usr/bin/env python3
"""Document inventory builder."""
import sqlite3
import os
from pathlib import Path
from datetime import datetime
import logging
*See sub-skills for full details.*
### CLI Interface
```python
#!/usr/bin/env python3
"""Document Inventory CLI."""
import argparse
import json
def main():
parser = argparse.ArgumentParser(description='Document Inventory Tool')
subparsers = parser.add_subparsers(dest='command', help='Commands')
*See sub-skills for full details.*
### Report Generator
```python
def generate_report(db_path, output_path):
"""Generate HTML inventory report."""
inventory = DocumentInventory(db_path)
stats = inventory.get_statistics()
html = f"""
<!DOCTYPE html>
<html>
<head>
*See sub-skills for full details.*
## Custom Categorization
### Extend with Your Patterns
```python
# Add custom patterns for your domain
CUSTOM_PATTERNS = {
'SPEC': 'Specifications',
'DWG': 'Drawings',
'REV': 'Revisions',
'APPROVED': 'Approved',
'DRAFT': 'Draft',
'SUPERSEDED': 'Superseded',
}
*See sub-skills for full details.*
### Multi-Level Categories
```python
def categorize_hierarchical(filepath):
"""Create hierarchical categories."""
name = filepath.name.upper()
# Primary category
primary = 'General'
if 'API' in name:
primary = 'API Standards'
elif 'ISO' in name:
*See sub-skills for full details.*
## Example Usage
```bash
# Scan directory
python inventory.py scan /path/to/documents --db inventory.db
# View statistics
python inventory.py stats --db inventory.db
# Search
python inventory.py search "API" --category "Standards"
# Export to CSV
python inventory.py export inventory.csv --db inventory.db
Related Skills
knowledge-base-builder- Build searchable database after inventorypdf/text-extractor- Extract text from inventoried PDFssemantic-search-setup- Add AI search capabilities
Version History
- 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
- 1.0.0 (2024-10-15): Initial release with SQLite storage, auto-categorization, CLI interface
Sub-Skills
- Best Practices
Sub-Skills
- Execution Checklist
- Error Handling
- Metrics
- Dependencies
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
gsd-complete-milestone
Archive completed milestone and prepare for next version
gsd-reapply-patches
Reapply local modifications after a GSD update
gsd-verify-work
Validate built features through conversational UAT
gsd-thread
Manage persistent context threads for cross-session work
clinical-trial-protocol
Generate clinical trial protocols for medical devices or drugs through a modular, waypoint-based architecture with research-only and full protocol modes.
single-cell-rna-qc
Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations.
Didn't find tool you were looking for?