document-rag-pipeline
Build complete document knowledge bases with PDF text extraction, OCR for scanned documents, vector embeddings, and semantic search. Use this for creating searchable document libraries from folders of PDFs, technical standards, or any document collection.
Install this agent skill to your Project
npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/data/documents/document-rag-pipeline
SKILL.md
Document RAG Pipeline
Overview
This skill creates a complete Retrieval-Augmented Generation (RAG) system from a folder of documents. It handles:
- Regular PDF text extraction
- OCR for scanned/image-based PDFs
- DRM-protected file detection
- Text chunking with overlap
- Vector embedding generation
- SQLite storage with full-text search
- Semantic similarity search
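To make the "text chunking with overlap" step concrete, here is a minimal sketch. It assumes character-based windows; the function name `chunk_text` and the default sizes are hypothetical and are not taken from the skill's actual script, which may chunk by words, sentences, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters of context.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so the tail
        # is not emitted again as a shorter duplicate chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing and embedding some text twice.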
Quick Start
# Install dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
# Build knowledge base
python build_knowledge_base.py /path/to/documents --embed
# Search documents
python build_knowledge_base.py /path/to/documents --search "your query"
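As a rough illustration of what a semantic search over stored embeddings might do, the sketch below ranks chunks by cosine similarity to a query vector. Cosine similarity is an assumption here, since the page does not say which metric the script uses, and the names `cosine_similarity` and `rank_chunks` are hypothetical. Plain lists stand in for the vectors an embedding model would produce.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: dict[str, Sequence[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Score every chunk against the query and return the top_k best matches."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
    toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(rank_chunks([1.0, 0.0], toy_index, top_k=2))
```

A brute-force scan like this is usually adequate for a few thousand chunks; larger collections typically call for vectorized NumPy operations or an approximate nearest-neighbor index.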
When to Use
- Building searchable knowledge bases from document folders
- Processing technical standards libraries (API, ISO, ASME, etc.)
- Creating semantic search over engineering documents
- OCR processing of scanned historical documents
- Any collection of PDFs needing intelligent search
Prerequisites
System Dependencies
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils
# macOS
brew install tesseract poppler
# Verify Tesseract
tesseract --version # Should show 5.x
Python Dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
Or with UV:
uv pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
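After installing, a short script can confirm that the prerequisites are visible to Python. This is a sketch, not part of the skill: `check_environment` is a hypothetical helper. It relies on the packages' usual import names (`fitz` for PyMuPDF, `PIL` for Pillow, `sentence_transformers` for sentence-transformers) and assumes `pdftoppm` is the relevant binary from poppler.

```python
import importlib.util
import shutil

# Map each pip package name to the module name it is imported under.
REQUIRED_MODULES = {
    "PyMuPDF": "fitz",
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
    "sentence-transformers": "sentence_transformers",
    "numpy": "numpy",
    "tqdm": "tqdm",
}

# System executables expected on PATH (pdftoppm ships with poppler).
REQUIRED_BINARIES = ["tesseract", "pdftoppm"]


def check_environment() -> dict[str, bool]:
    """Report which Python packages and system binaries can be found."""
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in REQUIRED_MODULES.items()
    }
    for binary in REQUIRED_BINARIES:
        status[binary] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:<24} {'ok' if found else 'MISSING'}")
```

The check only locates modules and executables; it does not import the packages or verify their versions.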
Related Skills
- pdf/text-extractor: Just text extraction
- semantic-search-setup: Just embeddings/search
- rag-system-builder: Add LLM Q&A layer
- knowledge-base-builder: Simpler document catalog
Version History
- 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
- 1.0.0 (2024-10-15): Initial release with OCR support, chunking, vector embeddings, semantic search
Sub-Skills
- Build Knowledge Base (+2)
- Execution Checklist
- Error Handling
- Metrics
- Architecture
- Step 1: Database Schema (+5)
- Complete Pipeline Script
- Performance Metrics (Real-World)