document-rag-pipeline

Build complete document knowledge bases with PDF text extraction, OCR for scanned documents, vector embeddings, and semantic search. Use this for creating searchable document libraries from folders of PDFs, technical standards, or any document collection.


Install this agent skill in your project:

npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/data/documents/document-rag-pipeline

SKILL.md

Document RAG Pipeline

Overview

This skill creates a complete Retrieval-Augmented Generation (RAG) system from a folder of documents. It handles:

Regular PDF text extraction
OCR for scanned/image-based PDFs
DRM-protected file detection
Text chunking with overlap
Vector embedding generation
SQLite storage with full-text search
Semantic similarity search
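
To make "text chunking with overlap" concrete, here is a minimal sketch of one common approach: fixed-size character windows that share a few characters with their neighbour, so a sentence cut at a boundary still appears whole in one chunk. The function name `chunk_text` and the default sizes are assumptions for illustration; the skill's own script may chunk by words, sentences, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: names and defaults are illustrative, not taken
    from build_knowledge_base.py.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. Larger overlaps improve recall at chunk boundaries but increase the number of embeddings to compute and store.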

Quick Start

bash

# Install dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

# Build knowledge base
python build_knowledge_base.py /path/to/documents --embed

# Search documents
python build_knowledge_base.py /path/to/documents --search "your query"
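
Semantic search of this kind usually comes down to ranking stored chunk embeddings by cosine similarity to the query embedding. The sketch below shows that ranking step with the standard library only. The three-dimensional toy vectors and the names `cosine_similarity`, `top_k`, and `toy_index` are assumptions for illustration; in a real run the vectors would come from an embedding model such as one loaded through sentence-transformers, and the script's actual search logic may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real model output (hypothetical data).
toy_index = {
    "pipe-wall-thickness": [0.9, 0.1, 0.0],
    "welding-procedure": [0.1, 0.9, 0.0],
    "invoice-template": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for chunk_id, score in results:
    print(f"{score:.3f}  {chunk_id}")
```

With many thousands of chunks, the same ranking is normally done as a single matrix product in numpy rather than a Python loop.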

When to Use

Building searchable knowledge bases from document folders
Processing technical standards libraries (API, ISO, ASME, etc.)
Creating semantic search over engineering documents
OCR processing of scanned historical documents
Any collection of PDFs needing intelligent search
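
For the keyword side of a searchable knowledge base, SQLite's FTS5 extension is one way to index chunk text. The sketch below is an assumption about how such a store could look, not the skill's actual schema: the table name `chunks`, its columns, and the sample rows are all hypothetical, and it relies on the Python build of SQLite having FTS5 enabled (most do).

```python
import sqlite3


def build_index(rows: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load (source, body) rows into it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per text chunk, tagged with its source file.
    conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(source, body)")
    conn.executemany("INSERT INTO chunks (source, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return source names whose text matches the term, best match first."""
    cursor = conn.execute(
        "SELECT source FROM chunks WHERE chunks MATCH ? ORDER BY rank",
        (term,),
    )
    return [row[0] for row in cursor.fetchall()]


# Made-up sample chunks for illustration only.
conn = build_index([
    ("standard-a.pdf", "Line pipe specification covering wall thickness and corrosion allowance."),
    ("standard-b.pdf", "Process piping design rules and weld inspection."),
    ("scan-old.pdf", "Scanned report on corrosion found during a hull survey."),
])
print(search(conn, "corrosion"))
```

Full-text search matches exact tokens, so it complements embedding-based search: it finds clause numbers and part codes that a semantic model may blur, while the embeddings catch paraphrases.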

Prerequisites

System Dependencies

bash

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils

# macOS
brew install tesseract poppler

# Verify Tesseract
tesseract --version  # Should show 5.x
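
The same check can be made from Python before starting a long OCR run. This is a small sketch using only the standard library; the function name `tesseract_version` is an assumption for illustration.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if not installed."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some older builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


version = tesseract_version()
print(version or "tesseract not found on PATH")
```

Failing early here avoids discovering a missing OCR binary only after the text-extraction pass has finished.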

Python Dependencies

bash

pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

Or with UV:

bash

uv pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
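
Because several of these packages import under a different name than they install under (PyMuPDF imports as `fitz`, Pillow as `PIL`, sentence-transformers as `sentence_transformers`), a quick check like the following sketch can confirm the environment is complete. The helper name `missing_packages` is an assumption for illustration.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements
REQUIRED = {
    "PyMuPDF": "fitz",
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
    "sentence-transformers": "sentence_transformers",
    "numpy": "numpy",
    "tqdm": "tqdm",
}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All Python dependencies found.")
```

`find_spec` only locates the module without importing it, so the check stays fast even for heavy packages.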

Related Skills

pdf/text-extractor - Just text extraction
semantic-search-setup - Just embeddings/search
rag-system-builder - Add LLM Q&A layer
knowledge-base-builder - Simpler document catalog

Version History

1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
1.0.0 (2024-10-15): Initial release with OCR support, chunking, vector embeddings, semantic search

Sub-Skills

Build Knowledge Base (+2)
Execution Checklist
Error Handling
Metrics
Architecture
Step 1: Database Schema (+5)
Complete Pipeline Script
Performance Metrics (Real-World)

Maintainer

vamseeachanta (core maintainer)

Source details

Full Name: vamseeachanta/workspace-hub
Branch: main
Path in repo: .claude/skills/data/documents/document-rag-pipeline

