Agent skill
rag-architect
Designs production-grade RAG pipelines with chunking optimization, retrieval evaluation, and pipeline architecture. Use when building a RAG system, selecting a chunking strategy, choosing a vector database, optimizing retrieval quality, designing embedding pipelines, or evaluating RAG performance with RAGAS metrics.
Install this agent skill to your Project
npx add-skill https://github.com/borghei/Claude-Skills/tree/main/engineering/rag-architect
Metadata
Additional technical details for this skill
- tier
- POWERFUL
- author
- borghei
- domain
- ai-ml
- updated
- 1774915200
- version
- 1.0.0
- category
- engineering
SKILL.md
RAG Architect
The agent designs, implements, and optimizes production-grade Retrieval-Augmented Generation pipelines, covering the full lifecycle from document chunking through evaluation.
Workflow
- Analyse corpus -- Profile the document collection: count, average length, format mix (PDF, HTML, Markdown), language(s), and domain. Validate that sample documents are accessible before proceeding.
- Select chunking strategy -- Choose from the Chunking Strategy Matrix based on corpus characteristics. Set chunk size, overlap, and boundary rules. Run a test split on 100 sample documents.
- Choose embedding model -- Select an embedding model from the Embedding Model table based on domain, latency budget, and cost constraints. Verify dimension compatibility with the target vector database.
- Select vector database -- Pick a vector store from the Vector Database Comparison based on scale, query patterns, and operational requirements.
- Design retrieval pipeline -- Configure retrieval strategy (dense, sparse, or hybrid). Add reranking if precision requirements exceed 0.85. Set the top-K parameter and similarity threshold.
- Implement query transformations -- If query-document style mismatch exists, enable HyDE. If queries are ambiguous, enable multi-query generation. Validate each transformation improves retrieval metrics on a held-out set.
- Configure guardrails -- Enable PII detection, toxicity filtering, hallucination detection, and source attribution. Set confidence score thresholds.
- Evaluate end-to-end -- Run the RAGAS evaluation framework. Verify faithfulness > 0.90, context relevance > 0.80, answer relevance > 0.85. Iterate on weak components.
Chunking Strategy Matrix
| Strategy | Best For | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|---|
| Fixed-size (token) | Uniform docs, consistent sizing | 512-2048 tokens | 10-20% | Predictable, simple | Breaks semantic units |
| Sentence-based | Narrative text, articles | 3-8 sentences | 1 sentence | Preserves language boundaries | Variable sizes |
| Paragraph-based | Structured docs, technical manuals | 1-3 paragraphs | 0-1 paragraph | Preserves topic coherence | Highly variable sizes |
| Semantic | Long-form, research papers | Dynamic | Topic-shift detection | Best coherence | Computationally expensive |
| Recursive | Mixed content types | Dynamic, multi-level | Per-level | Optimal utilization | Complex implementation |
| Document-aware | Multi-format collections | Format-specific | Section-level | Preserves metadata | Format-specific code required |
Embedding Model Comparison
| Model | Dimensions | Speed | Quality | Cost | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~14K tok/s | Good | Free (local) | Prototyping, low-latency |
| all-mpnet-base-v2 | 768 | ~2.8K tok/s | Better | Free (local) | Balanced production use |
| text-embedding-3-small | 1536 | API | High | $0.02/1M tokens | Cost-effective production |
| text-embedding-3-large | 3072 | API | Highest | $0.13/1M tokens | Maximum quality |
| Domain fine-tuned | Varies | Varies | Domain-best | Training cost | Specialized domains (legal, medical) |
Vector Database Comparison
| Database | Type | Scaling | Key Feature | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Auto-scaling | Metadata filtering, hybrid search | Production, managed preference |
| Weaviate | Open source | Horizontal | GraphQL API, multi-modal | Complex data types |
| Qdrant | Open source | Distributed | High perf, low memory (Rust) | Performance-critical |
| Chroma | Embedded | Limited | Simple API, SQLite-backed | Prototyping, small-scale |
| pgvector | PostgreSQL ext | PostgreSQL scaling | ACID, SQL joins | Existing PostgreSQL infra |
Retrieval Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| Dense (vector similarity) | Default for semantic search | Cosine similarity with k-NN/ANN |
| Sparse (BM25/TF-IDF) | Exact keyword matching needed | Elasticsearch or inverted index |
| Hybrid (dense + sparse) | Best of both needed | Reciprocal Rank Fusion (RRF) with tuned weights |
| + Reranking | Precision must exceed 0.85 | Cross-encoder reranker after initial retrieval |
Query Transformation Techniques
| Technique | When to Use | How It Works |
|---|---|---|
| HyDE | Query/document style mismatch | LLM generates hypothetical answer; embed that instead of query |
| Multi-query | Ambiguous queries | Generate 3-5 query variations; retrieve for each; deduplicate |
| Step-back | Specific questions needing general context | Transform to broader query; retrieve general + specific |
Context Window Optimization
- Relevance ordering: Most relevant chunks first in the context window
- Diversity: Deduplicate semantically similar chunks
- Token budget: Fit within model context limit; reserve tokens for system prompt and answer
- Hierarchical inclusion: Include section summary before detailed chunks when available
- Compression: Summarize low-relevance chunks; extract key facts from verbose passages
Evaluation Metrics (RAGAS Framework)
| Metric | Target | What It Measures |
|---|---|---|
| Faithfulness | > 0.90 | Answers grounded in retrieved context |
| Context Relevance | > 0.80 | Retrieved chunks relevant to query |
| Answer Relevance | > 0.85 | Answer addresses the original question |
| Precision@K | > 0.70 | % of top-K results that are relevant |
| Recall@K | > 0.80 | % of relevant docs found in top-K |
| MRR | > 0.75 | Reciprocal rank of first relevant result |
Guardrails
- PII detection: Scan retrieved chunks and generated responses for PII; redact or block
- Hallucination detection: Compare generated claims against source documents via NLI
- Source attribution: Every factual claim must cite a retrieved chunk
- Confidence scoring: Return confidence level; if below threshold, return "I don't have enough information"
- Injection prevention: Sanitize user queries; reject prompt injection attempts
Example: Internal Knowledge Base RAG Pipeline
corpus:
documents: 12,000 Confluence pages + 3,000 PDFs
avg_length: 2,400 tokens
languages: [English]
domain: internal engineering docs
pipeline:
chunking:
strategy: recursive
max_tokens: 512
overlap: 50 tokens
boundary: paragraph
embedding:
model: text-embedding-3-small
dimensions: 1536
batch_size: 100
vector_db:
engine: pgvector
index: HNSW (ef_construction=128, m=16)
reason: "Existing PostgreSQL infra; ACID compliance for audit"
retrieval:
strategy: hybrid
dense_weight: 0.7
sparse_weight: 0.3
top_k: 10
reranker: cross-encoder/ms-marco-MiniLM-L-12-v2
final_k: 5
evaluation_results:
faithfulness: 0.93
context_relevance: 0.84
answer_relevance: 0.88
precision_at_5: 0.76
recall_at_10: 0.85
Production Patterns
- Caching: Query-level (exact match), semantic (similar queries via embedding distance < 0.05), chunk-level (embedding cache)
- Streaming: Stream generation tokens while retrieval completes; show sources after generation
- Fallbacks: If primary vector DB is unavailable, serve from read-replica; if retrieval returns no results above threshold, say so explicitly
- Document refresh: Incremental re-embedding on change detection; full re-index weekly
- Cost control: Batch embeddings, cache aggressively, route simple queries to BM25 only
Common Pitfalls
| Problem | Solution |
|---|---|
| Chunks break mid-sentence | Use boundary-aware chunking with sentence/paragraph overlap |
| Low retrieval precision | Add cross-encoder reranker; tune similarity threshold |
| High latency (> 2s) | Cache embeddings; use faster model; reduce top-K |
| Inconsistent quality | Implement RAGAS evaluation in CI; add quality scoring |
| Scalability bottleneck | Shard vector DB; implement auto-scaling; add caching layer |
Scripts
Chunking Optimizer
Analyses corpus and recommends optimal chunking strategy with parameters.
Retrieval Evaluator
Runs evaluation suite (precision, recall, MRR, NDCG) against a test query set.
Pipeline Benchmarker
Measures end-to-end latency, throughput, and cost per query across configurations.
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Chunks contain incomplete sentences or broken code blocks | Fixed-size chunking ignoring semantic boundaries | Switch to sentence-based or semantic (heading-aware) chunking; enable boundary detection in chunking_optimizer.py |
| Retrieved context is relevant but answer is wrong | LLM hallucinating beyond retrieved chunks | Enable faithfulness evaluation via RAGAS; add source attribution guardrails; lower confidence threshold to surface "I don't know" responses |
| Precision@K below 0.50 despite relevant documents existing | Embedding model does not capture domain vocabulary | Fine-tune embedding model on domain data or switch to a domain-specific model; add cross-encoder reranking stage |
| Query latency exceeds 2 seconds | Large top-K, no caching, or unoptimized HNSW index | Reduce top-K, enable query-level and semantic caching, tune HNSW parameters (ef_search, m) |
| Recall drops after adding new documents | Stale embeddings or index fragmentation after incremental inserts | Trigger full re-index; verify new documents pass chunking pipeline; check embedding model version consistency |
| Hybrid retrieval returns duplicate chunks | Dense and sparse retrievers returning overlapping results without deduplication | Apply Reciprocal Rank Fusion (RRF) with deduplication before reranking; tune dense/sparse weight ratio |
| Evaluation metrics fluctuate across runs | Non-deterministic embedding batching or insufficient test query set | Fix random seeds, increase evaluation sample size, run evaluations on a frozen ground-truth set |
Success Criteria
- Faithfulness > 0.90 -- Generated answers are grounded in retrieved context as measured by the RAGAS faithfulness metric.
- Context Relevance > 0.80 -- At least 80% of retrieved chunks are relevant to the user query.
- Precision@5 > 0.70 -- Seven out of ten top-5 result sets contain only relevant documents.
- End-to-end latency < 500ms -- P95 query-to-response latency stays under 500 milliseconds for interactive workloads.
- Recall@10 > 0.85 -- The system retrieves at least 85% of relevant documents within the top 10 results.
- Chunk boundary quality > 0.80 -- At least 80% of chunks end on clean sentence or paragraph boundaries as reported by
chunking_optimizer.py. - Monthly cost within budget -- Total embedding, vector DB, and reranking costs stay within the budget ceiling defined in requirements.
Scope & Limitations
This skill covers:
- End-to-end RAG pipeline architecture design: chunking, embedding, vector storage, retrieval, reranking, and evaluation.
- Quantitative chunking analysis across four strategy families (fixed-size, sentence, paragraph, semantic).
- Retrieval quality evaluation using standard IR metrics (Precision@K, Recall@K, MRR, NDCG) with a built-in TF-IDF baseline.
- Automated pipeline design with component selection, cost projection, and Mermaid architecture diagrams.
This skill does NOT cover:
- LLM prompt engineering or generation-side optimization -- see
engineering/prompt-engineer-toolkit. - Database schema design for metadata stores alongside vector databases -- see
engineering/database-designer. - Production observability, alerting, and SLO dashboards for deployed pipelines -- see
engineering/observability-designer. - Agent orchestration or multi-step reasoning workflows that sit on top of RAG retrieval -- see
engineering/agent-workflow-designer.
Integration Points
| Skill | Integration | Data Flow |
|---|---|---|
engineering/prompt-engineer-toolkit |
Optimize system prompts and few-shot examples fed alongside retrieved chunks | Pipeline design output --> prompt templates that reference chunk format and metadata |
engineering/database-designer |
Design relational metadata stores (tags, access control, source tracking) paired with the vector database | Vector DB recommendation --> metadata schema for hybrid storage |
engineering/observability-designer |
Set up latency, throughput, and accuracy monitoring for the deployed RAG pipeline | Evaluation metrics and SLO targets --> dashboards and alerting rules |
engineering/agent-workflow-designer |
Embed the RAG retrieval step inside multi-agent reasoning workflows | Retrieval config --> agent tool definition with top-K and threshold parameters |
engineering/ci-cd-pipeline-builder |
Automate embedding re-indexing, evaluation regression tests, and deployment on document changes | Evaluation thresholds --> CI gate that blocks deploys when metrics regress |
engineering/api-design-reviewer |
Review the query and ingestion API surface exposed by the RAG service | Pipeline config --> OpenAPI spec review for search and ingest endpoints |
Tool Reference
chunking_optimizer.py
Purpose: Analyzes a document corpus and evaluates multiple chunking strategies (fixed-size, sentence-based, paragraph-based, semantic/heading-aware) to recommend the optimal approach with configuration parameters.
Usage:
python chunking_optimizer.py <directory> [options]
Flags / Parameters:
| Flag | Type | Default | Description |
|---|---|---|---|
directory |
positional, required | -- | Directory containing text/markdown documents to analyze |
--output, -o |
string | None | Output file path for results in JSON format |
--config, -c |
string | None | JSON configuration file to customize strategy parameters (fixed_sizes, overlaps, sentence_max_sizes, paragraph_max_sizes, semantic_max_sizes) |
--extensions |
string list | .txt .md .markdown |
File extensions to include when scanning the corpus |
--verbose, -v |
flag | off | Print all strategy scores in addition to the recommendation |
Example:
python chunking_optimizer.py ./docs --output results.json --extensions .txt .md --verbose
Output Formats:
- Console -- Corpus summary, recommended strategy name, performance score, reasoning text, and two sample chunks. With
--verbose, all strategy scores are listed. - JSON (
--output) -- Full results object containingcorpus_info,strategy_results(per-strategy size statistics, boundary quality, semantic coherence, vocabulary statistics, performance score),recommendation(best strategy, all scores, reasoning), andsample_chunks.
retrieval_evaluator.py
Purpose: Evaluates retrieval system performance using a built-in TF-IDF baseline retriever and standard information retrieval metrics: Precision@K, Recall@K, MRR, and NDCG. Includes failure analysis and improvement recommendations.
Usage:
python retrieval_evaluator.py <queries> <corpus> <ground_truth> [options]
Flags / Parameters:
| Flag | Type | Default | Description |
|---|---|---|---|
queries |
positional, required | -- | JSON file containing queries (list of {"id": ..., "query": ...} objects, or {"queries": [...]}) |
corpus |
positional, required | -- | Directory containing the document corpus |
ground_truth |
positional, required | -- | JSON file mapping query IDs to lists of relevant document IDs |
--output, -o |
string | None | Output file path for results in JSON format |
--k-values |
int list | 1 3 5 10 |
K values used when computing Precision@K, Recall@K, and NDCG@K |
--extensions |
string list | .txt .md .markdown |
File extensions to include from the corpus directory |
--verbose, -v |
flag | off | Print detailed per-metric values and failure analysis counts |
Example:
python retrieval_evaluator.py queries.json ./corpus ground_truth.json --output eval.json --k-values 1 5 10 --verbose
Output Formats:
- Console -- Evaluation summary table (Precision@1, Precision@5, Recall@5, MRR, NDCG@5) with performance assessment and numbered improvement recommendations. With
--verbose, all aggregate metrics and failure analysis counts are printed. - JSON (
--output) -- Full results object containingaggregate_metrics,query_results(per-query metrics, retrieved docs, relevant docs),failure_analysis(poor precision/recall counts, zero-result counts, query length analysis, failure patterns),evaluation_summary, andrecommendations.
rag_pipeline_designer.py
Purpose: Accepts a system requirements specification and generates a complete RAG pipeline design including component recommendations (chunking, embedding, vector DB, retrieval, reranking, evaluation), cost projections, a Mermaid architecture diagram, and deployment configuration templates.
Usage:
python rag_pipeline_designer.py <requirements> [options]
Flags / Parameters:
| Flag | Type | Default | Description |
|---|---|---|---|
requirements |
positional, required | -- | JSON file containing system requirements (document_types, document_count, avg_document_size, queries_per_day, query_patterns, latency_requirement, budget_monthly, accuracy_priority, cost_priority, maintenance_complexity) |
--output, -o |
string | None | Output file path for the pipeline design in JSON format |
--verbose, -v |
flag | off | Print full configuration templates for each component |
Example:
python rag_pipeline_designer.py requirements.json --output pipeline_design.json --verbose
Output Formats:
- Console -- Design summary with total monthly cost, per-component recommendations (name, rationale, cost), and a Mermaid architecture diagram. With
--verbose, full JSON configuration templates for each component are printed. - JSON (
--output) -- Complete pipeline design object containing per-componentComponentRecommendationfields (name, type, config, rationale, pros, cons, cost_monthly),total_cost,architecture_diagram(Mermaid markup), andconfig_templates(per-component configs plus deployment/scaling/monitoring settings).
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
churn-prevention
SaaS churn reduction covering cancel flow design, dynamic save offers, exit survey architecture, dunning sequences, payment recovery, win-back campaigns, and churn impact modeling.
popup-cro
Popup and modal optimization for conversion. Covers exit-intent, slide-ins, banners, timing optimization, frequency capping, audience targeting, compliance, and A/B testing frameworks for lead capture, promotions, and announcements.
competitor-alternatives
Competitor comparison and alternative page creation for SEO and sales enablement. Covers 4 page formats (singular alternative, plural alternatives, vs pages, competitor vs competitor), content architecture, research methodology, and centralized competitor data management.
contract-and-proposal-writer
Generate production-ready business documents including freelance contracts, project proposals, SOWs, NDAs, and MSAs with jurisdiction-aware clauses. Covers US (Delaware), EU (GDPR), UK, and DACH (German law) legal frameworks. Includes contract templates, clause libraries, and DOCX conversion. Use when starting client engagements, writing proposals, drafting partnership agreements, or needing GDPR-compliant data processing addenda.
pricing-strategy
SaaS pricing design and optimization covering value metric selection, tier architecture, price point research, pricing page design, price increase execution, and competitive pricing analysis.
referral-program
Referral and affiliate program design covering referral loop architecture, incentive design, trigger moment optimization, viral coefficient modeling, affiliate program structure, and optimization playbook.
Didn't find tool you were looking for?