# optimizing-rag

Optimize RAG performance with reranking, caching, parallel processing, and production deployment patterns. Use when improving retrieval quality, adding rerankers, deploying to production, implementing caching strategies, or optimizing for scale and latency.

Install this skill into your project:

```bash
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/optimizing-rag
```

## SKILL.md
# Optimizing RAG Performance

Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns, with a focus on quick wins and production-grade improvements.

## When to Use This Skill
- Improving retrieval accuracy and quality
- Adding reranking to existing RAG pipeline
- Reducing latency and improving response time
- Deploying RAG to production
- Implementing caching for faster re-processing
- Scaling to handle more documents or queries
- Optimizing costs (API usage, compute resources)
## Quick Wins (Low Effort, High Impact)

### 1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)

**Benefit:** Lifts any embedding model to competitive retrieval performance. **Effort:** Minimal code change.

```python
from llama_index.postprocessor.cohere_rerank import CohereRerank

# For English content
reranker = CohereRerank(
    top_n=5,
    model="rerank-english-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# For Thai/multilingual content
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# Apply to query engine
query_engine = index.as_query_engine(
    similarity_top_k=10,             # Retrieve more candidates
    node_postprocessors=[reranker]   # Rerank to top 5
)
```

**Best practice:** Retrieve more candidates than you need (here `similarity_top_k=10`), then rerank down to the final `top_n`.
### 2. Enable Parallel Loading → 13x Speedup (30 minutes)

**Benefit:** 391s → 31s for 32 PDF files. **Effort:** One parameter change.

```python
from llama_index.core import SimpleDirectoryReader

# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()

# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
    num_workers=10  # Adjust based on CPU cores
)
```
### 3. Optimize Batch Size → Faster Embeddings (15 minutes)

**Benefit:** Fewer API calls, better throughput. **Effort:** Configuration change.

```python
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100  # Up from the default of 10
)
```
## Quick Decision Guide

### Reranker Selection

- Best quality → CohereRerank (API-based, best performance)
- Best open-source → bge-reranker-large (local, free)
- Cost-effective → SentenceTransformerRerank (local, fast; see the sketch below)
- Multi-language → CohereRerank multilingual-v3.0
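A minimal sketch of the cost-effective local option, assuming a recent `llama-index` with `sentence-transformers` installed; the cross-encoder model name here is one common choice, not a requirement:

```python
from llama_index.core.postprocessor import SentenceTransformerRerank

# Local cross-encoder reranker: no API key, runs on CPU or GPU
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=5,
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)
```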
### Caching Strategy

- Development → Local cache (`pipeline.persist`)
- Production → Redis cache (distributed)
- When to clear → Model changes, schema updates

### Production Deployment

- Small scale (<1M docs) → Serverless (auto-scaling)
- Large scale → Container-based (consistent performance)
- Multi-tenant → Collection isolation + metadata filtering (sketch below)
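For the multi-tenant case, a minimal sketch of the metadata-filtering half, assuming each document was ingested with a hypothetical `tenant_id` metadata field (this uses the same `MetadataFilters` API shown in Pattern 4 below):

```python
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Assumption: every node carries a "tenant_id" metadata value set at ingestion time
tenant_filters = MetadataFilters(
    filters=[ExactMatchFilter(key="tenant_id", value="acme-corp")]
)

query_engine = index.as_query_engine(
    filters=tenant_filters,  # each tenant only searches its own documents
    similarity_top_k=5,
)
```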
## Optimization Patterns

### Pattern 1: Add Best-Practice Reranking

```python
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
    model="BAAI/bge-reranker-large",
    top_n=5
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker]
)
```

**Performance:** OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR
### Pattern 2: Pipeline Caching (Local)

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Create pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding()
    ]
)

# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")

# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents)  # Only processes new/changed docs
```
### Pattern 3: Redis Cache (Production)

```python
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore

# Distributed caching
ingest_cache = IngestionCache(
    cache=RedisKVStore.from_host_and_port(
        host="redis-server",
        port=6379
    ),
    collection="rag_pipeline_cache"
)

pipeline = IngestionPipeline(
    transformations=[...],
    cache=ingest_cache
)

# Cache shared across instances
nodes = pipeline.run(documents=documents)
```
### Pattern 4: Advanced Retrieval Strategies

**Metadata pre-filtering (sub-50ms):**

```python
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Filter before vector search (can cut the search space by ~90%)
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="category", value="technical"),
        ExactMatchFilter(key="year", value="2024")
    ]
)

query_engine = index.as_query_engine(
    filters=filters,  # Narrow the search space first
    similarity_top_k=5
)
```
**Document summary retrieval (for 100+ docs):**

```python
from llama_index.core import DocumentSummaryIndex, get_response_synthesizer

# Synthesizer used to build per-document summaries
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# Two-stage: document-level → chunk-level
summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer
)

retriever = summary_index.as_retriever(similarity_top_k=3)
```
**Chunk decoupling (precision + context):**

```python
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Embed single sentences, retrieve with surrounding context windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Sentences before/after
    window_metadata_key="window"
)

# Replace each retrieved sentence with its window for synthesis
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    node_postprocessors=[postprocessor],
    similarity_top_k=6
)
```
### Pattern 5: Multiple Postprocessors Chain

```python
from llama_index.core.postprocessor import (
    MetadataReplacementPostProcessor,
    SimilarityPostprocessor,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Progressive refinement
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),  # Filter low scores
        CohereRerank(top_n=10),                          # Rerank top candidates
        MetadataReplacementPostProcessor(...)            # Expand context
    ]
)
```
## Your Codebase Integration

### For src/ Pipeline

**Add reranking to all 7 strategies:**

- `src/10_basic_query_engine.py` → Add CohereRerank
- `src/16_hybrid_search.py` → Add reranker after fusion
- All other strategies → Consistent reranking layer

**Enable caching:**

- `src/09_enhanced_batch_embeddings.py` → Add pipeline caching
- `src/02_prep_doc_for_embedding.py` → Cache preprocessing

### For src-iLand/ Pipeline

**Thai-optimized reranking:**

```python
# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0"  # Thai support
)
```

**Fast metadata filtering (already implemented):**

- `src-iLand/retrieval/fast_metadata_index.py` → Sub-50ms filtering
- Inverted indices for: จังหวัด (province), อำเภอ (district), ประเภทโฉนด (deed type)
- B-tree indices for: area, coordinates

**Batch processing optimization:**

- `src-iLand/docs_embedding/batch_embedding.py` → Increase batch_size
- `src-iLand/data_processing/` → Enable parallel loading
## Detailed References

Load these when you need comprehensive details:

- `reference-reranking.md`: Complete reranking guide
  - 6 reranker models with benchmarks
  - Multi-language reranking
  - Cost-performance trade-offs
  - Node postprocessor types
- `reference-production.md`: Production optimization patterns
  - Ingestion pipeline caching (local & Redis)
  - Parallel processing (13x speedup)
  - Vector store integration
  - Multi-tenancy patterns
  - Deployment architectures
  - Error handling and reliability
- `reference-advanced-retrieval.md`: Advanced retrieval strategies
  - Document summary retrieval
  - Recursive retrieval
  - Chunk decoupling
  - Sub-question decomposition
  - Fusion retrieval
  - Auto retriever
## Common Workflows

### Workflow 1: Add Reranking to Existing Pipeline

1. **Choose reranker**
   - Load `reference-reranking.md` for a comparison
   - For Thai: CohereRerank multilingual-v3.0
   - For cost: bge-reranker-large
2. **Install dependencies**
   ```bash
   pip install llama-index-postprocessor-cohere-rerank
   # OR
   pip install llama-index-postprocessor-flag-embedding-reranker
   ```
3. **Update query engine**
   - Change `similarity_top_k` from 5 → 10
   - Add reranker with `top_n=5`
4. **Test impact** (see the sketch after this list)
   - Compare retrieval quality before/after
   - Expected: 5-15% hit rate improvement
5. **Deploy to all strategies**
   - Apply consistently across retrievers
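For step 4, one way to quantify the before/after difference is a small hit-rate check over labelled test queries. This is a rough sketch, assuming a hypothetical `test_cases` list mapping each query to the document it should retrieve, and the `reranker` configured above:

```python
# Hypothetical labelled queries: each maps to the doc that should be retrieved
test_cases = [
    {"query": "What is the refund policy?", "expected_doc_id": "policy_doc"},
    # ... more cases
]

def hit_rate(retrieve_fn, cases, top_k=5):
    """Fraction of queries whose expected document appears in the top_k results."""
    hits = 0
    for case in cases:
        nodes = retrieve_fn(case["query"])[:top_k]
        if any(n.node.ref_doc_id == case["expected_doc_id"] for n in nodes):
            hits += 1
    return hits / len(cases)

baseline_retriever = index.as_retriever(similarity_top_k=5)
wide_retriever = index.as_retriever(similarity_top_k=10)

before = hit_rate(lambda q: baseline_retriever.retrieve(q), test_cases)
after = hit_rate(
    lambda q: reranker.postprocess_nodes(wide_retriever.retrieve(q), query_str=q),
    test_cases,
)
print(f"hit rate before rerank: {before:.2f}, after rerank: {after:.2f}")
```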
### Workflow 2: Enable Production Caching

1. **Choose caching backend**
   - Development → Local cache
   - Production → Redis cache
2. **Wrap pipeline**
   ```python
   pipeline = IngestionPipeline(
       transformations=[splitter, embedder],
       # cache=... (local or Redis)
   )
   ```
3. **Initial run (builds cache)**
   ```python
   nodes = pipeline.run(documents=documents)
   pipeline.persist("./cache")  # Local only
   ```
4. **Subsequent runs (use cache)**
   - Only processes new/changed documents
   - Massive speedup for re-runs
5. **Monitor cache size**
   - Clear when storage grows too large
   - Clear on model/schema changes
### Workflow 3: Deploy to Production

1. **Review production checklist**
   - Load `reference-production.md` for the full guide
2. **Implement error handling** (see the retry sketch after this list)
   - Retry logic for API calls
   - Fallback strategies for failures
   - Circuit breaker pattern
3. **Add monitoring**
   - Track retrieval latency (p50, p95, p99)
   - Monitor hit rate and MRR
   - Log embedding API usage
4. **Set up caching**
   - Redis for distributed systems
   - Query result caching
   - Embedding caching
5. **Deploy with redundancy**
   - Load balancing
   - Health checks
   - Graceful degradation
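For the retry piece of step 2, a wrapper with exponential backoff is often enough; a minimal sketch (in real code, narrow the caught exception to your LLM/embedding client's transient error types):

```python
import time

def query_with_retry(query_engine, question, max_attempts=3, base_delay=1.0):
    """Retry transient failures (rate limits, network blips) with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return query_engine.query(question)
        except Exception as exc:  # narrow this in production
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            time.sleep(delay)
```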
### Workflow 4: Optimize for Scale (100+ Documents)

1. **Add metadata filtering**
   - Tag documents with categories
   - Filter before vector search (up to ~90% search-space reduction)
2. **Implement document summaries**
   - Generate summaries for each document
   - Two-stage retrieval: doc → chunk
3. **Enable fast metadata indexing**
   - Build inverted indices for categorical fields
   - B-tree indices for numeric fields
   - See `src-iLand/retrieval/fast_metadata_index.py`
4. **Use async operations**
   ```python
   retriever = index.as_retriever(use_async=True)
   ```
5. **Monitor and tune**
   - Adjust top_k based on precision/recall
   - Optimize chunk size for the domain
## Performance Benchmarks

### Reranking Impact (from reference docs)
| Embedding | Without Rerank | With Cohere Rerank | Improvement |
|---|---|---|---|
| OpenAI | 0.870 hit rate | 0.927 hit rate | +6.6% |
| JinaAI Base | 0.880 hit rate | 0.933 hit rate | +6.0% |
| bge-large | 0.820 hit rate | 0.876 hit rate | +6.8% |
### Parallel Loading Impact
- Sequential: 391 seconds (32 PDF files)
- Parallel (10 workers): 31 seconds
- Speedup: 13x faster
### Caching Impact
- First run: Full processing time
- Cached run: Only new/changed documents
- Typical speedup: 10-100x for repeat runs
## Key Reminders

**Reranking best practices:**

- Retrieve a wider candidate pool than the final top_k, then rerank down
- Use multilingual models for Thai content
- Combine with hybrid search for best results (see the sketch below)
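One way to combine hybrid search with reranking, sketched with LlamaIndex's fusion retriever plus the BM25 retriever package (`llama-index-retrievers-bm25`); treat the exact module paths and parameters as assumptions that may vary across versions:

```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Dense + sparse retrieval fused with reciprocal rank fusion, then reranked
vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

fusion_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=10,
    num_queries=1,              # no query rewriting, just fuse the two result lists
    mode="reciprocal_rerank",
)

query_engine = RetrieverQueryEngine.from_args(
    fusion_retriever,
    node_postprocessors=[reranker],  # e.g. the CohereRerank configured earlier
)
```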
**Caching cautions:**

- Clear cache when changing embedding models
- Clear cache when updating document schema
- Monitor cache size growth

**Production essentials:**

- Implement retry logic and error handling
- Monitor latency, hit rate, costs
- Use distributed caching (Redis)
- Enable async operations for parallel retrieval
## Scripts

This skill includes utility scripts in the `scripts/` directory:

### validate_config.py

Validates RAG configuration before deployment:

```bash
python .claude/skills/optimizing-rag/scripts/validate_config.py \
  --config-file ./config.yaml
```
Checks:
- Chunk size appropriate for domain
- Embedding model consistency
- Top-k values reasonable
- Reranker configuration
### benchmark_performance.py

Measures retrieval performance:

```bash
python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
  --index-path ./index \
  --queries-file ./test_queries.txt
```
Reports:
- Retrieval latency (p50, p95, p99)
- Throughput (queries/second)
- Memory usage
## Next Steps

After optimizing:

- **Evaluate**: Use the `evaluating-rag` skill to measure improvements with hit rate and MRR
- **Monitor**: Set up continuous evaluation in production
- **Iterate**: Use metrics to guide further optimizations