Agent skill

optimizing-rag

Optimize RAG performance with reranking, caching, parallel processing, and production deployment patterns. Use when improving retrieval quality, adding rerankers, deploying to production, implementing caching strategies, or optimizing for scale and latency.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/optimizing-rag

SKILL.md

Optimizing RAG Performance

Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns. Focus on quick wins and production-grade improvements.

When to Use This Skill

  • Improving retrieval accuracy and quality
  • Adding reranking to existing RAG pipeline
  • Reducing latency and improving response time
  • Deploying RAG to production
  • Implementing caching for faster re-processing
  • Scaling to handle more documents or queries
  • Optimizing costs (API usage, compute resources)

Quick Wins (Low Effort, High Impact)

1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)

Benefit: Transforms any embedding into competitive performance Effort: Minimal code change

python
from llama_index.postprocessor.cohere_rerank import CohereRerank

# For English content
reranker = CohereRerank(
    top_n=5,
    model="rerank-english-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# For Thai/multilingual content
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# Apply to query engine
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more candidates
    node_postprocessors=[reranker]  # Rerank to top 5
)

Best Practice: Always retrieve 10x candidates, rerank to final top_k

2. Enable Parallel Loading → 13x Speedup (30 minutes)

Benefit: 391s → 31s for 32 PDF files Effort: One parameter change

python
from llama_index.core import SimpleDirectoryReader

# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()

# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
    num_workers=10  # Adjust based on CPU cores
)

3. Optimize Batch Size → Faster Embeddings (15 minutes)

Benefit: Reduced API calls, better throughput Effort: Configuration change

python
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100  # Up from default 10
)

Quick Decision Guide

Reranker Selection

  • Best quality → CohereRerank (API-based, best performance)
  • Best open-source → bge-reranker-large (local, free)
  • Cost-effective → SentenceTransformerRerank (local, fast)
  • Multi-language → CohereRerank multilingual-v3.0

Caching Strategy

  • Development → Local cache (pipeline.persist)
  • Production → Redis cache (distributed)
  • When to clear → Model changes, schema updates

Production Deployment

  • Small scale (<1M docs) → Serverless (auto-scaling)
  • Large scale → Container-based (consistent performance)
  • Multi-tenant → Collection isolation + metadata filtering

Optimization Patterns

Pattern 1: Add Best-Practice Reranking

python
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
    model="BAAI/bge-reranker-large",
    top_n=5
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker]
)

Performance: OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR

Pattern 2: Pipeline Caching (Local)

python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Create pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding()
    ]
)

# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")

# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents)  # Only processes new/changed docs

Pattern 3: Redis Cache (Production)

python
from llama_index.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore

# Distributed caching
ingest_cache = IngestionCache(
    cache=RedisKVStore.from_host_and_port(
        host="redis-server",
        port=6379
    ),
    collection="rag_pipeline_cache"
)

pipeline = IngestionPipeline(
    transformations=[...],
    cache=ingest_cache
)

# Cache shared across instances
nodes = pipeline.run(documents=documents)

Pattern 4: Advanced Retrieval Strategies

Metadata Pre-filtering (Sub-50ms):

python
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Filter before vector search (90% reduction possible)
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="category", value="technical"),
        ExactMatchFilter(key="year", value="2024")
    ]
)

query_engine = index.as_query_engine(
    filters=filters,  # Narrow search space first
    similarity_top_k=5
)

Document Summary Retrieval (For 100+ docs):

python
from llama_index.core import DocumentSummaryIndex

# Two-stage: document-level → chunk-level
summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer
)

retriever = summary_index.as_retriever(similarity_top_k=3)

Chunk Decoupling (Precision + Context):

python
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Embed sentences, retrieve with context windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Sentences before/after
    window_metadata_key="window"
)

# Replace with window for synthesis
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    node_postprocessors=[postprocessor],
    similarity_top_k=6
)

Pattern 5: Multiple Postprocessors Chain

python
from llama_index.core.postprocessor import SimilarityPostprocessor

# Progressive refinement
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),  # Filter low scores
        CohereRerank(top_n=10),                          # Rerank top candidates
        MetadataReplacementPostProcessor(...)             # Expand context
    ]
)

Your Codebase Integration

For src/ Pipeline

Add Reranking to All 7 Strategies:

  • src/10_basic_query_engine.py → Add CohereRerank
  • src/16_hybrid_search.py → Add reranker after fusion
  • All other strategies: Consistent reranking layer

Enable Caching:

  • src/09_enhanced_batch_embeddings.py → Add pipeline caching
  • src/02_prep_doc_for_embedding.py → Cache preprocessing

For src-iLand/ Pipeline

Thai-Optimized Reranking:

python
# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0"  # Thai support
)

Fast Metadata Filtering (Already implemented):

  • src-iLand/retrieval/fast_metadata_index.py → Sub-50ms filtering
  • Inverted indices for: จังหวัด, อำเภอ, ประเภทโฉนด
  • B-tree indices for: area, coordinates

Batch Processing Optimization:

  • src-iLand/docs_embedding/batch_embedding.py → Increase batch_size
  • src-iLand/data_processing/ → Enable parallel loading

Detailed References

Load these when you need comprehensive details:

  • reference-reranking.md: Complete reranking guide

    • 6 reranker models with benchmarks
    • Multi-language reranking
    • Cost-performance trade-offs
    • Node postprocessor types
  • reference-production.md: Production optimization patterns

    • Ingestion pipeline caching (local & Redis)
    • Parallel processing (13x speedup)
    • Vector store integration
    • Multi-tenancy patterns
    • Deployment architectures
    • Error handling and reliability
  • reference-advanced-retrieval.md: Advanced retrieval strategies

    • Document summary retrieval
    • Recursive retrieval
    • Chunk decoupling
    • Sub-question decomposition
    • Fusion retrieval
    • Auto retriever

Common Workflows

Workflow 1: Add Reranking to Existing Pipeline

  • Step 1: Choose reranker

    • Load reference-reranking.md for comparison
    • For Thai: CohereRerank multilingual-v3.0
    • For cost: bge-reranker-large
  • Step 2: Install dependencies

    bash
    pip install llama-index-postprocessor-cohere-rerank
    # OR
    pip install llama-index-postprocessor-flag-embedding-reranker
    
  • Step 3: Update query engine

    • Change similarity_top_k from 5 → 10
    • Add reranker with top_n=5
  • Step 4: Test impact

    • Compare retrieval quality before/after
    • Expected: 5-15% hit rate improvement
  • Step 5: Deploy to all strategies

    • Apply consistently across retrievers

Workflow 2: Enable Production Caching

  • Step 1: Choose caching backend

    • Development → Local cache
    • Production → Redis cache
  • Step 2: Wrap pipeline

    python
    pipeline = IngestionPipeline(
        transformations=[splitter, embedder],
        # cache=... (local or Redis)
    )
    
  • Step 3: Initial run (builds cache)

    python
    nodes = pipeline.run(documents=documents)
    pipeline.persist("./cache")  # Local only
    
  • Step 4: Subsequent runs (uses cache)

    • Only processes new/changed documents
    • Massive speedup for re-runs
  • Step 5: Monitor cache size

    • Clear when storage grows too large
    • Clear on model/schema changes

Workflow 3: Deploy to Production

  • Step 1: Review production checklist

    • Load reference-production.md for full guide
  • Step 2: Implement error handling

    • Retry logic for API calls
    • Fallback strategies for failures
    • Circuit breaker pattern
  • Step 3: Add monitoring

    • Track retrieval latency (p50, p95, p99)
    • Monitor hit rate and MRR
    • Log embedding API usage
  • Step 4: Set up caching

    • Redis for distributed systems
    • Query result caching
    • Embedding caching
  • Step 5: Deploy with redundancy

    • Load balancing
    • Health checks
    • Graceful degradation

Workflow 4: Optimize for Scale (100+ Documents)

  • Step 1: Add metadata filtering

    • Tag documents with categories
    • Filter before vector search (90% reduction)
  • Step 2: Implement document summaries

    • Generate summaries for each document
    • Two-stage retrieval: doc → chunk
  • Step 3: Enable fast metadata indexing

    • Build inverted indices for categorical fields
    • B-tree indices for numeric fields
    • See src-iLand/retrieval/fast_metadata_index.py
  • Step 4: Use async operations

    python
    retriever = index.as_retriever(use_async=True)
    
  • Step 5: Monitor and tune

    • Adjust top_k based on precision/recall
    • Optimize chunk size for domain

Performance Benchmarks

Reranking Impact (from reference docs)

Embedding Without Rerank With Cohere Rerank Improvement
OpenAI 0.870 hit rate 0.927 hit rate +6.6%
JinaAI Base 0.880 hit rate 0.933 hit rate +6.0%
bge-large 0.820 hit rate 0.876 hit rate +6.8%

Parallel Loading Impact

  • Sequential: 391 seconds (32 PDF files)
  • Parallel (10 workers): 31 seconds
  • Speedup: 13x faster

Caching Impact

  • First run: Full processing time
  • Cached run: Only new/changed documents
  • Typical speedup: 10-100x for repeat runs

Key Reminders

Reranking Best Practices:

  • Always retrieve 10x candidates, rerank to top-k
  • Use multilingual models for Thai content
  • Combine with hybrid search for best results

Caching Cautions:

  • Clear cache when changing embedding models
  • Clear cache when updating document schema
  • Monitor cache size growth

Production Essentials:

  • Implement retry logic and error handling
  • Monitor latency, hit rate, costs
  • Use distributed caching (Redis)
  • Enable async operations for parallel retrieval

Scripts

This skill includes utility scripts in the scripts/ directory:

validate_config.py

Validates RAG configuration before deployment:

bash
python .claude/skills/optimizing-rag/scripts/validate_config.py \
    --config-file ./config.yaml

Checks:

  • Chunk size appropriate for domain
  • Embedding model consistency
  • Top-k values reasonable
  • Reranker configuration

benchmark_performance.py

Measures retrieval performance:

bash
python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
    --index-path ./index \
    --queries-file ./test_queries.txt

Reports:

  • Retrieval latency (p50, p95, p99)
  • Throughput (queries/second)
  • Memory usage

Next Steps

After optimizing:

  • Evaluate: Use evaluating-rag skill to measure improvements with hit rate and MRR
  • Monitor: Set up continuous evaluation in production
  • Iterate: Use metrics to guide further optimizations

Didn't find tool you were looking for?

Be as detailed as possible for better results