Optimizing RAG Performance

Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns. Focus on quick wins and production-grade improvements.

When to Use This Skill

Improving retrieval accuracy and quality
Adding reranking to existing RAG pipeline
Reducing latency and improving response time
Deploying RAG to production
Implementing caching for faster re-processing
Scaling to handle more documents or queries
Optimizing costs (API usage, compute resources)

Quick Wins (Low Effort, High Impact)

1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)

Benefit: Transforms any embedding into competitive performance Effort: Minimal code change

python

from llama_index.postprocessor.cohere_rerank import CohereRerank

# For English content
reranker = CohereRerank(
    top_n=5,
    model="rerank-english-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# For Thai/multilingual content
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# Apply to query engine
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more candidates
    node_postprocessors=[reranker]  # Rerank to top 5
)

Best Practice: Always retrieve 10x candidates, rerank to final top_k

2. Enable Parallel Loading → 13x Speedup (30 minutes)

Benefit: 391s → 31s for 32 PDF files Effort: One parameter change

python

from llama_index.core import SimpleDirectoryReader

# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()

# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
    num_workers=10  # Adjust based on CPU cores
)

3. Optimize Batch Size → Faster Embeddings (15 minutes)

Benefit: Reduced API calls, better throughput Effort: Configuration change

python

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100  # Up from default 10
)

Quick Decision Guide

Reranker Selection

Best quality → CohereRerank (API-based, best performance)
Best open-source → bge-reranker-large (local, free)
Cost-effective → SentenceTransformerRerank (local, fast)
Multi-language → CohereRerank multilingual-v3.0

Caching Strategy

Development → Local cache (pipeline.persist)
Production → Redis cache (distributed)
When to clear → Model changes, schema updates

Production Deployment

Small scale (<1M docs) → Serverless (auto-scaling)
Large scale → Container-based (consistent performance)
Multi-tenant → Collection isolation + metadata filtering

Optimization Patterns

Pattern 1: Add Best-Practice Reranking

python

from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
    model="BAAI/bge-reranker-large",
    top_n=5
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker]
)

Performance: OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR

Pattern 2: Pipeline Caching (Local)

python

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Create pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding()
    ]
)

# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")

# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents)  # Only processes new/changed docs

Pattern 3: Redis Cache (Production)

python

from llama_index.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore

# Distributed caching
ingest_cache = IngestionCache(
    cache=RedisKVStore.from_host_and_port(
        host="redis-server",
        port=6379
    ),
    collection="rag_pipeline_cache"
)

pipeline = IngestionPipeline(
    transformations=[...],
    cache=ingest_cache
)

# Cache shared across instances
nodes = pipeline.run(documents=documents)

Pattern 4: Advanced Retrieval Strategies

Metadata Pre-filtering (Sub-50ms):

python

from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Filter before vector search (90% reduction possible)
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="category", value="technical"),
        ExactMatchFilter(key="year", value="2024")
    ]
)

query_engine = index.as_query_engine(
    filters=filters,  # Narrow search space first
    similarity_top_k=5
)

Document Summary Retrieval (For 100+ docs):

python

from llama_index.core import DocumentSummaryIndex

# Two-stage: document-level → chunk-level
summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer
)

retriever = summary_index.as_retriever(similarity_top_k=3)

Chunk Decoupling (Precision + Context):

python

from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Embed sentences, retrieve with context windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Sentences before/after
    window_metadata_key="window"
)

# Replace with window for synthesis
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    node_postprocessors=[postprocessor],
    similarity_top_k=6
)

Pattern 5: Multiple Postprocessors Chain

python

from llama_index.core.postprocessor import SimilarityPostprocessor

# Progressive refinement
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),  # Filter low scores
        CohereRerank(top_n=10),                          # Rerank top candidates
        MetadataReplacementPostProcessor(...)             # Expand context
    ]
)

Your Codebase Integration

For `src/` Pipeline

Add Reranking to All 7 Strategies:

src/10_basic_query_engine.py → Add CohereRerank
src/16_hybrid_search.py → Add reranker after fusion
All other strategies: Consistent reranking layer

Enable Caching:

src/09_enhanced_batch_embeddings.py → Add pipeline caching
src/02_prep_doc_for_embedding.py → Cache preprocessing

For `src-iLand/` Pipeline

Thai-Optimized Reranking:

python

# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0"  # Thai support
)

Fast Metadata Filtering (Already implemented):

src-iLand/retrieval/fast_metadata_index.py → Sub-50ms filtering
Inverted indices for: จังหวัด, อำเภอ, ประเภทโฉนด
B-tree indices for: area, coordinates

Batch Processing Optimization:

src-iLand/docs_embedding/batch_embedding.py → Increase batch_size
src-iLand/data_processing/ → Enable parallel loading

Detailed References

Load these when you need comprehensive details:

reference-reranking.md: Complete reranking guide
- 6 reranker models with benchmarks
- Multi-language reranking
- Cost-performance trade-offs
- Node postprocessor types
reference-production.md: Production optimization patterns
- Ingestion pipeline caching (local & Redis)
- Parallel processing (13x speedup)
- Vector store integration
- Multi-tenancy patterns
- Deployment architectures
- Error handling and reliability
reference-advanced-retrieval.md: Advanced retrieval strategies
- Document summary retrieval
- Recursive retrieval
- Chunk decoupling
- Sub-question decomposition
- Fusion retrieval
- Auto retriever

Common Workflows

Workflow 1: Add Reranking to Existing Pipeline

Step 1: Choose reranker
- Load reference-reranking.md for comparison
- For Thai: CohereRerank multilingual-v3.0
- For cost: bge-reranker-large

Step 2: Install dependencies

bash

pip install llama-index-postprocessor-cohere-rerank
# OR
pip install llama-index-postprocessor-flag-embedding-reranker

Step 3: Update query engine
- Change similarity_top_k from 5 → 10
- Add reranker with top_n=5
Step 4: Test impact
- Compare retrieval quality before/after
- Expected: 5-15% hit rate improvement
Step 5: Deploy to all strategies
- Apply consistently across retrievers

Workflow 2: Enable Production Caching

Step 1: Choose caching backend
- Development → Local cache
- Production → Redis cache

Step 2: Wrap pipeline

python

pipeline = IngestionPipeline(
    transformations=[splitter, embedder],
    # cache=... (local or Redis)
)

Step 3: Initial run (builds cache)

python

nodes = pipeline.run(documents=documents)
pipeline.persist("./cache")  # Local only

Step 4: Subsequent runs (uses cache)
- Only processes new/changed documents
- Massive speedup for re-runs
Step 5: Monitor cache size
- Clear when storage grows too large
- Clear on model/schema changes

Workflow 3: Deploy to Production

Step 1: Review production checklist
- Load reference-production.md for full guide
Step 2: Implement error handling
- Retry logic for API calls
- Fallback strategies for failures
- Circuit breaker pattern
Step 3: Add monitoring
- Track retrieval latency (p50, p95, p99)
- Monitor hit rate and MRR
- Log embedding API usage
Step 4: Set up caching
- Redis for distributed systems
- Query result caching
- Embedding caching
Step 5: Deploy with redundancy
- Load balancing
- Health checks
- Graceful degradation

Workflow 4: Optimize for Scale (100+ Documents)

Step 1: Add metadata filtering
- Tag documents with categories
- Filter before vector search (90% reduction)
Step 2: Implement document summaries
- Generate summaries for each document
- Two-stage retrieval: doc → chunk
Step 3: Enable fast metadata indexing
- Build inverted indices for categorical fields
- B-tree indices for numeric fields
- See src-iLand/retrieval/fast_metadata_index.py

Step 4: Use async operations

python

retriever = index.as_retriever(use_async=True)

Step 5: Monitor and tune
- Adjust top_k based on precision/recall
- Optimize chunk size for domain

Performance Benchmarks

Reranking Impact (from reference docs)

Embedding	Without Rerank	With Cohere Rerank	Improvement
OpenAI	0.870 hit rate	0.927 hit rate	+6.6%
JinaAI Base	0.880 hit rate	0.933 hit rate	+6.0%
bge-large	0.820 hit rate	0.876 hit rate	+6.8%

Parallel Loading Impact

Sequential: 391 seconds (32 PDF files)
Parallel (10 workers): 31 seconds
Speedup: 13x faster

Caching Impact

First run: Full processing time
Cached run: Only new/changed documents
Typical speedup: 10-100x for repeat runs

Key Reminders

Reranking Best Practices:

Always retrieve 10x candidates, rerank to top-k
Use multilingual models for Thai content
Combine with hybrid search for best results

Caching Cautions:

Clear cache when changing embedding models
Clear cache when updating document schema
Monitor cache size growth

Production Essentials:

Implement retry logic and error handling
Monitor latency, hit rate, costs
Use distributed caching (Redis)
Enable async operations for parallel retrieval

Scripts

This skill includes utility scripts in the scripts/ directory:

validate_config.py

Validates RAG configuration before deployment:

bash

python .claude/skills/optimizing-rag/scripts/validate_config.py \
    --config-file ./config.yaml

Checks:

Chunk size appropriate for domain
Embedding model consistency
Top-k values reasonable
Reranker configuration

benchmark_performance.py

Measures retrieval performance:

bash

python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
    --index-path ./index \
    --queries-file ./test_queries.txt

Reports:

Retrieval latency (p50, p95, p99)
Throughput (queries/second)
Memory usage

Next Steps

After optimizing:

Evaluate: Use evaluating-rag skill to measure improvements with hit rate and MRR
Monitor: Set up continuous evaluation in production
Iterate: Use metrics to guide further optimizations

Search AI Tools

optimizing-rag

Install this agent skill to your Project

SKILL.md

Optimizing RAG Performance

When to Use This Skill

Quick Wins (Low Effort, High Impact)

1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)

2. Enable Parallel Loading → 13x Speedup (30 minutes)

3. Optimize Batch Size → Faster Embeddings (15 minutes)

Quick Decision Guide

Reranker Selection

Caching Strategy

Production Deployment

Optimization Patterns

Pattern 1: Add Best-Practice Reranking

Pattern 2: Pipeline Caching (Local)

Pattern 3: Redis Cache (Production)

Pattern 4: Advanced Retrieval Strategies

Pattern 5: Multiple Postprocessors Chain

Your Codebase Integration

For `src/` Pipeline

For `src-iLand/` Pipeline

Detailed References

Common Workflows

Workflow 1: Add Reranking to Existing Pipeline

Workflow 2: Enable Production Caching

Workflow 3: Deploy to Production

Workflow 4: Optimize for Scale (100+ Documents)

Performance Benchmarks

Reranking Impact (from reference docs)

Parallel Loading Impact

Caching Impact

Key Reminders

Scripts

validate_config.py

benchmark_performance.py

Next Steps

Search AI Tools

Install this agent skill to your Project

SKILL.md

Optimizing RAG Performance

When to Use This Skill

Quick Wins (Low Effort, High Impact)

1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)

2. Enable Parallel Loading → 13x Speedup (30 minutes)

3. Optimize Batch Size → Faster Embeddings (15 minutes)

Quick Decision Guide

Reranker Selection

Caching Strategy

Production Deployment

Optimization Patterns

Pattern 1: Add Best-Practice Reranking

Pattern 2: Pipeline Caching (Local)

Pattern 3: Redis Cache (Production)

Pattern 4: Advanced Retrieval Strategies

Pattern 5: Multiple Postprocessors Chain

Your Codebase Integration

For src/ Pipeline

For src-iLand/ Pipeline

Detailed References

Common Workflows

Workflow 1: Add Reranking to Existing Pipeline

Workflow 2: Enable Production Caching

Workflow 3: Deploy to Production

Workflow 4: Optimize for Scale (100+ Documents)

Performance Benchmarks

Reranking Impact (from reference docs)

Parallel Loading Impact

Caching Impact

Key Reminders

Scripts

validate_config.py

benchmark_performance.py

Next Steps

For `src/` Pipeline

For `src-iLand/` Pipeline