Agent skill

RAG Evaluation

Comprehensive guide to evaluating Retrieval-Augmented Generation systems including retrieval metrics, generation quality, faithfulness, and end-to-end evaluation frameworks

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/rag-evaluation

What is RAG Evaluation?

Definition: Measuring the quality of both the retrieval and the generation components of a RAG system.

Components

RAG System = Retrieval + Generation

Retrieval: Query → Relevant documents
Generation: Documents + Query → Answer

Both need evaluation!

Evaluation Levels

1. Component-level: Retrieval quality, generation quality
2. End-to-end: Overall answer quality
3. User-level: User satisfaction, task success

Why RAG Evaluation Matters

RAG Quality Varies Widely

Example:

Query: "What is the capital of France?"

Bad RAG:
- Retrieves: Document about French cuisine
- Generates: "France is known for its wine" (irrelevant)

Good RAG:
- Retrieves: Document about Paris
- Generates: "The capital of France is Paris" (correct)

Retrieval Errors → Wrong Context → Bad Answers

Error Cascade:

Poor retrieval (irrelevant docs)
→ Wrong context for LLM
→ LLM generates answer from wrong info
→ Hallucination or incorrect answer

Need Metrics to Improve Systematically

Without Metrics:

"This answer seems wrong" → Unclear what to fix

With Metrics:

Context Precision: 0.3 (low) → Improve retrieval
Faithfulness: 0.6 (low) → Improve prompt to reduce hallucination

RAG Components to Evaluate

1. Retrieval: Are Relevant Docs Retrieved?

Metrics:

Precision@k
Recall@k
MRR (Mean Reciprocal Rank)
NDCG

2. Context: Is Context Sufficient for Answer?

Metrics:

Context relevance
Context precision
Context recall

3. Generation: Is Answer Correct, Relevant, Safe?

Metrics:

Faithfulness (no hallucination)
Answer relevance
Correctness
Completeness
Safety

Retrieval Evaluation Metrics

Precision@k: % of Top-k Results Relevant

Formula:

Precision@k = (# relevant docs in top-k) / k

Example:

Query: "Python list methods"
Top 5 results:
1. Python list.append() ✓ (relevant)
2. Python list.extend() ✓ (relevant)
3. Java ArrayList ✗ (not relevant)
4. Python list.pop() ✓ (relevant)
5. Python dictionaries ✗ (not relevant)

Precision@5 = 3/5 = 0.6
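
The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming relevance judgments are already available as a ranked list of booleans; the function name `precision_at_k` is hypothetical.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant.

    `relevance` is a ranked list of booleans (True = relevant).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    # Divide by k, not len(top_k), so short result lists are penalized
    return sum(top_k) / k


# Relevance labels from the "Python list methods" example above
labels = [True, True, False, True, False]
print(precision_at_k(labels, 5))  # 0.6
```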

Recall@k: % of Relevant Docs in Top-k

Formula:

Recall@k = (# relevant docs in top-k) / (total # relevant docs)

Example:

Total relevant docs: 10
Relevant docs in top-5: 3

Recall@5 = 3/10 = 0.3
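
A minimal sketch of the same calculation, assuming documents are identified by IDs and the full set of relevant IDs for the query is known. The function name and the document IDs are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    # Use a set so a document retrieved twice is only counted once
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)


relevant = [f"d{i}" for i in range(1, 11)]   # 10 relevant docs in total
retrieved = ["d1", "d2", "x1", "d3", "x2"]   # 3 of them appear in the top 5
print(recall_at_k(retrieved, relevant, 5))   # 0.3
```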

MRR (Mean Reciprocal Rank): Position of First Relevant Doc

Formula:

RR = 1 / (rank of first relevant doc)
MRR = average RR across queries

Example:

Query 1: First relevant at position 1 → RR = 1/1 = 1.0
Query 2: First relevant at position 3 → RR = 1/3 = 0.33
Query 3: First relevant at position 2 → RR = 1/2 = 0.5

MRR = (1.0 + 0.33 + 0.5) / 3 = 0.61
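
The worked example can be reproduced with a short sketch. It assumes each query's results are given as a ranked list of booleans, and it scores a query with no relevant result as 0; the function names are hypothetical.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(relevance_per_query):
    """Average reciprocal rank across queries."""
    if not relevance_per_query:
        return 0.0
    total = sum(reciprocal_rank(r) for r in relevance_per_query)
    return total / len(relevance_per_query)


# First relevant result at positions 1, 3, and 2, as in the example above
queries = [[True], [False, False, True], [False, True]]
print(round(mean_reciprocal_rank(queries), 2))  # 0.61
```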

NDCG (Normalized Discounted Cumulative Gain)

Purpose: Considers both relevance and position

Formula:

DCG@k = Σ (relevance_i / log2(i+1))
NDCG@k = DCG@k / IDCG@k (normalized)

Example:

Top 3 results with relevance scores (0-3):
Position 1: relevance = 3
Position 2: relevance = 2
Position 3: relevance = 1

DCG@3 = 3/log2(2) + 2/log2(3) + 1/log2(4)
      = 3/1 + 2/1.58 + 1/2
      = 3 + 1.26 + 0.5 = 4.76

This ranking is already in ideal order (3, 2, 1), so IDCG@3 = 4.76
NDCG@3 = 4.76 / 4.76 = 1.0
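
A minimal sketch of DCG and NDCG following the formula above. One assumption to note: the ideal ranking here is built only from the relevance grades of the retrieved list, which is a common simplification when judgments for the whole corpus are not available. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0


print(round(dcg_at_k([3, 2, 1], 3), 2))   # 4.76
print(round(ndcg_at_k([3, 2, 1], 3), 2))  # 1.0 (already in ideal order)
print(round(ndcg_at_k([1, 2, 3], 3), 2))  # 0.79 (best result ranked last)
```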

Context Relevance Score

LLM-as-Judge:

python

def evaluate_context_relevance(query, context):
    prompt = f"""
    Query: {query}
    Context: {context}
    
    Rate the relevance of the context to the query on a scale of 1-5:
    1 = Not relevant at all
    5 = Highly relevant
    
    Score:
    """
    
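    response = llm.generate(prompt)  # `llm` is assumed to return the reply as text
    # Take the first token so a trailing explanation does not break parsing
    tokens = response.split()
    score = int(tokens[0].rstrip(".,:")) if tokens else 0
    if not 1 <= score <= 5:
        raise ValueError(f"Expected a score from 1 to 5, got: {response!r}")
    return score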

Generation Evaluation Metrics

Faithfulness: Answer Grounded in Context (No Hallucination)

Definition: All claims in answer are supported by context

Example:

Context: "Paris is the capital of France. It has a population of 2.1 million."
Question: "What is the capital of France?"

Faithful answer: "Paris is the capital of France."
Unfaithful answer: "Paris is the capital of France with 10 million people." (hallucination)

Answer Relevance: Answer Addresses the Question

Definition: Answer is on-topic and addresses what was asked

Example:

Question: "What is the capital of France?"

Relevant answer: "The capital of France is Paris."
Irrelevant answer: "France is a country in Europe." (true but doesn't answer)

Correctness: Answer is Factually Correct

Definition: Answer matches ground truth

Example:

Question: "What is 2+2?"
Ground truth: "4"

Correct answer: "4"
Incorrect answer: "5"

Completeness: Answer Covers All Aspects

Definition: Answer addresses all parts of question

Example:

Question: "What are the capital and population of France?"

Complete answer: "The capital of France is Paris, and France has a population of about 67 million."
Incomplete answer: "The capital is Paris." (missing population)

Faithfulness Evaluation

Method 1: LLM-as-Judge (Does Answer Match Context?)

Prompt:

python

def evaluate_faithfulness(context, answer):
    prompt = f"""
    Context: {context}
    Answer: {answer}
    
    Is the answer fully supported by the context? 
    Check if all claims in the answer can be verified from the context.
    
    Respond with:
    - "Yes" if all claims are supported
    - "No" if any claim is not supported or contradicts the context
    - Explain your reasoning
    
    Verdict:
    """
    
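    response = llm.generate(prompt)
    # Read only the start of the verdict; the explanation may also contain "Yes"
    verdict = response.strip().strip("\"'").lower()
    return verdict.startswith("yes")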

Method 2: NLI Model (Entailment Check)

Natural Language Inference:

python

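from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def check_faithfulness(context, answer):
    # Pass premise and hypothesis as a pair so the tokenizer inserts
    # the model's own separator tokens (RoBERTa does not use "[SEP]")
    result = nli({"text": context, "text_pair": answer})
    if isinstance(result, list):  # some versions wrap a single result in a list
        result = result[0]
    
    # result looks like {"label": "ENTAILMENT", "score": 0.95}
    return result["label"].upper() == "ENTAILMENT"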

Method 3: Citation Checking (Are Claims Cited?)

Approach:

1. Extract claims from answer
2. For each claim, check if it appears in context
3. Calculate % of claims with citations

Example:

python

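def check_citations(context, answer):
    # Split the answer into sentence-level claims, dropping empty fragments
    # (otherwise the empty string after the final '.' always counts as cited)
    claims = [c.strip() for c in answer.split('.') if c.strip()]
    if not claims:
        return 0.0
    
    # Exact substring match is a strict proxy: paraphrased claims count as uncited
    cited_claims = sum(1 for claim in claims if claim in context)
    
    citation_rate = cited_claims / len(claims)
    return citation_rate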

Answer Relevance Evaluation

LLM-as-Judge: "Does Answer Address Question?"

Prompt:

python

def evaluate_relevance(question, answer):
    prompt = f"""
    Question: {question}
    Answer: {answer}
    
    Does the answer directly address the question?
    
    Rate on scale 1-5:
    1 = Completely irrelevant
    5 = Perfectly addresses the question
    
    Score:
    """
    
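    response = llm.generate(prompt)  # `llm` is assumed to return the reply as text
    # Take the first token so a trailing explanation does not break parsing
    tokens = response.split()
    score = int(tokens[0].rstrip(".,:")) if tokens else 0
    if not 1 <= score <= 5:
        raise ValueError(f"Expected a score from 1 to 5, got: {response!r}")
    return score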

Semantic Similarity (Answer vs Expected Answer)

Embedding-Based:

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(answer, expected_answer):
    # Encode both answers
    emb1 = model.encode(answer)
    emb2 = model.encode(expected_answer)
    
    # Cosine similarity (util.cos_sim returns a 1x1 tensor for two vectors)
    similarity = util.cos_sim(emb1, emb2).item()
    return similarity

User Feedback (Thumbs Up/Down)

Explicit Signal:

python

# Track user feedback
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "user_feedback": "thumbs_up",  # or "thumbs_down"
    "timestamp": "2024-01-15T10:00:00Z"
}

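# Aggregate over all logged feedback entries (here just the one above)
feedback_log = [feedback]
thumbs_up = sum(1 for f in feedback_log if f["user_feedback"] == "thumbs_up")
thumbs_down = sum(1 for f in feedback_log if f["user_feedback"] == "thumbs_down")
thumbs_up_rate = thumbs_up / (thumbs_up + thumbs_down)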

Correctness Evaluation

Ground Truth Comparison (If Available)

Exact Match:

python

def exact_match(answer, ground_truth):
    return answer.strip().lower() == ground_truth.strip().lower()
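
Exact match as written is sensitive to punctuation and articles. A minimal sketch of a more forgiving variant follows, loosely modeled on the normalization commonly used in extractive QA benchmarks; the function names are hypothetical and the exact normalization rules are an assumption you may want to adapt.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_exact_match(answer, ground_truth):
    return normalize_answer(answer) == normalize_answer(ground_truth)


print(normalized_exact_match("The Paris.", "paris"))                        # True
print(normalized_exact_match("Paris is the capital of France.", "Paris"))   # False
```

The second call shows why exact match is usually paired with a softer metric such as token-level F1: a correct but longer answer still scores zero.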

F1 Score (Token Overlap):

python

from collections import Counter

def f1_score(answer, ground_truth):
    # Count tokens (not sets) so a repeated word is credited at most as
    # often as it appears in the ground truth
    answer_tokens = answer.lower().split()
    gt_tokens = ground_truth.lower().split()
    
    if not answer_tokens or not gt_tokens:
        return 0.0
    
    common = sum((Counter(answer_tokens) & Counter(gt_tokens)).values())
    if common == 0:
        return 0.0
    
    precision = common / len(answer_tokens)
    recall = common / len(gt_tokens)
    
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

LLM-as-Judge with Rubric

Prompt:

python

def evaluate_correctness(question, answer, ground_truth):
    prompt = f"""
    Question: {question}
    Expected Answer: {ground_truth}
    Actual Answer: {answer}
    
    Is the actual answer correct compared to the expected answer?
    Consider:
    - Factual accuracy
    - Completeness
    - Semantic equivalence (different wording is OK)
    
    Rate on scale 1-5:
    1 = Completely incorrect
    5 = Perfectly correct
    
    Score:
    """
    
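    response = llm.generate(prompt)  # `llm` is assumed to return the reply as text
    # Take the first token so a trailing explanation does not break parsing
    tokens = response.split()
    score = int(tokens[0].rstrip(".,:")) if tokens else 0
    if not 1 <= score <= 5:
        raise ValueError(f"Expected a score from 1 to 5, got: {response!r}")
    return score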

Human Evaluation (Gold Standard)

Process:

1. Sample answers (e.g., 100 random)
2. Human annotators rate correctness (1-5)
3. Calculate inter-annotator agreement
4. Use as gold standard
5. Validate automated metrics against human scores
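
Step 3 (inter-annotator agreement) is often reported as Cohen's kappa for two annotators. Below is a minimal pure-Python sketch of the unweighted version; it treats the 1-5 ratings as unordered categories, so for ordinal scales a weighted kappa may be a better fit. The function name and the ratings are made up for illustration.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two annotators rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both annotators must rate the same non-empty item set")
    n = len(ratings_a)
    # Observed agreement: share of items with identical ratings
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if both annotators rated independently at their own rates
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


annotator_1 = [5, 4, 4, 2, 1, 3]
annotator_2 = [5, 4, 3, 2, 1, 3]
print(round(cohen_kappa(annotator_1, annotator_2), 2))  # 0.79
```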

RAG-Specific Metrics

Context Precision: Relevant Chunks in Context

Definition: % of retrieved chunks that are relevant

Formula:

Context Precision = (# relevant chunks) / (# total chunks retrieved)

Example:

Retrieved 5 chunks:
- Chunk 1: Relevant ✓
- Chunk 2: Relevant ✓
- Chunk 3: Not relevant ✗
- Chunk 4: Relevant ✓
- Chunk 5: Not relevant ✗

Context Precision = 3/5 = 0.6
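
A minimal sketch of this ratio, assuming a per-chunk relevance judge is available. The `is_relevant` callable is a hypothetical stand-in for a human label lookup or an LLM judge; here it is faked with a dictionary of hand labels.

```python
def calculate_context_precision(question, chunks, is_relevant):
    """Share of retrieved chunks judged relevant to the question.

    `is_relevant(question, chunk)` is a caller-supplied judge (assumption).
    """
    if not chunks:
        return 0.0
    relevant = sum(1 for chunk in chunks if is_relevant(question, chunk))
    return relevant / len(chunks)


# Hand labels standing in for a real judge, matching the example above
labels = {
    "chunk 1": True,
    "chunk 2": True,
    "chunk 3": False,
    "chunk 4": True,
    "chunk 5": False,
}
score = calculate_context_precision(
    "example question", list(labels), lambda q, c: labels[c]
)
print(score)  # 0.6
```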

Context Recall: All Needed Info Retrieved

Definition: % of needed information that was retrieved

Formula:

Context Recall = (# needed facts retrieved) / (# total needed facts)

Example:

Question: "What are the capital and population of France?"
Needed facts: [capital, population]

Retrieved context contains:
- Capital: Yes ✓
- Population: No ✗

Context Recall = 1/2 = 0.5
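
A minimal sketch of context recall under a strong simplifying assumption: each needed fact is represented by a short string, and a fact counts as retrieved only if that string appears verbatim in the context. In practice an LLM or NLI check would usually replace the substring test. The function name and the fact strings are hypothetical.

```python
def context_recall(needed_facts, context):
    """Share of needed facts found in the retrieved context (substring match)."""
    if not needed_facts:
        return 0.0
    context_lower = context.lower()
    found = [fact for fact in needed_facts if fact.lower() in context_lower]
    return len(found) / len(needed_facts)


context = "Paris is the capital and largest city of France."
needed = ["Paris is the capital", "67 million"]  # capital fact, population fact
print(context_recall(needed, context))  # 0.5
```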

Context Relevance: How Well the Context Matches the Question

LLM-as-Judge:

python

def context_relevance(question, context):
    prompt = f"""
    Question: {question}
    Context: {context}
    
    How relevant is the context to answering the question?
    
    Rate 1-5:
    1 = Not relevant
    5 = Highly relevant
    
    Score:
    """
    
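    response = llm.generate(prompt)  # `llm` is assumed to return the reply as text
    # Take the first token so a trailing explanation does not break parsing
    tokens = response.split()
    score = int(tokens[0].rstrip(".,:")) if tokens else 0
    if not 1 <= score <= 5:
        raise ValueError(f"Expected a score from 1 to 5, got: {response!r}")
    return score / 5  # Normalize to 0-1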

Answer Faithfulness: No Hallucinations

See "Faithfulness Evaluation" section above

Answer Relevance: On-Topic Answer

See "Answer Relevance Evaluation" section above

Creating Evaluation Dataset

Question-Answer Pairs (Ground Truth)

Structure:

json

{
  "question": "What is the capital of France?",
  "answer": "Paris",
  "category": "geography",
  "difficulty": "easy"
}

Question-Context-Answer Triples

Structure:

json

{
  "question": "What is the capital of France?",
  "context": "Paris is the capital and largest city of France. It has a population of 2.1 million.",
  "answer": "Paris is the capital of France.",
  "relevant_chunks": ["Paris is the capital and largest city of France."]
}

Diverse Questions (Simple, Complex, Multi-Hop)

Simple:

"What is the capital of France?" → Single fact

Complex:

"Compare the populations of Paris and London." → Multiple facts + reasoning

Multi-Hop:

"What is the population of the capital of France?" → Requires 2 steps:
1. Find capital of France (Paris)
2. Find population of Paris

Edge Cases (Ambiguous, No Answer)

Ambiguous:

"What is the best programming language?" → Subjective, no single answer

No Answer:

"What is the capital of Atlantis?" → No valid answer (fictional place)

Unanswerable from Context:

Context: "Paris is a city in France."
Question: "What is the population of Paris?"
Answer: "Cannot be determined from the given context."

Evaluation Frameworks

RAGAS (Popular Framework)

Install:

bash

pip install ragas

Usage:

python

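from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Prepare dataset. evaluate() expects a Hugging Face Dataset, not a plain dict.
# Column names follow the ragas 0.1-style schema; newer releases may differ.
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital of France. It has 2.1M people."]],
    "ground_truth": ["Paris"]
})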

# Evaluate
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

print(result)
# Example output (illustrative values; real scores depend on the judge LLM):
# {
#   "faithfulness": 0.95,
#   "answer_relevancy": 0.98,
#   "context_precision": 1.0,
#   "context_recall": 1.0
# }

TruLens

Features:

Real-time evaluation
Feedback functions
Dashboard

Usage:

python

from trulens_eval import TruChain, Feedback, Tru

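# Define feedback functions.
# groundedness_measure and relevance_measure are placeholders for feedback
# callables (for example, methods of a TruLens feedback provider) that you
# must define before running this snippet.
f_groundedness = Feedback(groundedness_measure).on_output()
f_answer_relevance = Feedback(relevance_measure).on_input_output()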

# Wrap your RAG chain
tru_recorder = TruChain(
    rag_chain,
    app_id="my_rag_app",
    feedbacks=[f_groundedness, f_answer_relevance]
)

# Use as normal
with tru_recorder:
    answer = rag_chain.run("What is the capital of France?")

# View dashboard
tru = Tru()
tru.run_dashboard()

DeepEval

Features:

Multiple metrics
LLM-as-judge
Custom metrics

Usage:

python

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Define test case (FaithfulnessMetric reads retrieval_context)
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
    retrieval_context=["Paris is the capital of France."]
)

# Evaluate
metrics = [
    AnswerRelevancyMetric(),
    FaithfulnessMetric()
]

result = evaluate(test_cases=[test_case], metrics=metrics)

Langfuse

Features:

Observability
Tracing
Evaluation
Analytics

Custom Evaluation Scripts

Example:

python

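def evaluate_rag(questions, answers, contexts, ground_truths):
    # Relies on evaluate_faithfulness, evaluate_relevance, and f1_score from the
    # sections above; calculate_context_precision is a helper you must supply
    results = []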
    
    for q, a, c, gt in zip(questions, answers, contexts, ground_truths):
        result = {
            "question": q,
            "answer": a,
            "faithfulness": evaluate_faithfulness(c, a),
            "relevance": evaluate_relevance(q, a),
            "correctness": f1_score(a, gt),
            "context_precision": calculate_context_precision(q, c)
        }
        results.append(result)
    
    # Aggregate
    avg_faithfulness = sum(r["faithfulness"] for r in results) / len(results)
    avg_relevance = sum(r["relevance"] for r in results) / len(results)
    
    return {
        "avg_faithfulness": avg_faithfulness,
        "avg_relevance": avg_relevance,
        "results": results
    }

RAGAS Metrics

Context Precision

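Definition: Measures whether the relevant retrieved contexts are ranked near the top

Calculation:

For each context chunk at rank k:
  Is it relevant to answering the question? (v_k = 1 if yes, else 0)

Context Precision@K = Σ (Precision@k × v_k) / (# relevant chunks in top K)

The simple ratio (# relevant chunks) / (# total chunks) is the unranked version of this metric.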

Context Recall

Definition: Measures if all needed information was retrieved

Calculation:

For each fact in ground truth:
  Is it present in retrieved context?

Context Recall = (# facts found) / (# total facts needed)

Faithfulness

Definition: Measures if answer is grounded in context

Calculation:

For each claim in answer:
  Can it be inferred from context?

Faithfulness = (# supported claims) / (# total claims)

Answer Relevance

Definition: Measures if answer addresses the question

Calculation:

Generate questions from answer
Compare to original question
Similarity score

Answer Semantic Similarity

Definition: Semantic similarity between answer and ground truth

Calculation:

Embedding similarity (cosine)

Answer Correctness

Definition: Combination of semantic similarity and factual correctness

Calculation:

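Weighted average of:
- Factual overlap: F1 over statements shared with the ground truth (default weight 0.75)
- Semantic similarity: embedding cosine similarity (default weight 0.25)
The weights are configurable.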

LLM-as-Judge Patterns

Use GPT-4 or Claude to Grade Answers

Example:

python

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge(question, answer, context):
    prompt = f"""
    You are evaluating a RAG system's answer.
    
    Question: {question}
    Context: {context}
    Answer: {answer}
    
    Evaluate on these criteria (1-5 scale):
    1. Faithfulness: Is answer supported by context?
    2. Relevance: Does answer address the question?
    3. Completeness: Does answer cover all aspects?
    
    Provide scores and brief reasoning.
    
    Format:
    Faithfulness: [score] - [reasoning]
    Relevance: [score] - [reasoning]
    Completeness: [score] - [reasoning]
    Overall: [average score]
    """
    
    # openai>=1.0 client API; the older openai.ChatCompletion was removed in 1.0
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content
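
The judge returns free text, so the per-criterion scores still have to be extracted before they can be averaged. A minimal sketch of a parser for the "Format:" lines in the prompt above; the function name and the sample reply are hypothetical, and the regex assumes the judge follows the requested format, with or without the square brackets.

```python
import re

# Matches lines such as "Faithfulness: 5 - reasoning" or "Relevance: [4] - ..."
SCORE_LINE = re.compile(
    r"^\s*(Faithfulness|Relevance|Completeness)\s*:\s*\[?(\d+)\]?",
    re.IGNORECASE | re.MULTILINE,
)


def parse_judge_output(text):
    """Return {criterion: score} for every well-formed 1-5 score line."""
    scores = {}
    for name, value in SCORE_LINE.findall(text):
        score = int(value)
        if 1 <= score <= 5:  # ignore out-of-range values instead of guessing
            scores[name.lower()] = score
    return scores


reply = """Faithfulness: 5 - All claims are supported.
Relevance: 4 - Mostly on topic.
Completeness: [3] - Misses the population.
Overall: 4.0"""
print(parse_judge_output(reply))
# {'faithfulness': 5, 'relevance': 4, 'completeness': 3}
```

Callers should check that all three keys are present before aggregating, since a malformed reply yields a partial dictionary.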

Provide Rubric (1-5 Scale)

Rubric:

Faithfulness:
5 = All claims fully supported by context
4 = Most claims supported, minor unsupported details
3 = Some claims supported, some not
2 = Few claims supported
1 = No claims supported (hallucination)

Relevance:
5 = Perfectly addresses question
4 = Addresses question with minor irrelevant details
3 = Partially addresses question
2 = Barely addresses question
1 = Completely irrelevant

Check Faithfulness, Relevance, Quality

See examples above

Aggregate Scores

Aggregation:

python

def aggregate_scores(evaluations):
    avg_faithfulness = sum(e["faithfulness"] for e in evaluations) / len(evaluations)
    avg_relevance = sum(e["relevance"] for e in evaluations) / len(evaluations)
    avg_completeness = sum(e["completeness"] for e in evaluations) / len(evaluations)
    
    overall = (avg_faithfulness + avg_relevance + avg_completeness) / 3
    
    return {
        "faithfulness": avg_faithfulness,
        "relevance": avg_relevance,
        "completeness": avg_completeness,
        "overall": overall
    }

Human Evaluation

Gold Standard but Expensive

Cost:

100 evaluations × 5 minutes each = 500 minutes = 8.3 hours
At $20/hour ≈ $167

vs LLM-as-judge:
100 evaluations × $0.01 each = $1

Use for Spot Checks

Strategy:

1. Evaluate 1000 examples with LLM-judge
2. Sample 100 for human evaluation
3. Compare human vs LLM scores
4. Validate LLM-judge is reliable

Validate LLM-Judge Correlation

Correlation:

python

from scipy.stats import pearsonr

# Illustrative scores; use your full set of paired human and LLM ratings here
human_scores = [4, 5, 3, 4, 2]
llm_scores = [4.2, 4.8, 3.1, 4.5, 2.3]

correlation, p_value = pearsonr(human_scores, llm_scores)
print(f"Correlation: {correlation:.2f}")  # Target: >0.7

Annotation Guidelines

Example:

markdown

# RAG Answer Evaluation Guidelines

## Faithfulness (1-5)
- Check if each claim in answer is supported by context
- 5 = All claims supported
- 1 = No claims supported (hallucination)

## Relevance (1-5)
- Does answer address the question?
- 5 = Perfectly addresses question
- 1 = Completely irrelevant

## Examples:
[Provide 5-10 examples with scores and explanations]

A/B Testing RAG Systems

Variant A vs Variant B

Setup:

Variant A: Current RAG system (baseline)
Variant B: New RAG system (improved retrieval)

Test: Same 100 questions

Same Questions, Different Systems

Process:

For each question:
  - Run through Variant A
  - Run through Variant B
  - Evaluate both answers
  - Compare metrics

Measure Metrics for Both

Metrics:

python

results_a = evaluate_rag(questions, answers_a, contexts_a, ground_truths)
results_b = evaluate_rag(questions, answers_b, contexts_b, ground_truths)

comparison = {
    "variant_a": {
        "faithfulness": results_a["avg_faithfulness"],
        "relevance": results_a["avg_relevance"]
    },
    "variant_b": {
        "faithfulness": results_b["avg_faithfulness"],
        "relevance": results_b["avg_relevance"]
    },
    "improvement": {
        "faithfulness": results_b["avg_faithfulness"] - results_a["avg_faithfulness"],
        "relevance": results_b["avg_relevance"] - results_a["avg_relevance"]
    }
}

Statistical Significance

T-Test:

python

from scipy.stats import ttest_rel

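# evaluate_rag returns per-question rows under "results"; cast booleans to float
faithfulness_a = [float(r["faithfulness"]) for r in results_a["results"]]
faithfulness_b = [float(r["faithfulness"]) for r in results_b["results"]]

t_stat, p_value = ttest_rel(faithfulness_a, faithfulness_b)

# A small p-value only says the variants differ, so check the direction too
mean_diff = sum(faithfulness_b) / len(faithfulness_b) - sum(faithfulness_a) / len(faithfulness_a)
if p_value < 0.05 and mean_diff > 0:
    print("Statistically significant improvement!")
elif p_value < 0.05:
    print("Statistically significant regression")
else:
    print("No significant difference")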

Retrieval Optimization

Tune Chunk Size

Experiment:

Chunk sizes: 256, 512, 1024, 2048 tokens
Measure: Context precision, context recall
Find optimal chunk size

Tune Number of Chunks (Top-k)

Experiment:

Top-k: 1, 3, 5, 10, 20
Measure: Context recall (more chunks usually means higher recall)
         Context precision (more chunks usually means lower precision)
Find the k that best balances recall and precision
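
The trade-off can be made concrete with a small sweep. This sketch assumes you already have one ranked result list with relevance labels and know how many relevant chunks exist in total; the function name and the labels are made up for illustration.

```python
def sweep_top_k(ranked_relevance, total_relevant, k_values):
    """Return (k, precision, recall) for each candidate k."""
    rows = []
    for k in k_values:
        hits = sum(ranked_relevance[:k])
        precision = hits / k
        recall = hits / total_relevant if total_relevant else 0.0
        rows.append((k, precision, recall))
    return rows


# Hypothetical ranked labels: 4 relevant chunks exist, all within the top 10
ranked = [True, True, False, True, False, False, True, False, False, False]
rows = sweep_top_k(ranked, total_relevant=4, k_values=[1, 3, 5, 10])
for k, precision, recall in rows:
    print(f"k={k:<2} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data precision falls from 1.00 to 0.40 while recall rises from 0.25 to 1.00 as k grows, which is the pattern the experiment is meant to expose.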

Improve Embeddings (Fine-Tuning)

Approach:

1. Collect query-document pairs
2. Fine-tune embedding model
3. Evaluate retrieval quality
4. Deploy if improved

Hybrid Search (Keyword + Semantic)

Combination:

BM25 (keyword search) + Vector search (semantic)
Normalize both score sets to a common range, then combine (e.g., 0.5 * BM25 + 0.5 * Vector)
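
A minimal sketch of the weighted combination. BM25 scores and cosine similarities live on different scales, so this version min-max normalizes each score set before mixing; that normalization step, the function names, and the sample scores are assumptions for illustration, and rank-based fusion is a common alternative.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(bm25_scores, vector_scores, alpha=0.5):
    """Rank documents by alpha * BM25 + (1 - alpha) * vector, after normalizing."""
    bm25 = min_max_normalize(bm25_scores)
    vec = min_max_normalize(vector_scores)
    docs = set(bm25) | set(vec)
    combined = {
        # A document missing from one retriever contributes 0 for that side
        doc: alpha * bm25.get(doc, 0.0) + (1 - alpha) * vec.get(doc, 0.0)
        for doc in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


bm25 = {"d1": 12.0, "d2": 4.0, "d3": 8.0}
vector = {"d1": 0.70, "d2": 0.90, "d4": 0.50}
ranked = hybrid_scores(bm25, vector)
print([doc for doc, _ in ranked])  # ['d1', 'd2', 'd3', 'd4']
```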

Re-Ranking

Process:

1. Retrieve top-100 with fast retrieval (vector search)
2. Re-rank with slow but accurate model (cross-encoder)
3. Return top-5
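
The two-stage process can be sketched independently of any particular model. Here `fast_score` and `slow_score` are caller-supplied callables standing in for vector search and a cross-encoder; the toy scorers below are hypothetical and only illustrate the control flow.

```python
def retrieve_then_rerank(query, documents, fast_score, slow_score,
                         n_candidates=100, top_k=5):
    """Stage 1: cheap scorer narrows the pool. Stage 2: costly scorer reorders it."""
    candidates = sorted(
        documents, key=lambda d: fast_score(query, d), reverse=True
    )[:n_candidates]
    reranked = sorted(
        candidates, key=lambda d: slow_score(query, d), reverse=True
    )
    return reranked[:top_k]


def word_overlap(query, doc):
    """Toy fast scorer: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_match(query, doc):
    """Toy slow scorer: 1.0 if the query appears as an exact phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "list of python snakes",
    "python list append method",
    "java list",
    "cooking pasta",
]
top = retrieve_then_rerank(
    "python list", docs, word_overlap, phrase_match, n_candidates=3, top_k=1
)
print(top)  # ['python list append method']
```

The first document ties on word overlap but loses after re-ranking, which is the kind of correction the second stage is for.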

Generation Optimization

Prompt Engineering

Improve Prompt:

Bad prompt:
"Answer the question based on the context."

Good prompt:
"Answer the question using only information from the context. 
If the answer is not in the context, say 'I don't know.'
Be concise and accurate."
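
A minimal sketch of how the stricter prompt might be assembled from retrieved chunks. The template layout, the numbered chunk markers, and the function name `build_prompt` are assumptions for illustration rather than a prescribed format.

```python
PROMPT_TEMPLATE = """Answer the question using only information from the context.
If the answer is not in the context, say 'I don't know.'
Be concise and accurate.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question, chunks):
    """Number each retrieved chunk and fill the grounded-answer template."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "What is the capital of France?",
    ["Paris is the capital of France.", "France is in Western Europe."],
)
print(prompt)
```

Numbering the chunks makes it easier to ask the model for citations later and to check them against the retrieved context.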

Model Selection (GPT-4 vs Claude)

Comparison:

Test both models on same dataset
Measure: Faithfulness, relevance, cost, latency
Choose best for your use case

Temperature Tuning

Experiment:

Temperature: 0.0, 0.3, 0.7, 1.0
Measure: Faithfulness (lower temperature typically gives more deterministic, more faithful answers)
         Creativity (higher temperature typically gives more varied answers)

System Prompts

Example:

System: "You are a helpful assistant that answers questions based on provided context. 
Always cite the context when making claims. 
If the answer is not in the context, say so."

Continuous Evaluation

Log All Queries + Answers

Logging:

python

log_entry = {
    "timestamp": "2024-01-15T10:00:00Z",
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "contexts": ["Paris is the capital..."],
    "latency_ms": 250,
    "user_id": "user123"
}

# Store in database or log file (`db` is assumed to be a pymongo database handle)
db.logs.insert_one(log_entry)

Sample for Evaluation

Sampling:

python

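# Draw a random sample of 100 logged queries for evaluation
sample = list(db.logs.aggregate([
    {"$sample": {"size": 100}}
]))

# Evaluate sample. Logged entries have no ground truth, so only
# reference-free metrics such as faithfulness and relevance apply.
for entry in sample:
    entry["faithfulness"] = evaluate_faithfulness(" ".join(entry["contexts"]), entry["answer"])
    entry["relevance"] = evaluate_relevance(entry["question"], entry["answer"])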

Track Metrics Over Time

Dashboard:

Faithfulness over time:
Jan: 0.85
Feb: 0.87 ↑
Mar: 0.82 ↓ (regression!)
Apr: 0.90 ↑

Regression Detection

Alert:

python

if current_faithfulness < baseline_faithfulness - 0.05:
    send_alert("Faithfulness dropped more than 0.05 below baseline!")
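
The same idea can be applied to a whole metric history rather than a single baseline. This sketch flags any period whose score falls by more than a threshold relative to the previous period; the function name and the 0.03 threshold are hypothetical choices for illustration, and the data is the faithfulness series from the dashboard above.

```python
def find_regressions(history, max_drop=0.03):
    """Return (label, drop) for each period that fell by more than max_drop.

    `history` is a chronological list of (label, score) pairs.
    """
    regressions = []
    for (_, previous), (label, current) in zip(history, history[1:]):
        drop = previous - current
        if drop > max_drop:
            regressions.append((label, round(drop, 2)))
    return regressions


history = [("Jan", 0.85), ("Feb", 0.87), ("Mar", 0.82), ("Apr", 0.90)]
print(find_regressions(history))  # [('Mar', 0.05)]
```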

Real-World RAG Evaluation

Customer Support Chatbot

Metrics:

Faithfulness (no hallucination)
Answer relevance
Resolution rate (did user's issue get resolved?)
User satisfaction (thumbs up/down)

Technical Documentation Q&A

Metrics:

Correctness (accurate technical info)
Completeness (covers all aspects)
Code example quality (if applicable)

Legal Document Search

Metrics:

Precision (only relevant cases)
Recall (all relevant cases found)
Citation accuracy (correct case references)

Implementation

RAGAS Evaluation Script

See "RAGAS" section above

Custom Metrics

python

def custom_rag_eval(question, answer, context, ground_truth):
    return {
        "faithfulness": evaluate_faithfulness(context, answer),
        "relevance": evaluate_relevance(question, answer),
        "correctness": f1_score(answer, ground_truth),
        "context_precision": calculate_context_precision(question, context),
        "latency": measure_latency()
    }

LLM-as-Judge Prompts

See "LLM-as-Judge Patterns" section above

Summary

Quick Reference

RAG Evaluation: Measure retrieval + generation quality

Components:

Retrieval: Precision@k, Recall@k, MRR, NDCG
Context: Precision, recall, relevance
Generation: Faithfulness, relevance, correctness

Key Metrics:

Faithfulness: No hallucination
Answer relevance: On-topic
Context precision: Relevant chunks
Context recall: All needed info

Frameworks:

RAGAS (popular)
TruLens (real-time)
DeepEval (LLM-judge)
Custom scripts

Evaluation Methods:

LLM-as-judge (GPT-4, Claude)
NLI models (entailment)
Embedding similarity
Human evaluation (gold standard)

Optimization:

Retrieval: Chunk size, top-k, embeddings, hybrid search
Generation: Prompts, model, temperature

Continuous:

Log queries
Sample for evaluation
Track metrics
Detect regressions

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/rag-evaluation
License: MIT License
