Ground Truth Management
Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation including annotation, quality control, and versioning
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/ground-truth-management
What is Ground Truth?
Definition: Correct answers for evaluation - human-verified data that serves as the gold standard for measuring AI performance.
Example
Question: "What is the capital of France?"
Ground Truth: "Paris"
AI Answer: "Paris" → Correct ✓
AI Answer: "Lyon" → Incorrect ✗
Why Ground Truth Matters
Measure Accuracy Objectively
Without ground truth: "This answer seems good" (subjective)
With ground truth: "Accuracy: 85%" (objective)
Train and Validate Models
Training: Learn from ground truth examples
Validation: Measure performance on ground truth test set
Regression Testing
Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected!
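A regression check like the one above can be automated. This is a minimal sketch; the function name `detect_regression` and the 0.01 tolerance are assumptions chosen for illustration, not values from this guide.

```python
def detect_regression(baseline_accuracy, new_accuracy, tolerance=0.01):
    """Return True when accuracy dropped by more than `tolerance`.

    The tolerance (hypothetical default: 1 percentage point) absorbs
    small run-to-run noise so that only meaningful drops are reported.
    """
    return (baseline_accuracy - new_accuracy) > tolerance


# Accuracy before and after the change, as in the example above
regressed = detect_regression(0.90, 0.85)
print("Regression detected!" if regressed else "No regression")
```

A check of this kind is typically run in CI against a fixed version of the ground truth dataset, so that changes in the score reflect the model and not the data.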
Benchmarking
Model A: 90% accuracy on ground truth
Model B: 85% accuracy on ground truth
→ Model A performs better on this dataset
Types of Ground Truth
Exact Match: Single Correct Answer
{
"question": "What is 2+2?",
"answer": "4"
}
Multiple Acceptable Answers
{
"question": "What is the capital of France?",
"acceptable_answers": ["Paris", "paris", "PARIS", "The capital is Paris"]
}
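Listing every casing variant by hand does not scale. One possible approach is to normalize both sides before comparing; the helper names `normalize` and `matches_any` below are hypothetical, and the normalization rules (lowercasing, trimming punctuation, collapsing whitespace) are assumptions you may need to adapt to your task.

```python
import string


def normalize(text):
    """Lowercase, trim surrounding punctuation and spaces, collapse whitespace."""
    text = text.lower().strip()
    text = text.strip(string.punctuation + " ")
    return " ".join(text.split())


def matches_any(predicted, acceptable_answers):
    """Return True if the prediction equals any acceptable answer after normalization."""
    normalized = {normalize(answer) for answer in acceptable_answers}
    return normalize(predicted) in normalized


acceptable = ["Paris", "The capital is Paris"]
print(matches_any(" paris. ", acceptable))
print(matches_any("Lyon", acceptable))
```

With normalization in place, the acceptable list only needs genuinely different phrasings, not case variants.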
Rubric-Based: Quality Scale
{
"question": "Summarize this article",
"rubric": {
"1": "Poor summary, missing key points",
"3": "Adequate summary, covers main points",
"5": "Excellent summary, concise and comprehensive"
}
}
Human Preference: Comparison Rankings
{
"question": "Which answer is better?",
"answer_a": "Paris is the capital of France.",
"answer_b": "The capital of France is Paris, a city of 2.1 million people.",
"preference": "B",
"reasoning": "More informative"
}
Creating Ground Truth
Manual Annotation (Humans Label)
Process:
1. Collect examples (questions, documents, images)
2. Human annotators label each
3. Quality control (review annotations)
4. Store in dataset
Expert Review (For Specialized Domains)
Medical: Doctors annotate
Legal: Lawyers annotate
Technical: Engineers annotate
Higher quality but more expensive
Crowdsourcing (Amazon MTurk)
Pros:
- Fast (many workers)
- Cheap ($0.10-1.00 per annotation)
Cons:
- Variable quality
- Need quality control
Synthetic Generation (For Some Tasks)
LLM-generated questions + answers
Careful validation needed
Good for scale, risky for quality
Use for augmentation, not sole source
Ground Truth Dataset Structure
Input (Question, Document, Image)
{
"input": {
"type": "question",
"text": "What is the capital of France?"
}
}
Expected Output (Answer, Label, Summary)
{
"expected_output": {
"type": "answer",
"text": "Paris",
"acceptable_variants": ["paris", "PARIS"]
}
}
Metadata (Difficulty, Category, Source)
{
"metadata": {
"difficulty": "easy",
"category": "geography",
"source": "wikipedia",
"language": "en"
}
}
Annotation Info (Who, When, Confidence)
{
"annotation": {
"annotator_id": "annotator_123",
"timestamp": "2024-01-15T10:00:00Z",
"confidence": 0.95,
"time_spent_seconds": 30
}
}
Complete Example:
{
"id": "example_001",
"input": {
"type": "question",
"text": "What is the capital of France?"
},
"expected_output": {
"type": "answer",
"text": "Paris",
"acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
},
"metadata": {
"difficulty": "easy",
"category": "geography",
"source": "wikipedia",
"language": "en"
},
"annotation": {
"annotator_id": "annotator_123",
"timestamp": "2024-01-15T10:00:00Z",
"confidence": 0.95
}
}
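Before a record like this enters the dataset, it can help to check its shape. The sketch below assumes the five top-level fields shown in the complete example are all required; `REQUIRED_FIELDS` and `validate_record` are hypothetical names, not part of any library.

```python
# Top-level fields from the complete example above, with their expected types.
REQUIRED_FIELDS = {
    "id": str,
    "input": dict,
    "expected_output": dict,
    "metadata": dict,
    "annotation": dict,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    # Confidence is optional here, but must be a number in [0, 1] when present.
    annotation = record.get("annotation")
    if isinstance(annotation, dict):
        confidence = annotation.get("confidence")
        if confidence is not None and not (
            isinstance(confidence, (int, float)) and 0 <= confidence <= 1
        ):
            problems.append("annotation.confidence should be between 0 and 1")
    return problems


record = {
    "id": "example_001",
    "input": {"type": "question", "text": "What is the capital of France?"},
    "expected_output": {"type": "answer", "text": "Paris"},
    "metadata": {"difficulty": "easy"},
    "annotation": {"annotator_id": "annotator_123", "confidence": 0.95},
}
print(validate_record(record))
```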
Annotation Guidelines
Clear Instructions
# Annotation Guidelines
## Task
Label whether the answer is correct.
## Instructions
1. Read the question carefully
2. Read the answer
3. Determine if answer is factually correct
4. Mark as "Correct" or "Incorrect"
## Examples
Question: "What is 2+2?"
Answer: "4"
Label: Correct
Question: "What is 2+2?"
Answer: "5"
Label: Incorrect
Examples (Good and Bad)
## Good Example
Question: "What is the capital of France?"
Answer: "Paris"
Label: Correct
Reasoning: Factually accurate and directly answers question
## Bad Example
Question: "What is the capital of France?"
Answer: "France is a country in Europe"
Label: Incorrect
Reasoning: Doesn't answer the question
Edge Case Handling
## Edge Cases
### Partially Correct
Question: "What are the capitals of France and Germany?"
Answer: "Paris"
Label: Partially Correct (missing Germany)
### Ambiguous Question
Question: "What is the best programming language?"
Label: N/A - Subjective question, no single correct answer
### No Answer in Context
Question: "What is the population of Paris?"
Context: "Paris is the capital of France."
Label: "Cannot be determined from context"
Consistency Checks
## Consistency Rules
1. Same question → Same answer
2. Synonyms are acceptable ("car" = "automobile")
3. Case-insensitive ("Paris" = "paris")
4. Extra details are OK ("Paris" vs "Paris, France")
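Rules 1 and 3 can be checked mechanically. The sketch below assumes flat records with `question` and `answer` fields; `find_conflicts` is a hypothetical helper. Synonyms and extra details (rules 2 and 4) still need human judgment, so treat its output as a review queue, not a verdict.

```python
from collections import defaultdict


def find_conflicts(records):
    """Map each normalized question to its answers when they disagree.

    Comparison is case-insensitive and ignores extra whitespace.
    """
    answers_by_question = defaultdict(set)
    for record in records:
        question = " ".join(record["question"].lower().split())
        answer = " ".join(record["answer"].lower().split())
        answers_by_question[question].add(answer)
    return {
        question: sorted(answers)
        for question, answers in answers_by_question.items()
        if len(answers) > 1
    }


records = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "what is 2+2?", "answer": "5"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of France?", "answer": "paris"},
]
print(find_conflicts(records))
```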
Quality Control
Multiple Annotators Per Example
Each example labeled by 3 annotators
Majority vote determines final label
Catches individual annotator errors
Inter-Annotator Agreement (IAA)
Measure: Do annotators agree?
Metric: Cohen's Kappa (κ) for two annotators; Fleiss' κ for three or more
Target: κ > 0.7 (good agreement)
Gold Standard Subset (Known Answers)
10% of examples have known correct labels
Mix into annotation tasks
Measure annotator accuracy on gold standard
Remove low-quality annotators
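The gold standard check can be sketched as follows. The data layout (tuples of annotator, example, label), the function names, and the 0.8 accuracy threshold are all assumptions made for illustration.

```python
def annotator_gold_accuracy(annotations, gold_labels):
    """Compute each annotator's accuracy on gold standard examples only.

    annotations: list of (annotator_id, example_id, label) tuples.
    gold_labels: dict mapping example_id to the known correct label.
    """
    correct, total = {}, {}
    for annotator_id, example_id, label in annotations:
        if example_id not in gold_labels:
            continue  # regular task item, not part of the gold subset
        total[annotator_id] = total.get(annotator_id, 0) + 1
        if label == gold_labels[example_id]:
            correct[annotator_id] = correct.get(annotator_id, 0) + 1
    return {a: correct.get(a, 0) / n for a, n in total.items()}


def flag_low_quality(accuracy_by_annotator, min_accuracy=0.8):
    """Return annotators whose gold accuracy is below the (hypothetical) threshold."""
    return sorted(a for a, acc in accuracy_by_annotator.items() if acc < min_accuracy)


gold = {"g1": 1, "g2": 0}
annotations = [
    ("ann_a", "g1", 1), ("ann_a", "g2", 0), ("ann_a", "x1", 1),
    ("ann_b", "g1", 0), ("ann_b", "g2", 0),
]
accuracy = annotator_gold_accuracy(annotations, gold)
print(accuracy, flag_low_quality(accuracy))
```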
Spot Checks by Experts
Expert reviews 10% of annotations
Validates quality
Identifies systematic errors
Inter-Annotator Agreement
Kappa Score (Cohen's κ)
from sklearn.metrics import cohen_kappa_score
annotator1 = [1, 0, 1, 1, 0] # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0] # Labels from annotator 2
kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Kappa: {kappa:.2f}")
# Interpretation:
# κ < 0.4: Poor agreement
# κ 0.4-0.6: Moderate agreement
# κ 0.6-0.8: Good agreement
# κ > 0.8: Excellent agreement
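To see what the score measures, κ can also be computed by hand as (observed agreement − chance agreement) / (1 − chance agreement). The sketch below is a plain-Python illustration of that formula, not a replacement for `cohen_kappa_score`; the function name `cohen_kappa` and the handling of the degenerate case are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples."""
    n = len(labels_a)
    # Observed agreement: share of examples where both chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if expected == 1:
        return 1.0  # both used one identical label throughout (assumed convention)
    return (observed - expected) / (1 - expected)


annotator1 = [1, 0, 1, 1, 0]
annotator2 = [1, 0, 1, 0, 0]
# observed = 4/5 = 0.80, expected = 0.24 + 0.24 = 0.48
print(f"Kappa: {cohen_kappa(annotator1, annotator2):.2f}")  # Kappa: 0.62
```

The two annotators agree on 80% of the examples, but roughly half of that agreement would be expected by chance, which is why κ is noticeably lower than the raw agreement rate.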
Fleiss' κ (Multiple Annotators)
from statsmodels.stats.inter_rater import fleiss_kappa
# 3 annotators, 5 examples
# Each row: [count_label_0, count_label_1]
data = [
[0, 3], # Example 1: All 3 annotators chose label 1
[1, 2], # Example 2: 1 chose 0, 2 chose 1
[3, 0], # Example 3: All 3 chose label 0
[2, 1], # Example 4: 2 chose 0, 1 chose 1
[0, 3], # Example 5: All 3 chose label 1
]
kappa = fleiss_kappa(data)
print(f"Fleiss' Kappa: {kappa:.2f}")
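Annotation tools usually export one label per annotator, not per-category counts. The sketch below shows one way to build the count table from raw labels and then computes Fleiss' κ by hand to make the formula visible. `to_count_table` and `fleiss_kappa_manual` are hypothetical names, and the code assumes every example has the same number of annotators.

```python
def to_count_table(ratings, categories):
    """ratings: one list of labels per example (one label per annotator)."""
    return [[row.count(category) for category in categories] for row in ratings]


def fleiss_kappa_manual(table):
    """Fleiss' kappa from a table of per-example category counts."""
    n_items = len(table)
    n_raters = sum(table[0])  # assumes the same number of raters for every example
    total = n_items * n_raters
    # Share of all ratings that fall into each category
    p_category = [
        sum(row[j] for row in table) / total for j in range(len(table[0]))
    ]
    # Per-example agreement: share of rater pairs that agree
    p_item = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    return (p_bar - p_expected) / (1 - p_expected)


# Same five examples as above, written as raw labels from 3 annotators
ratings = [[1, 1, 1], [0, 1, 1], [0, 0, 0], [0, 0, 1], [1, 1, 1]]
table = to_count_table(ratings, categories=[0, 1])
print(table)
print(f"Fleiss' Kappa: {fleiss_kappa_manual(table):.2f}")  # Fleiss' Kappa: 0.44
```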
Percentage Agreement
def percentage_agreement(annotator1, annotator2):
    agreements = sum(a == b for a, b in zip(annotator1, annotator2))
    total = len(annotator1)
    return agreements / total

agreement = percentage_agreement(annotator1, annotator2)
print(f"Agreement: {agreement:.1%}")
Target: >0.7 (Good Agreement)
If κ < 0.7:
1. Review annotation guidelines (unclear?)
2. Provide more examples
3. Train annotators
4. Simplify task
Resolving Disagreements
Majority Vote
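from collections import Counter

def majority_vote(labels):
    """Return the label chosen by more than half of the annotators, or None."""
    label, count = Counter(labels).most_common(1)[0]
    # No strict majority (e.g., [1, 0, 2]): return None so the example can be escalated
    return label if count > len(labels) / 2 else None

# 3 annotators
labels = [1, 1, 0]  # Two say 1, one says 0
final_label = majority_vote(labels)  # 1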
Expert Adjudication
If no majority (e.g., 1, 0, 2):
→ Expert reviews and decides
Discussion and Consensus
Annotators discuss disagreement
Reach consensus
Update guidelines if needed
Update Guidelines
If systematic disagreements:
→ Guidelines unclear
→ Update and re-annotate
Ground Truth for Different Tasks
Classification: Category Labels
{
"text": "This product is amazing!",
"label": "positive"
}
Q&A: Correct Answers + Acceptable Variants
{
"question": "What is the capital of France?",
"answer": "Paris",
"acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
}
Summarization: Reference Summaries
{
"document": "Long article text...",
"reference_summary": "Concise summary of key points"
}
RAG: Question + Context + Answer
{
"question": "What is the capital of France?",
"context": "Paris is the capital and largest city of France.",
"answer": "Paris",
"relevant_chunks": ["Paris is the capital and largest city of France."]
}
Generation: Multiple Acceptable Outputs
{
"prompt": "Write a haiku about spring",
"acceptable_outputs": [
"Cherry blossoms bloom\nGentle breeze carries petals\nSpring has arrived now",
"Flowers start to bloom\nBirds sing in the morning light\nSpring is here at last"
]
}
Dataset Size
Evaluation Set: 100-1000 Examples (Representative)
Purpose: Quick evaluation during development
Size: 100-1000 examples
Quality: High (manually curated)
Coverage: Representative of production
Test Set: 500-5000 Examples (Comprehensive)
Purpose: Final evaluation before deployment
Size: 500-5000 examples
Quality: High (gold standard)
Coverage: Comprehensive (all categories, edge cases)
Quality > Quantity
Better: 100 high-quality examples
Worse: 1000 low-quality examples
Cover Edge Cases
Include:
- Common cases (80%)
- Edge cases (15%)
- Adversarial cases (5%)
Dataset Maintenance
Version Control (Like Code)
# Git for dataset versioning
git init
git add dataset.jsonl
git commit -m "Initial dataset v1.0"
# Tag versions
git tag v1.0
# Update dataset
git add dataset.jsonl
git commit -m "Added 100 new examples"
git tag v1.1
Regular Updates (New Examples)
Monthly: Add 50-100 new examples
Quarterly: Major update (500+ examples)
Remove Outdated Examples
Examples that are:
- No longer relevant
- Incorrect (facts changed)
- Duplicates
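Of the three cases, duplicates are the easiest to remove automatically. A minimal sketch, assuming flat records with a `question` field and treating questions that differ only in case or whitespace as duplicates (`deduplicate` is a hypothetical helper):

```python
def deduplicate(records):
    """Keep the first record for each question, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for record in records:
        key = " ".join(record["question"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


records = [
    {"id": "1", "question": "What is 2+2?", "answer": "4"},
    {"id": "2", "question": "what is  2+2?", "answer": "4"},
    {"id": "3", "question": "Capital of France?", "answer": "Paris"},
]
print([r["id"] for r in deduplicate(records)])  # ['1', '3']
```

Irrelevant or factually outdated examples cannot be detected this way and still need periodic human review.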
Track Changes (Changelog)
# Dataset Changelog
## v1.2 (2024-02-01)
- Added 100 new examples (geography category)
- Removed 20 outdated examples
- Fixed 5 incorrect labels
## v1.1 (2024-01-01)
- Added 50 new examples (science category)
- Updated annotation guidelines
## v1.0 (2023-12-01)
- Initial release (500 examples)
Stratified Sampling
Balance by Difficulty
Easy: 40%
Medium: 40%
Hard: 20%
Balance by Category
Geography: 25%
Science: 25%
History: 25%
Math: 25%
Include Edge Cases
Common cases: 80%
Edge cases: 15%
Adversarial: 5%
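Target proportions like these can be enforced when drawing the evaluation set. The sketch below is one possible implementation; `stratified_sample`, its parameters, and the fixed seed are assumptions for illustration. It works for any record field, such as `difficulty` or `category`.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, proportions, total, seed=0):
    """Draw about `total` records, taking `proportions[stratum]` of them per stratum."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[key]].append(record)

    sample = []
    for stratum, share in proportions.items():
        pool = by_stratum.get(stratum, [])
        wanted = round(total * share)
        # Take everything available when a stratum is too small
        sample.extend(rng.sample(pool, min(wanted, len(pool))))
    return sample


pool = [
    {"id": f"{level}_{i}", "difficulty": level}
    for level in ("easy", "medium", "hard")
    for i in range(100)
]
sample = stratified_sample(
    pool, "difficulty", {"easy": 0.4, "medium": 0.4, "hard": 0.2}, total=50
)
print(len(sample))  # 50
```

If a stratum has fewer records than requested, the sample comes out smaller than `total`, which is a useful signal that more examples of that kind need to be collected.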
Representative of Production
Sample from actual production queries
Ensures dataset matches real usage
Synthetic Ground Truth
LLM-Generated Questions + Answers
def generate_synthetic_qa(document):
    # `llm` and `parse_qa_pairs` are placeholders for your own LLM client and response parser
    prompt = f"""
    Document: {document}
    Generate 5 question-answer pairs based on this document.
    Format:
    Q1: [question]
    A1: [answer]
    ...
    """
    response = llm.generate(prompt)
    qa_pairs = parse_qa_pairs(response)
    return qa_pairs
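The `parse_qa_pairs` helper used above is not defined in this guide. One possible implementation, assuming the model follows the requested `Q1:` / `A1:` format with each question and answer on a single line, might look like this:

```python
import re


def parse_qa_pairs(response):
    """Parse 'Q1: ...' / 'A1: ...' lines into a list of question-answer dicts."""
    questions, answers = {}, {}
    for line in response.splitlines():
        match = re.match(r"\s*([QA])(\d+):\s*(.+)", line)
        if not match:
            continue  # skip blank lines and any extra commentary from the model
        kind, number, text = match.groups()
        target = questions if kind == "Q" else answers
        target[int(number)] = text.strip()
    # Keep only numbers that have both a question and an answer
    return [
        {"question": questions[n], "answer": answers[n]}
        for n in sorted(questions)
        if n in answers
    ]


response = """Q1: What is the capital of France?
A1: Paris
Q2: Which river runs through Paris?
"""
print(parse_qa_pairs(response))
```

Unpaired questions (like Q2 here) are dropped silently; in practice you may prefer to log them, since malformed output is a common failure mode of generated data.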
Careful Validation Needed
LLM-generated data can have:
- Hallucinations
- Incorrect facts
- Biased questions
→ Always validate with humans
Good for Scale, Risky for Quality
Pros: Can generate 1000s quickly
Cons: Quality varies, needs validation
Use for Augmentation, Not Sole Source
Strategy:
- 80% human-annotated (high quality)
- 20% synthetic (validated)
Domain-Specific Ground Truth
Medical: Expert Annotations
Annotators: Licensed doctors
Cost: $50-100 per hour
Quality: Very high
Use case: Medical diagnosis, treatment recommendations
Legal: Lawyer Review
Annotators: Licensed lawyers
Cost: $100-300 per hour
Quality: Very high
Use case: Legal document analysis, case law
Technical: Engineer Verification
Annotators: Senior engineers
Cost: $50-150 per hour
Quality: High
Use case: Code review, technical Q&A
Ground Truth Storage
JSON/JSONL Files
{"id": "1", "question": "What is 2+2?", "answer": "4"}
{"id": "2", "question": "Capital of France?", "answer": "Paris"}
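JSONL stores one JSON object per line, so it can be read and written with the standard library alone. A minimal sketch (the helper names `write_jsonl` and `read_jsonl` are hypothetical):

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"id": "1", "question": "What is 2+2?", "answer": "4"},
    {"id": "2", "question": "Capital of France?", "answer": "Paris"},
]
write_jsonl("dataset.jsonl", records)
print(read_jsonl("dataset.jsonl") == records)  # True
```

Because each line is an independent record, JSONL files produce clean line-level diffs under version control, which makes label changes easy to review.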
Database (PostgreSQL, MongoDB)
CREATE TABLE ground_truth (
id UUID PRIMARY KEY,
question TEXT NOT NULL,
answer TEXT NOT NULL,
category VARCHAR(50),
difficulty VARCHAR(20),
created_at TIMESTAMP DEFAULT NOW()
);
Version Control (Git)
git add dataset/
git commit -m "Update ground truth dataset"
git push
Cloud Storage (S3 + Versioning)
# Enable bucket versioning first, so every upload is stored as a new version
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
aws s3 cp dataset.jsonl s3://my-bucket/ground-truth/v1.0/dataset.jsonl
Ground Truth for RAG
Structure:
{
"question": "What is the capital of France?",
"expected_answer": "Paris",
"relevant_document_chunks": [
"Paris is the capital and largest city of France."
],
"evaluation_criteria": {
"faithfulness": "Answer must be grounded in context",
"relevance": "Answer must directly address question",
"completeness": "Answer should mention Paris"
}
}
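The `relevant_document_chunks` field makes it possible to score the retrieval step separately from the generated answer. A minimal sketch of chunk-level recall, assuming chunks are compared as exact strings after trimming whitespace (`chunk_recall` is a hypothetical helper):

```python
def chunk_recall(retrieved_chunks, relevant_chunks):
    """Fraction of ground-truth relevant chunks that appear among the retrieved chunks."""
    if not relevant_chunks:
        return 1.0  # nothing to find; treated as full recall (an assumption)
    retrieved = {chunk.strip() for chunk in retrieved_chunks}
    found = sum(1 for chunk in relevant_chunks if chunk.strip() in retrieved)
    return found / len(relevant_chunks)


relevant = ["Paris is the capital and largest city of France."]
retrieved = [
    "France is a country in Western Europe.",
    "Paris is the capital and largest city of France.",
]
print(chunk_recall(retrieved, relevant))  # 1.0
```

Exact string matching only works when the evaluation and the retriever share the same chunking; if chunk boundaries differ, matching on document or chunk IDs is usually more robust.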
Evaluation with Ground Truth
Exact Match Accuracy
def exact_match(predicted, ground_truth):
    return predicted.strip().lower() == ground_truth.strip().lower()

# predictions and references are parallel lists of strings
accuracy = sum(exact_match(p, gt) for p, gt in zip(predictions, references)) / len(predictions)
F1 Score (For Overlapping Spans)
from collections import Counter

def f1_score(predicted, ground_truth):
    pred_tokens = predicted.lower().split()
    gt_tokens = ground_truth.lower().split()
    if len(pred_tokens) == 0 or len(gt_tokens) == 0:
        return 0.0
    # Multiset intersection, so repeated tokens are counted correctly
    common = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gt_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
BLEU/ROUGE (For Generation)
from nltk.translate.bleu_score import sentence_bleu
reference = [["Paris", "is", "the", "capital"]]
candidate = ["Paris", "is", "the", "capital"]
bleu = sentence_bleu(reference, candidate)
Semantic Similarity (Embedding Distance)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = model.encode("Paris is the capital of France")
emb2 = model.encode("The capital of France is Paris")
similarity = cosine_similarity([emb1], [emb2])[0][0]
Continuous Ground Truth
Production Feedback (User Thumbs Up/Down)
# Log user feedback
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "user_feedback": "thumbs_up",
    "timestamp": "2024-01-15T10:00:00Z"
}
# Add to ground truth if positive, ideally after human verification:
# a thumbs-up alone is a noisy signal. `add_to_ground_truth` is a placeholder.
if feedback["user_feedback"] == "thumbs_up":
    add_to_ground_truth(feedback["question"], feedback["answer"])
Human Review of Flagged Outputs
User flags answer as incorrect
→ Human reviews
→ If incorrect, add correct answer to ground truth
→ If correct, keep as is
Incrementally Add to Dataset
Monthly: Review 100 flagged examples
Add 50 to ground truth
Update dataset version
Tools
Annotation: Label Studio, Prodigy, CVAT
Label Studio:
pip install label-studio
label-studio start
# Open http://localhost:8080
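Prodigy (commercial; installation requires a license, see the Prodigy documentation for the exact install command):
pip install prodigy
prodigy textcat.manual dataset_name source.jsonl --label positive,negative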
Management: DVC (Data Version Control)
pip install dvc
dvc init  # run inside an existing Git repository
dvc add dataset.jsonl
git add dataset.jsonl.dvc .gitignore
git commit -m "Add dataset"
dvc push  # requires a remote configured with `dvc remote add`
Storage: S3, GCS, Local Files
See "Ground Truth Storage" section
Summary
Ground Truth: Correct answers for evaluation
Why:
- Measure accuracy objectively
- Train/validate models
- Regression testing
- Benchmarking
Types:
- Exact match
- Multiple acceptable answers
- Rubric-based
- Human preference
Creating:
- Manual annotation
- Expert review
- Crowdsourcing
- Synthetic (with validation)
Quality Control:
- Multiple annotators
- Inter-annotator agreement (κ > 0.7)
- Gold standard subset
- Expert spot checks
Dataset Size:
- Eval: 100-1000 (representative)
- Test: 500-5000 (comprehensive)
- Quality > quantity
Maintenance:
- Version control (Git)
- Regular updates
- Remove outdated
- Changelog
Tools:
- Annotation: Label Studio, Prodigy
- Management: DVC
- Storage: S3, GCS, Git