Agent skills
Ground Truth Management

Agent skill

Ground Truth Management

Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation including annotation, quality control, and versioning

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/ground-truth-management

SKILL.md

Ground Truth Management

What is Ground Truth?

Definition: Correct answers for evaluation - human-verified data that serves as the gold standard for measuring AI performance.

Example

Question: "What is the capital of France?"
Ground Truth: "Paris"

AI Answer: "Paris" → Correct ✓
AI Answer: "Lyon" → Incorrect ✗

Why Ground Truth Matters

Measure Accuracy Objectively

Without ground truth: "This answer seems good" (subjective)
With ground truth: "Accuracy: 85%" (objective)

Train and Validate Models

Training: Learn from ground truth examples
Validation: Measure performance on ground truth test set

Regression Testing

Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected!

Benchmarking

Model A: 90% accuracy on ground truth
Model B: 85% accuracy on ground truth
→ Model A is better

Types of Ground Truth

Exact Match: Single Correct Answer

json

{
  "question": "What is 2+2?",
  "answer": "4"
}

Multiple Acceptable Answers

json

{
  "question": "What is the capital of France?",
  "acceptable_answers": ["Paris", "paris", "PARIS", "The capital is Paris"]
}

Rubric-Based: Quality Scale

json

{
  "question": "Summarize this article",
  "rubric": {
    "1": "Poor summary, missing key points",
    "3": "Adequate summary, covers main points",
    "5": "Excellent summary, concise and comprehensive"
  }
}

Human Preference: Comparison Rankings

json

{
  "question": "Which answer is better?",
  "answer_a": "Paris is the capital of France.",
  "answer_b": "The capital of France is Paris, a city of 2.1 million people.",
  "preference": "B",
  "reasoning": "More informative"
}

Creating Ground Truth

Manual Annotation (Humans Label)

Process:
1. Collect examples (questions, documents, images)
2. Human annotators label each
3. Quality control (review annotations)
4. Store in dataset

Expert Review (For Specialized Domains)

Medical: Doctors annotate
Legal: Lawyers annotate
Technical: Engineers annotate

Higher quality but more expensive

Crowdsourcing (Amazon MTurk)

Pros:
- Fast (many workers)
- Cheap ($0.10-1.00 per annotation)

Cons:
- Variable quality
- Need quality control

Synthetic Generation (For Some Tasks)

LLM-generated questions + answers
Careful validation needed
Good for scale, risky for quality
Use for augmentation, not sole source

Ground Truth Dataset Structure

Input (Question, Document, Image)

json

{
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  }
}

Expected Output (Answer, Label, Summary)

json

{
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS"]
  }
}

Metadata (Difficulty, Category, Source)

json

{
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  }
}

Annotation Info (Who, When, Confidence)

json

{
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95,
    "time_spent_seconds": 30
  }
}

Complete Example:

json

{
  "id": "example_001",
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  },
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
  },
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  },
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95
  }
}

Annotation Guidelines

Clear Instructions

markdown

# Annotation Guidelines

## Task
Label whether the answer is correct.

## Instructions
1. Read the question carefully
2. Read the answer
3. Determine if answer is factually correct
4. Mark as "Correct" or "Incorrect"

## Examples
Question: "What is 2+2?"
Answer: "4"
Label: Correct

Question: "What is 2+2?"
Answer: "5"
Label: Incorrect

Examples (Good and Bad)

markdown

## Good Example
Question: "What is the capital of France?"
Answer: "Paris"
Label: Correct
Reasoning: Factually accurate and directly answers question

## Bad Example
Question: "What is the capital of France?"
Answer: "France is a country in Europe"
Label: Incorrect
Reasoning: Doesn't answer the question

Edge Case Handling

markdown

## Edge Cases

### Partially Correct
Question: "What are the capitals of France and Germany?"
Answer: "Paris"
Label: Partially Correct (missing Germany)

### Ambiguous Question
Question: "What is the best programming language?"
Label: N/A - Subjective question, no single correct answer

### No Answer in Context
Question: "What is the population of Paris?"
Context: "Paris is the capital of France."
Label: "Cannot be determined from context"

Consistency Checks

markdown

## Consistency Rules

1. Same question → Same answer
2. Synonyms are acceptable ("car" = "automobile")
3. Case-insensitive ("Paris" = "paris")
4. Extra details are OK ("Paris" vs "Paris, France")

Quality Control

Multiple Annotators Per Example

Each example labeled by 3 annotators
Majority vote determines final label
Catches individual annotator errors

Inter-Annotator Agreement (IAA)

Measure: Do annotators agree?
Metric: Cohen's Kappa (κ)
Target: κ > 0.7 (good agreement)

Gold Standard Subset (Known Answers)

10% of examples have known correct labels
Mix into annotation tasks
Measure annotator accuracy on gold standard
Remove low-quality annotators

Spot Checks by Experts

Expert reviews 10% of annotations
Validates quality
Identifies systematic errors

Inter-Annotator Agreement

Kappa Score (Cohen's κ)

python

from sklearn.metrics import cohen_kappa_score

annotator1 = [1, 0, 1, 1, 0]  # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0]  # Labels from annotator 2

kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Kappa: {kappa:.2f}")

# Interpretation:
# κ < 0.4: Poor agreement
# κ 0.4-0.6: Moderate agreement
# κ 0.6-0.8: Good agreement
# κ > 0.8: Excellent agreement

Fleiss' κ (Multiple Annotators)

python

from statsmodels.stats.inter_rater import fleiss_kappa

# 3 annotators, 5 examples
# Each row: [count_label_0, count_label_1]
data = [
    [0, 3],  # Example 1: All 3 annotators chose label 1
    [1, 2],  # Example 2: 1 chose 0, 2 chose 1
    [3, 0],  # Example 3: All 3 chose label 0
    [2, 1],  # Example 4: 2 chose 0, 1 chose 1
    [0, 3],  # Example 5: All 3 chose label 1
]

kappa = fleiss_kappa(data)
print(f"Fleiss' Kappa: {kappa:.2f}")

Percentage Agreement

python

def percentage_agreement(annotator1, annotator2):
    agreements = sum(a == b for a, b in zip(annotator1, annotator2))
    total = len(annotator1)
    return agreements / total

agreement = percentage_agreement(annotator1, annotator2)
print(f"Agreement: {agreement:.1%}")

Target: >0.7 (Good Agreement)

If κ < 0.7:
1. Review annotation guidelines (unclear?)
2. Provide more examples
3. Train annotators
4. Simplify task

Resolving Disagreements

Majority Vote

python

def majority_vote(labels):
    from collections import Counter
    counts = Counter(labels)
    majority_label = counts.most_common(1)[0][0]
    return majority_label

# 3 annotators
labels = [1, 1, 0]  # Two say 1, one says 0
final_label = majority_vote(labels)  # 1

Expert Adjudication

If no majority (e.g., 1, 0, 2):
→ Expert reviews and decides

Discussion and Consensus

Annotators discuss disagreement
Reach consensus
Update guidelines if needed

Update Guidelines

If systematic disagreements:
→ Guidelines unclear
→ Update and re-annotate

Ground Truth for Different Tasks

Classification: Category Labels

json

{
  "text": "This product is amazing!",
  "label": "positive"
}

Q&A: Correct Answers + Acceptable Variants

json

{
  "question": "What is the capital of France?",
  "answer": "Paris",
  "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
}

Summarization: Reference Summaries

json

{
  "document": "Long article text...",
  "reference_summary": "Concise summary of key points"
}

RAG: Question + Context + Answer

json

{
  "question": "What is the capital of France?",
  "context": "Paris is the capital and largest city of France.",
  "answer": "Paris",
  "relevant_chunks": ["Paris is the capital and largest city of France."]
}

Generation: Multiple Acceptable Outputs

json

{
  "prompt": "Write a haiku about spring",
  "acceptable_outputs": [
    "Cherry blossoms bloom\nGentle breeze carries petals\nSpring has arrived now",
    "Flowers start to bloom\nBirds sing in the morning light\nSpring is here at last"
  ]
}

Dataset Size

Evaluation Set: 100-1000 Examples (Representative)

Purpose: Quick evaluation during development
Size: 100-1000 examples
Quality: High (manually curated)
Coverage: Representative of production

Test Set: 500-5000 Examples (Comprehensive)

Purpose: Final evaluation before deployment
Size: 500-5000 examples
Quality: High (gold standard)
Coverage: Comprehensive (all categories, edge cases)

Quality > Quantity

Better: 100 high-quality examples
Worse: 1000 low-quality examples

Cover Edge Cases

Include:
- Common cases (80%)
- Edge cases (15%)
- Adversarial cases (5%)

Dataset Maintenance

Version Control (Like Code)

bash

# Git for dataset versioning
git init
git add dataset.jsonl
git commit -m "Initial dataset v1.0"

# Tag versions
git tag v1.0

# Update dataset
git add dataset.jsonl
git commit -m "Added 100 new examples"
git tag v1.1

Regular Updates (New Examples)

Monthly: Add 50-100 new examples
Quarterly: Major update (500+ examples)

Remove Outdated Examples

Examples that are:
- No longer relevant
- Incorrect (facts changed)
- Duplicates

Track Changes (Changelog)

markdown

# Dataset Changelog

## v1.2 (2024-02-01)
- Added 100 new examples (geography category)
- Removed 20 outdated examples
- Fixed 5 incorrect labels

## v1.1 (2024-01-01)
- Added 50 new examples (science category)
- Updated annotation guidelines

## v1.0 (2023-12-01)
- Initial release (500 examples)

Stratified Sampling

Balance by Difficulty

Easy: 40%
Medium: 40%
Hard: 20%

Balance by Category

Geography: 25%
Science: 25%
History: 25%
Math: 25%

Include Edge Cases

Common cases: 80%
Edge cases: 15%
Adversarial: 5%

Representative of Production

Sample from actual production queries
Ensures dataset matches real usage

Synthetic Ground Truth

LLM-Generated Questions + Answers

python

def generate_synthetic_qa(document):
    prompt = f"""
    Document: {document}
    
    Generate 5 question-answer pairs based on this document.
    
    Format:
    Q1: [question]
    A1: [answer]
    ...
    """
    
    response = llm.generate(prompt)
    qa_pairs = parse_qa_pairs(response)
    return qa_pairs

Careful Validation Needed

LLM-generated data can have:
- Hallucinations
- Incorrect facts
- Biased questions

→ Always validate with humans

Good for Scale, Risky for Quality

Pros: Can generate 1000s quickly
Cons: Quality varies, needs validation

Use for Augmentation, Not Sole Source

Strategy:
- 80% human-annotated (high quality)
- 20% synthetic (validated)

Domain-Specific Ground Truth

Medical: Expert Annotations

Annotators: Licensed doctors
Cost: $50-100 per hour
Quality: Very high
Use case: Medical diagnosis, treatment recommendations

Legal: Lawyer Review

Annotators: Licensed lawyers
Cost: $100-300 per hour
Quality: Very high
Use case: Legal document analysis, case law

Technical: Engineer Verification

Annotators: Senior engineers
Cost: $50-150 per hour
Quality: High
Use case: Code review, technical Q&A

Ground Truth Storage

JSON/JSONL Files

jsonl

{"id": "1", "question": "What is 2+2?", "answer": "4"}
{"id": "2", "question": "Capital of France?", "answer": "Paris"}

Database (PostgreSQL, MongoDB)

sql

CREATE TABLE ground_truth (
  id UUID PRIMARY KEY,
  question TEXT NOT NULL,
  answer TEXT NOT NULL,
  category VARCHAR(50),
  difficulty VARCHAR(20),
  created_at TIMESTAMP DEFAULT NOW()
);

Version Control (Git)

bash

git add dataset/
git commit -m "Update ground truth dataset"
git push

Cloud Storage (S3 + Versioning)

bash

# Upload to S3 with versioning
aws s3 cp dataset.jsonl s3://my-bucket/ground-truth/v1.0/dataset.jsonl
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled

Ground Truth for RAG

Structure:

json

{
  "question": "What is the capital of France?",
  "expected_answer": "Paris",
  "relevant_document_chunks": [
    "Paris is the capital and largest city of France."
  ],
  "evaluation_criteria": {
    "faithfulness": "Answer must be grounded in context",
    "relevance": "Answer must directly address question",
    "completeness": "Answer should mention Paris"
  }
}

Evaluation with Ground Truth

Exact Match Accuracy

python

def exact_match(predicted, ground_truth):
    return predicted.strip().lower() == ground_truth.strip().lower()

accuracy = sum(exact_match(p, gt) for p, gt in zip(predicted, ground_truth)) / len(predicted)

F1 Score (For Overlapping Spans)

python

def f1_score(predicted, ground_truth):
    pred_tokens = set(predicted.lower().split())
    gt_tokens = set(ground_truth.lower().split())
    
    common = pred_tokens & gt_tokens
    if len(pred_tokens) == 0 or len(gt_tokens) == 0:
        return 0
    
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gt_tokens)
    
    if precision + recall == 0:
        return 0
    
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

BLEU/ROUGE (For Generation)

python

from nltk.translate.bleu_score import sentence_bleu

reference = [["Paris", "is", "the", "capital"]]
candidate = ["Paris", "is", "the", "capital"]

bleu = sentence_bleu(reference, candidate)

Semantic Similarity (Embedding Distance)

python

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

emb1 = model.encode("Paris is the capital of France")
emb2 = model.encode("The capital of France is Paris")

similarity = cosine_similarity([emb1], [emb2])[0][0]

Continuous Ground Truth

Production Feedback (User Thumbs Up/Down)

python

# Log user feedback
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "user_feedback": "thumbs_up",
    "timestamp": "2024-01-15T10:00:00Z"
}

# Add to ground truth if positive
if feedback["user_feedback"] == "thumbs_up":
    add_to_ground_truth(feedback["question"], feedback["answer"])

Human Review of Flagged Outputs

User flags answer as incorrect
→ Human reviews
→ If incorrect, add correct answer to ground truth
→ If correct, keep as is

Incrementally Add to Dataset

Monthly: Review 100 flagged examples
Add 50 to ground truth
Update dataset version

Tools

Annotation: Label Studio, Prodigy, CVAT

Label Studio:

bash

pip install label-studio
label-studio start
# Open http://localhost:8080

Prodigy:

bash

pip install prodigy
prodigy textcat.manual dataset_name source.jsonl --label positive,negative

Management: DVC (Data Version Control)

bash

pip install dvc
dvc init
dvc add dataset.jsonl
git add dataset.jsonl.dvc .gitignore
git commit -m "Add dataset"
dvc push

Storage: S3, GCS, Local Files

See "Ground Truth Storage" section

Summary

Ground Truth: Correct answers for evaluation

Why:

Measure accuracy objectively
Train/validate models
Regression testing
Benchmarking

Types:

Exact match
Multiple acceptable answers
Rubric-based
Human preference

Creating:

Manual annotation
Expert review
Crowdsourcing
Synthetic (with validation)

Quality Control:

Multiple annotators
Inter-annotator agreement (κ > 0.7)
Gold standard subset
Expert spot checks

Dataset Size:

Eval: 100-1000 (representative)
Test: 500-5000 (comprehensive)
Quality > quantity

Maintenance:

Version control (Git)
Regular updates
Remove outdated
Changelog

Tools:

Annotation: Label Studio, Prodigy
Management: DVC
Storage: S3, GCS, Git

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/ground-truth-management
License: MIT License

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Ground Truth Management

What is Ground Truth?

Example

Why Ground Truth Matters

Measure Accuracy Objectively

Train and Validate Models

Regression Testing

Benchmarking

Types of Ground Truth

Exact Match: Single Correct Answer

Multiple Acceptable Answers

Rubric-Based: Quality Scale

Human Preference: Comparison Rankings

Creating Ground Truth

Manual Annotation (Humans Label)

Expert Review (For Specialized Domains)

Crowdsourcing (Amazon MTurk)

Synthetic Generation (For Some Tasks)

Ground Truth Dataset Structure

Input (Question, Document, Image)

Expected Output (Answer, Label, Summary)

Metadata (Difficulty, Category, Source)

Annotation Info (Who, When, Confidence)

Annotation Guidelines

Clear Instructions

Examples (Good and Bad)

Edge Case Handling

Consistency Checks

Quality Control

Multiple Annotators Per Example

Inter-Annotator Agreement (IAA)

Gold Standard Subset (Known Answers)

Spot Checks by Experts

Inter-Annotator Agreement

Kappa Score (Cohen's κ)

Fleiss' κ (Multiple Annotators)

Percentage Agreement

Target: >0.7 (Good Agreement)

Resolving Disagreements

Majority Vote

Expert Adjudication

Discussion and Consensus

Update Guidelines

Ground Truth for Different Tasks

Classification: Category Labels

Q&A: Correct Answers + Acceptable Variants

Summarization: Reference Summaries

RAG: Question + Context + Answer

Generation: Multiple Acceptable Outputs

Dataset Size

Evaluation Set: 100-1000 Examples (Representative)

Test Set: 500-5000 Examples (Comprehensive)

Quality > Quantity

Cover Edge Cases

Dataset Maintenance

Version Control (Like Code)

Regular Updates (New Examples)

Remove Outdated Examples

Track Changes (Changelog)

Stratified Sampling

Balance by Difficulty

Balance by Category

Include Edge Cases

Representative of Production

Synthetic Ground Truth

LLM-Generated Questions + Answers

Careful Validation Needed

Good for Scale, Risky for Quality

Use for Augmentation, Not Sole Source

Domain-Specific Ground Truth

Medical: Expert Annotations

Legal: Lawyer Review

Technical: Engineer Verification

Ground Truth Storage

JSON/JSONL Files

Database (PostgreSQL, MongoDB)

Version Control (Git)