grey-haven-evaluation

Evaluate LLM outputs with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems. Use when testing prompts, validating outputs, comparing models, or when user mentions 'evaluation', 'testing LLM', 'rubric', 'LLM-as-judge', 'output quality', 'prompt testing', or 'model comparison'.

Install this agent skill to your Project

npx add-skill https://github.com/greyhaven-ai/claude-code-config/tree/main/grey-haven-plugins/core/skills/evaluation

SKILL.md

Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.

Core Insight: The 95% Variance Finding

Research shows 95% of output variance comes from just two sources:

80% from prompt tokens (wording, structure, examples)
15% from random seed/sampling

Temperature, model version, and all other factors account for the remaining 5%.

Implication: Focus evaluation on prompt quality, not model tweaking.

What's Included

Examples (`examples/`)

Prompt comparison - A/B testing prompts with rubrics
Model evaluation - Comparing outputs across models
Regression testing - Detecting output degradation

Reference Guides (`reference/`)

Rubric design - Multi-dimensional evaluation criteria
LLM-as-judge - Using LLMs to evaluate LLM outputs
Statistical methods - Handling non-determinism

Templates (`templates/`)

Rubric templates - Ready-to-use evaluation criteria
Judge prompts - LLM-as-judge prompt templates
Test case format - Structured test case templates

Checklists (`checklists/`)

Evaluation setup - Before running evaluations
Rubric validation - Ensuring rubric quality

Key Concepts

1. Multi-Dimensional Rubrics

Don't use single scores. Break down evaluation into dimensions:

| Dimension | Weight | Criteria |
|---|---|---|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |
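As a rough illustration of how the weights above might combine into one number, here is a minimal sketch. The weights come from the table; the `weighted_score` function and the example scores are assumptions made for illustration, not part of the skill.

```python
# Weights mirror the rubric table; the scoring helper itself is a hypothetical sketch.
RUBRIC_WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "conciseness": 0.15,
    "format": 0.10,
}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, float] = RUBRIC_WEIGHTS) -> float:
    """Combine per-dimension scores (1-5) into one weighted score on the same scale."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Dividing by the total keeps the result on the 1-5 scale even if weights drift from 1.0.
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Made-up scores for a single output.
example = {"accuracy": 5, "completeness": 4, "clarity": 4, "conciseness": 3, "format": 5}
print(round(weighted_score(example), 2))
```

Keeping the per-dimension scores alongside the combined value preserves the point of the rubric: the single number is only a summary.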

2. Handling Non-Determinism

LLMs are non-deterministic. Handle with:

Strategy 1: Multiple Runs
- Run same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases
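Strategy 1 could be sketched roughly as follows. The `summarize_runs` helper and the `variance_threshold` default of 0.5 are assumptions chosen for illustration; a suitable threshold depends on the scoring scale.

```python
from statistics import mean, variance


def summarize_runs(scores: list[float], variance_threshold: float = 0.5) -> dict:
    """Summarize repeated runs of one prompt and flag unstable cases.

    variance_threshold is a hypothetical cutoff, not a value from the skill.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    var = variance(scores)  # sample variance (n - 1 denominator)
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "variance": var,
        "high_variance": var > variance_threshold,
    }


# Made-up scores from repeated runs of two prompts.
stable = summarize_runs([4.0, 4.2, 4.1])
unstable = summarize_runs([2.0, 4.5, 3.0, 5.0])
print(stable["high_variance"], unstable["high_variance"])
```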

Strategy 2: Seed Control
- Set temperature=0 and a fixed seed (where the API supports one) to reduce run-to-run variation
- Document the seed for debugging
- Accept that some variation is normal

Strategy 3: Statistical Significance
- Use paired comparisons
- Require a 70%+ win rate across paired comparisons before calling a variant "better"
- Report confidence intervals
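One way to combine the win-rate threshold with a confidence interval is sketched below. The Wilson score interval is a choice made for this example, not something the skill prescribes, and the extra requirement that the lower bound exceed 50% is likewise an assumption. The function names are hypothetical.

```python
import math


def win_rate_interval(wins: int, losses: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (win_rate, lower, upper) using the Wilson score interval.

    Ties are assumed to be dropped by the caller before counting.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("no decisive comparisons")
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half, centre + half


def is_better(wins: int, losses: int, min_win_rate: float = 0.70) -> bool:
    """Require the 70%+ win rate and (as an added assumption) a lower bound above 50%."""
    rate, lower, _ = win_rate_interval(wins, losses)
    return rate >= min_win_rate and lower > 0.5


print(is_better(40, 10))  # larger sample
print(is_better(7, 3))    # same threshold met, but only ten pairs
```

With only ten pairs, a 70% win rate is still compatible with a coin flip, which is why the sketch looks at the interval as well as the point estimate.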

3. LLM-as-Judge Pattern

Use a judge LLM to evaluate outputs:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│  Test LLM   │────▶│   Output    │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Rubric    │────▶│ Judge LLM   │
                    └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │   Score     │
                                        └─────────────┘

Best Practice: Use a stronger model as the judge (for example, Opus judging Sonnet).
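The flow in the diagram might look roughly like this in code. Everything here is a hypothetical sketch: `JUDGE_TEMPLATE`, `judge_output`, and the JSON reply format are assumptions, and the judge model is passed in as a plain callable so that no particular provider API is implied.

```python
import json
from typing import Callable

# Hypothetical judge prompt; the skill's own templates live under templates/.
JUDGE_TEMPLATE = """You are grading a model output against a rubric.
Rubric dimensions: {dimensions}
Output to grade:
{output}
Reply with JSON only, mapping each dimension to an integer score from 1 to 5."""


def judge_output(output: str, dimensions: list[str],
                 call_judge: Callable[[str], str]) -> dict[str, int]:
    """Ask a judge model for per-dimension scores and validate its reply."""
    prompt = JUDGE_TEMPLATE.format(dimensions=", ".join(dimensions), output=output)
    scores = json.loads(call_judge(prompt))
    if not isinstance(scores, dict):
        raise ValueError("judge reply must be a JSON object")
    result = {}
    for name in dimensions:
        value = scores.get(name)
        # Reject missing, non-integer, boolean, or out-of-range scores.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"judge returned invalid score for {name!r}: {value!r}")
        result[name] = value
    return result


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model call, used only to exercise the sketch."""
    return json.dumps({"accuracy": 4, "clarity": 5})


print(judge_output("def reverse(s): return s[::-1]", ["accuracy", "clarity"], fake_judge))
```

Validating the reply matters because a judge model can return malformed or out-of-range scores like any other LLM output.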

4. Test Case Design

Structure test cases with:

typescript

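// RubricItem is not defined in the original; this shape is an assumption
// based on the rubric table (dimension, weight, criteria).
interface RubricItem {
  name: string
  weight: number             // Fraction of the total score, e.g. 0.3
  criteria: string
}

interface TestCase {
  id: string
  input: string              // User message or context
  expectedBehavior: string   // What output should do
  rubric: RubricItem[]       // Evaluation criteria
  groundTruth?: string       // Optional gold standard
  metadata: {
    category: string
    difficulty: 'easy' | 'medium' | 'hard'
    createdAt: string
  }
}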

Evaluation Workflow

Step 1: Define Rubric

yaml

rubric:
  dimensions:
    - name: accuracy
      weight: 0.3
      criteria:
        5: "Completely accurate, no errors"
        4: "Minor errors, doesn't affect correctness"
        3: "Some errors, partially correct"
        2: "Significant errors, mostly incorrect"
        1: "Completely incorrect or hallucinated"
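Before using a rubric like the one above, it may help to check it mechanically. The sketch below assumes the YAML has already been parsed into a dictionary with the same keys; `validate_rubric` and its two checks (weights sum to 1.0, every dimension defines levels 1 to 5) are illustrative assumptions, not the skill's validation checklist.

```python
import math


def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    dimensions = rubric.get("dimensions", [])
    if not dimensions:
        return ["rubric has no dimensions"]
    total = sum(d.get("weight", 0) for d in dimensions)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"weights sum to {total:.2f}, expected 1.00")
    for d in dimensions:
        levels = set(d.get("criteria", {}))
        if levels != {1, 2, 3, 4, 5}:
            problems.append(f"{d.get('name', '?')}: criteria must define levels 1-5")
    return problems


# Placeholder level descriptions, used only to exercise the checks.
levels = {n: f"level {n}" for n in range(1, 6)}
complete = {"dimensions": [
    {"name": "accuracy", "weight": 0.3, "criteria": levels},
    {"name": "completeness", "weight": 0.25, "criteria": levels},
    {"name": "clarity", "weight": 0.2, "criteria": levels},
    {"name": "conciseness", "weight": 0.15, "criteria": levels},
    {"name": "format", "weight": 0.1, "criteria": levels},
]}
partial = {"dimensions": [{"name": "accuracy", "weight": 0.3, "criteria": levels}]}
print(validate_rubric(complete))
print(validate_rubric(partial))
```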

Step 2: Create Test Cases

yaml

test_cases:
  - id: "code-gen-001"
    input: "Write a function to reverse a string"
    expected_behavior: "Returns working reverse function"
    ground_truth: |
      function reverse(s: string): string {
        return s.split('').reverse().join('')
      }
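Once the YAML above has been parsed into plain dictionaries, a runner might load it along these lines. The `EvalCase` dataclass and `parse_test_cases` are hypothetical names; the field names follow the YAML keys shown above, and the required-field and duplicate-id checks are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """Python mirror of one entry under test_cases (illustrative only)."""
    id: str
    input: str
    expected_behavior: str
    ground_truth: Optional[str] = None
    metadata: dict = field(default_factory=dict)


REQUIRED_FIELDS = ("id", "input", "expected_behavior")


def parse_test_cases(raw_cases: list[dict]) -> list[EvalCase]:
    """Validate already-parsed test case dictionaries and convert them to EvalCase objects."""
    cases, seen = [], set()
    for raw in raw_cases:
        missing = [key for key in REQUIRED_FIELDS if not raw.get(key)]
        if missing:
            raise ValueError(f"test case {raw.get('id', '?')!r} is missing {missing}")
        if raw["id"] in seen:
            raise ValueError(f"duplicate test case id: {raw['id']!r}")
        seen.add(raw["id"])
        cases.append(EvalCase(
            id=raw["id"],
            input=raw["input"],
            expected_behavior=raw["expected_behavior"],
            ground_truth=raw.get("ground_truth"),
            metadata=raw.get("metadata", {}),
        ))
    return cases


parsed = parse_test_cases([{
    "id": "code-gen-001",
    "input": "Write a function to reverse a string",
    "expected_behavior": "Returns working reverse function",
}])
print(parsed[0].id, parsed[0].ground_truth)
```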

Step 3: Run Evaluation

bash

# Run test suite
python evaluate.py --suite code-generation --runs 3

# Output
# ┌─────────────────────────────────────────────┐
# │ Test Suite: code-generation                 │
# │ Total: 50 | Pass: 47 | Fail: 3              │
# │ Accuracy: 94% (±2.1%)                       │
# │ Avg Score: 4.2/5.0                          │
# └─────────────────────────────────────────────┘
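The internals of `evaluate.py` are not shown here, so the following is only a guess at how a summary like the one above could be assembled from per-case scores. `summarize_suite` and the `pass_threshold` of 3.5 are assumptions.

```python
from statistics import mean


def summarize_suite(case_scores: dict[str, float], pass_threshold: float = 3.5) -> dict:
    """Aggregate per-case scores (1-5) into suite-level counts and averages.

    pass_threshold is a hypothetical cutoff for counting a case as passing.
    """
    total = len(case_scores)
    if total == 0:
        raise ValueError("no test cases were scored")
    passed = sum(1 for score in case_scores.values() if score >= pass_threshold)
    return {
        "total": total,
        "pass": passed,
        "fail": total - passed,
        "accuracy": passed / total,
        "avg_score": mean(list(case_scores.values())),
    }


# Made-up per-case scores.
summary = summarize_suite({"code-gen-001": 4.5, "code-gen-002": 3.0, "code-gen-003": 4.0, "code-gen-004": 5.0})
print(summary)
```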

Step 4: Analyze Results

Look for:

Low-scoring dimensions - Target for improvement
High-variance cases - Prompt needs clarification
Regression from baseline - Investigate changes
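Finding low-scoring dimensions can be automated with a small helper. This is a sketch under stated assumptions: `weakest_dimensions` is a hypothetical function, each result is assumed to be a mapping from dimension name to score, and the 3.5 threshold is arbitrary.

```python
from collections import defaultdict
from statistics import mean


def weakest_dimensions(results: list[dict[str, float]],
                       threshold: float = 3.5) -> list[tuple[str, float]]:
    """Average each rubric dimension across test cases and return those below threshold, worst first."""
    by_dimension = defaultdict(list)
    for scores in results:
        for name, value in scores.items():
            by_dimension[name].append(value)
    averages = {name: mean(values) for name, values in by_dimension.items()}
    low = [(name, avg) for name, avg in averages.items() if avg < threshold]
    return sorted(low, key=lambda item: item[1])


# Made-up per-dimension scores for two test cases.
results = [
    {"accuracy": 5, "clarity": 3, "format": 4},
    {"accuracy": 4, "clarity": 2, "format": 3},
]
print(weakest_dimensions(results))
```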

Grey Haven Integration

With TDD Workflow

1. Write test cases (expected behavior)
2. Run baseline evaluation
3. Modify prompt/implementation
4. Run evaluation again
5. Compare: new scores ≥ baseline?
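Step 5 can be expressed as a small regression check. The sketch below assumes both runs produce a mapping from test case id to score; `find_regressions` and its `tolerance` parameter are hypothetical.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.0) -> dict[str, float]:
    """Return {case_id: score_delta} for cases that dropped by more than tolerance."""
    regressions = {}
    for case_id, old_score in baseline.items():
        if case_id not in current:
            raise KeyError(f"case {case_id!r} is missing from the current run")
        delta = current[case_id] - old_score
        if delta < -tolerance:
            regressions[case_id] = delta
    return regressions


# Made-up scores before and after a prompt change.
baseline = {"code-gen-001": 4.0, "code-gen-002": 3.0}
current = {"code-gen-001": 3.5, "code-gen-002": 3.25}
print(find_regressions(baseline, current))
```

A non-zero `tolerance` may be reasonable given the run-to-run variation discussed earlier; an empty result corresponds to "new scores ≥ baseline".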

With Pipeline Architecture

acquire → prepare → process → parse → render → EVALUATE
                                                  │
                                          ┌───────┴───────┐
                                          │ Compare to    │
                                          │ ground truth  │
                                          │ or rubric     │
                                          └───────────────┘

With Prompt Engineering

Current prompt → Evaluate → Score: 3.2
Apply principles → Improve prompt
New prompt → Evaluate → Score: 4.1 ✓

Use This Skill When

Testing new prompts before production
Comparing prompt variations (A/B testing)
Validating model outputs meet quality bar
Detecting regressions after changes
Building evaluation datasets
Implementing automated quality gates

Related Skills

prompt-engineering - Improve prompts based on evaluation
testing-strategy - Overall testing approaches
llm-project-development - Pipeline with evaluation stage

Quick Start

bash

# Design your rubric
cat templates/rubric-template.yaml

# Create test cases
cat templates/test-case-template.yaml

# Learn LLM-as-judge
cat reference/llm-as-judge-guide.md

# Run evaluation checklist
cat checklists/evaluation-setup-checklist.md

Skill Version: 1.0
Key Finding: 95% variance from prompts (80%) + sampling (15%)
Last Updated: 2025-01-15

Maintainer

greyhaven-ai Core maintainer

Source details

Full Name: greyhaven-ai/claude-code-config
Branch: main
Path in repo: grey-haven-plugins/core/skills/evaluation
License: MIT License
