Agent skill

evaluation

Use when creating or updating agent evaluation suites. Defines eval structure, rubrics, and validation patterns.

Install this agent skill to your Project

npx add-skill https://github.com/craigtkhill/stdd-agents/tree/main/skills/evaluation

SKILL.md

Evaluation Skill

Guidelines for creating comprehensive evaluation suites.

When to Use This Skill

Use this skill when:

  • Creating a NEW evaluation suite for a feature
  • Updating an EXISTING evaluation suite
  • Understanding the evaluation framework patterns
  • Writing spec.yaml, rubric.md, or evaluation files

Evaluation Framework Overview

All evaluations in evals/ follow a consistent structure that combines ground truth, code-based, and LLM-as-judge validation.
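
One way to keep that structure consistent is a small layout check. The sketch below assumes each suite lives in its own subdirectory of evals/ and that both spec.yaml and rubric.md are expected in every suite; neither assumption is stated by this skill, and `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed file set: this skill names spec.yaml and rubric.md,
# but does not say that every suite must contain both.
EXPECTED_FILES = ("spec.yaml", "rubric.md")


def missing_files(evals_root, expected=EXPECTED_FILES):
    """Map each suite directory under evals_root to the expected files it lacks."""
    report = {}
    for suite in sorted(Path(evals_root).iterdir()):
        if not suite.is_dir():
            continue  # ignore loose files at the top level
        absent = [name for name in expected if not (suite / name).is_file()]
        if absent:
            report[suite.name] = absent
    return report
```

Calling `missing_files("evals")` returns an empty dict when every suite directory contains the expected files. Pass a different `expected` tuple if a suite has no LLM-judged requirements and therefore no rubric.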

spec.yaml Template

Use this template for all eval spec.yaml files:

```yaml
feature:
  name: "[Feature Name] Evaluation"
  as_a: evaluator
  i_want: validate feature behavior
  solutions:
    - Ground truth validation
    - Code-based validation
    - LLM-as-judge validation

requirements:
  - id: REQ-EVAL-XX-001
    eval: G
    description: Description of ground truth requirement
  - id: REQ-EVAL-XX-002
    eval: C
    description: Description of code-based requirement
  - id: REQ-EVAL-XX-003
    eval: L
    description: Description of LLM-judged requirement
  - id: REQ-EVAL-XX-004
    eval: O
    description: Description of planned requirement
```

Template Rules:

  • Identifier Format: REQ-EVAL-XX-NNN
    • XX = 2-3 letter eval abbreviation (e.g., AG for action_generation, AS for action_scenarios)
    • NNN = Sequential 3-digit number starting at 001
  • Implementation Types:
    • [G] = Ground truth validation (matches expected output)
    • [C] = Code-based validation (deterministic checks)
    • [L] = LLM-as-judge validation (quality assessment)
    • [O] = Not yet implemented (planned for future)
  • Categories: Group related requirements logically
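
The identifier and implementation-type rules above are mechanical enough to check in code. A minimal sketch, assuming the `requirements` list has already been loaded from spec.yaml into plain dicts, that abbreviations are uppercase, and that "sequential" means no gaps in file order; `check_requirements` and the sample descriptions are hypothetical, not part of this skill.

```python
import re

# REQ-EVAL-XX-NNN: XX = 2-3 letters (uppercase assumed), NNN = 3 digits.
REQ_ID = re.compile(r"^REQ-EVAL-([A-Z]{2,3})-(\d{3})$")
EVAL_TYPES = {"G", "C", "L", "O"}


def check_requirements(requirements):
    """Return a list of problems; an empty list means the rules are satisfied."""
    problems = []
    abbreviations = set()
    for position, req in enumerate(requirements, start=1):
        req_id = req.get("id", "")
        match = REQ_ID.match(req_id)
        if match is None:
            problems.append(f"{req_id!r}: id does not match REQ-EVAL-XX-NNN")
        else:
            abbreviations.add(match.group(1))
            # Assumption: numbers follow file order with no gaps, starting at 001.
            if int(match.group(2)) != position:
                problems.append(f"{req_id}: expected sequence number {position:03d}")
        if req.get("eval") not in EVAL_TYPES:
            problems.append(f"{req_id}: eval must be one of G, C, L, O")
        if not str(req.get("description", "")).strip():
            problems.append(f"{req_id}: description is empty")
    if len(abbreviations) > 1:
        problems.append(f"mixed abbreviations: {sorted(abbreviations)}")
    return problems


# Hypothetical sample data: the second entry skips 002 and uses an unknown type.
requirements = [
    {"id": "REQ-EVAL-AG-001", "eval": "G", "description": "Output matches the expected action list"},
    {"id": "REQ-EVAL-AG-003", "eval": "X", "description": "Planned check"},
]
for problem in check_requirements(requirements):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a single run report every violation in a spec.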

rubric.md Template

Use this template for all rubric.md files:

```markdown
# [Feature Name] Reasoning Trace Rubric

## Format
`[PASS/FAIL] RUBRIC-ID: Criterion description`

## Based on: [Concrete example with specific values]

### [Category Name]
- [ ] RUB-XX-001: Specific, objective criterion
- [ ] RUB-XX-002: Another specific criterion
```

Template Rules:

  • Identifier Format: RUB-XX-NNN (XX matches the abbreviation used in spec.yaml)
  • Categories: Organize criteria into logical groups
  • Criteria: Write concrete, objectively verifiable rules, not subjective assessments
  • Specificity: Reference actual values, fields, or behaviors that can be checked
  • Checkboxes: Use - [ ] format for LLM judge to mark pass/fail
  • Avoid subjective language: Do not use vague terms; state exactly what to verify
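
Because the rubric uses fixed line formats, a judge's verdicts can be matched back to the rubric mechanically. A minimal sketch, assuming the judge replies with one `[PASS]` or `[FAIL]` line per criterion as described in the Format section; `rubric_ids`, `score`, and the sample criteria are hypothetical.

```python
import re

# Checkbox form used in rubric.md: "- [ ] RUB-XX-001: criterion"
CHECKBOX = re.compile(r"^- \[ \] (RUB-[A-Z]{2,3}-\d{3}): (.+)$")
# Verdict form from the Format section: "[PASS] RUB-XX-001: criterion"
VERDICT = re.compile(r"^\[(PASS|FAIL)\] (RUB-[A-Z]{2,3}-\d{3}): (.+)$")


def rubric_ids(rubric_text):
    """Collect criterion ids from the checkbox lines of a rubric.md body."""
    ids = []
    for line in rubric_text.splitlines():
        match = CHECKBOX.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


def score(rubric_text, judge_text):
    """Map each rubric id to True (PASS), False (FAIL), or None (no verdict given)."""
    results = {rubric_id: None for rubric_id in rubric_ids(rubric_text)}
    for line in judge_text.splitlines():
        match = VERDICT.match(line.strip())
        # Verdicts for ids that are not in the rubric are ignored.
        if match and match.group(2) in results:
            results[match.group(2)] = match.group(1) == "PASS"
    return results


# Hypothetical rubric and judge reply for illustration.
rubric = """### Output shape
- [ ] RUB-AG-001: Response contains exactly 3 actions
- [ ] RUB-AG-002: Every action has a non-empty `owner` field
"""
judge = "[PASS] RUB-AG-001: Response contains exactly 3 actions\n"
print(score(rubric, judge))
```

Keeping `None` distinct from `False` separates a criterion the judge skipped from one it actually failed.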
