Agent skill

eval-run

Evaluate any output file against a structured evals.yaml assertions file and produce a score report with per-assertion pass/fail results. Activate when the Discovery Agent runs the Skill Optimize protocol to measure output quality or detect regressions after skill instruction changes.



Install this agent skill to your Project

npx add-skill https://github.com/Fr-e-d/GAAI-framework/tree/main/.gaai/core/skills/cross/eval-run

Metadata

Additional technical details for this skill

id: SKILL-CRS-025
track: cross-cutting
author: gaai-framework
status: experimental
version: 1.0
category: cross
updated at: 1773532800

SKILL.md

Eval Run

Purpose / When to Activate

Activate when:

- The Discovery Agent runs the Skill Optimize protocol and needs to score a skill output
- A skill's instructions have been modified and a before/after quality comparison is needed
- A baseline score is being established for a skill that has never been evaluated

This skill is generic: it accepts any output file and any evals.yaml, regardless of skill domain.

It follows the GAAI principle "skills never chain" — it evaluates the output it receives; it does not invoke the skill that produced the output.

Process

Step 1 — Load inputs

1. Read the output_file path. Confirm the file exists and is non-empty. If missing: FAIL immediately with error "output_file not found: {path}".
2. Read the evals_file path. Confirm the file exists and is valid YAML. If missing: FAIL immediately with error "evals_file not found: {path}".
3. Parse the evals.yaml structure. Validate:
   - skill, version, description, and assertions fields are present
   - assertions list is non-empty
   - Each assertion has id, type, and description fields
   - If any required field is missing: FAIL with error "evals.yaml validation error: {details}"

For the full evals.yaml format spec, see references/evals-format.md.
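The structural validation in Step 1 can be sketched in Python as below. This is a minimal sketch, not the skill's actual implementation: it assumes the YAML has already been parsed into a plain dict by some YAML library, and the names `validate_evals` and `validation_error` are hypothetical. Only the required field names and the error prefix come from the step above.

```python
REQUIRED_TOP_LEVEL = ("skill", "version", "description", "assertions")
REQUIRED_PER_ASSERTION = ("id", "type", "description")


def validate_evals(data):
    """Return a list of validation problems; an empty list means the structure is valid."""
    if not isinstance(data, dict):
        return ["top level must be a mapping"]
    problems = [f"missing field: {name}" for name in REQUIRED_TOP_LEVEL if name not in data]
    if "assertions" in data:
        assertions = data["assertions"]
        if not isinstance(assertions, list) or not assertions:
            problems.append("assertions must be a non-empty list")
        else:
            for index, assertion in enumerate(assertions):
                if not isinstance(assertion, dict):
                    problems.append(f"assertions[{index}] must be a mapping")
                    continue
                for name in REQUIRED_PER_ASSERTION:
                    if name not in assertion:
                        problems.append(f"assertions[{index}] missing field: {name}")
    return problems


def validation_error(problems):
    """Format the problems using the error prefix given in Step 1."""
    return "evals.yaml validation error: " + "; ".join(problems)
```

Collecting every problem before failing, rather than stopping at the first one, is a design choice of this sketch; it lets the `{details}` part of the error name all missing fields at once.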

Step 2 — Run `code` assertions

For each assertion where type: code:

Read the check field. Execute the corresponding mechanical verification:

| `check` | Verification method |
|---------|---------------------|
| `word_count` | Count whitespace-separated tokens in the output file. Compare against `params.min` and `params.max`. |
| `char_count` | Count all characters in the output file. Compare against `params.min` and `params.max`. |
| `regex_match` | Apply `params.pattern` as a regex to the full output text. PASS if at least one match found. |
| `regex_not_match` | Apply `params.pattern` as a regex to the full output text. PASS if zero matches found. |
| `structure_present` | Search the output text for the literal string `params.marker`. PASS if found. |
| `structure_absent` | Search the output text for the literal string `params.marker`. PASS if NOT found. |

Record the result:
- PASS: record PASS together with the measured value (e.g., word count = 1247)
- FAIL: record FAIL together with the measured value and the expected condition
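The six `code` checks can be sketched as a single dispatch function. This is an illustrative sketch under stated assumptions: the function name `run_code_check` is hypothetical, bounds are treated as inclusive, a missing `min` or `max` is treated as unbounded, and regexes run with Python's default flags. None of those details are specified in this skill; see references/evals-format.md for the authoritative format.

```python
import re


def run_code_check(check, params, text):
    """Return (result, details) for one `code` assertion against the output text."""
    if check in ("word_count", "char_count"):
        # word_count counts whitespace-separated tokens; char_count counts all characters.
        measured = len(text.split()) if check == "word_count" else len(text)
        unit = "words" if check == "word_count" else "characters"
        low = params.get("min", 0)  # assumption: missing min means no lower bound
        high = params.get("max", float("inf"))  # assumption: missing max means no upper bound
        result = "PASS" if low <= measured <= high else "FAIL"
        return result, f"{measured} {unit} (range: {low}-{high})"
    if check in ("regex_match", "regex_not_match"):
        matches = sum(1 for _ in re.finditer(params["pattern"], text))
        passed = matches > 0 if check == "regex_match" else matches == 0
        return ("PASS" if passed else "FAIL"), f"{matches} matches found"
    if check in ("structure_present", "structure_absent"):
        # Literal substring search, not a regex.
        found = params["marker"] in text
        passed = found if check == "structure_present" else not found
        return ("PASS" if passed else "FAIL"), f"marker {'found' if found else 'not found'}"
    # Unsupported checks are reported explicitly, never skipped.
    return "ERROR", f"unsupported check: {check}"
```

Every branch returns a details string alongside the verdict, so the measured value is always available for the report.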

Step 3 — Run `llm-judge` assertions

For each assertion where type: llm-judge:

Construct the evaluation prompt:

{assertion.prompt}

---
OUTPUT TO EVALUATE:
{full content of output_file}

Submit the prompt. Parse the response for a binary verdict: PASS or FAIL.
Extract the one-sentence explanation from the response.
Record the result:
- PASS: record PASS together with the LLM's explanation
- FAIL: record FAIL together with the LLM's explanation
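Prompt construction and verdict parsing for Step 3 might look like the sketch below. The call to the judging model is left out because it depends on the host agent. The names `build_judge_prompt` and `parse_verdict` are hypothetical, and the parser assumes the judge writes an uppercase PASS or FAIL followed by its explanation on the same line; a response with no verdict is reported as ERROR rather than guessed.

```python
import re


def build_judge_prompt(assertion_prompt, output_text):
    """Join the assertion prompt and the output using the template from Step 3."""
    return f"{assertion_prompt}\n\n---\nOUTPUT TO EVALUATE:\n{output_text}"


def parse_verdict(response):
    """Return (result, explanation); ERROR when no PASS/FAIL verdict is present."""
    match = re.search(r"\b(PASS|FAIL)\b", response)
    if match is None:
        return "ERROR", "no PASS or FAIL verdict found in judge response"
    # Drop separators such as ":" or "-" between the verdict and the explanation.
    rest = response[match.end():].lstrip(" \t\r\n:-")
    # Keep only the first line as the one-sentence explanation.
    explanation = rest.splitlines()[0].strip() if rest.strip() else ""
    return match.group(1), explanation
```

For example, `parse_verdict("PASS: The post opens with a clear hook.")` returns `("PASS", "The post opens with a clear hook.")`.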

Step 4 — Compile score report

After all assertions are evaluated, compile the score report:

1. Count total assertions run and total assertions passed.
2. List all failed assertions with their IDs, descriptions, and failure details.
3. Produce the structured output (see Outputs section).
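The counting in Step 4 reduces to a few lines. In this sketch the function name `compile_report` is hypothetical, and each result is assumed to be a dict with `id`, `type`, `description`, `result`, and `details` keys. Treating ERROR results as part of the total and listing them alongside failures is also an assumption of the sketch, chosen so that nothing is dropped silently.

```python
def compile_report(results):
    """Summarise per-assertion results into score counts and a failure list."""
    passed = sum(1 for r in results if r["result"] == "PASS")
    total = len(results)
    # Assumption: anything that is not PASS (FAIL or ERROR) is surfaced here.
    not_passed = [r for r in results if r["result"] != "PASS"]
    return {
        "score": {"passed": passed, "total": total, "ratio": f"{passed}/{total}"},
        "results": results,
        "failed_assertions": not_passed,
    }
```

The returned dict mirrors the `score`, `results`, and `failed_assertions` keys of the YAML report format, so it can be handed to whichever serializer the invoking agent uses.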

Quality Checks

- Every assertion in the evals.yaml is evaluated — no assertion is skipped silently
- Each assertion result records its measured value or LLM rationale, not just PASS/FAIL
- The total score is expressed as N/total (e.g., 4/5)
- Failed assertions are listed with enough detail to understand what was measured and why it failed
- The score report is structured such that an agent can parse it programmatically (not free prose)
- If any assertion has an unsupported check value: report as ERROR, do not skip silently

Outputs

The skill produces a score report in the following structured Markdown format:

```markdown
# Eval Report: {skill name} — {evals.yaml version}

**Output file:** {output_file path}
**Evals file:** {evals_file path}
**Run date:** {ISO 8601 date}
**Score:** {N}/{total} assertions passed

---

## Results

| ID | Type | Description | Result | Details |
|----|------|-------------|--------|---------|
| A01 | code | Word count within ±15% of target | PASS | 1247 words (range: 1020–1380) |
| A02 | code | Kill list word 'leverage' absent | FAIL | 2 matches found |
| A03 | llm-judge | Post stands alone without prior context | PASS | "The post opens with a clear hook and requires no prior context to understand." |

---

## Failed Assertions

### A02 — Kill list word 'leverage' absent
- **Type:** code
- **Check:** regex_not_match
- **Pattern:** `\bleverag(e|ing|ed)\b`
- **Result:** FAIL — 2 matches found at positions [line 4, line 11]
```

The score report may also be emitted as structured YAML if the invoking agent requires machine-readable output:

```yaml
eval_report:
  skill: content-draft
  evals_version: "1.0"
  output_file: {path}
  evals_file: {path}
  run_date: {ISO 8601}
  score:
    passed: 4
    total: 5
    ratio: "4/5"
  results:
    - id: A01
      type: code
      description: "Word count within ±15% of target"
      result: PASS
      details: "1247 words (range: 1020–1380)"
    - id: A02
      type: code
      description: "Kill list word 'leverage' absent"
      result: FAIL
      details: "2 matches found"
  failed_assertions:
    - id: A02
      description: "Kill list word 'leverage' absent"
      type: code
      check: regex_not_match
      pattern: "\\bleverag(e|ing|ed)\\b"
      details: "2 matches found at positions [line 4, line 11]"
```
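Going from per-assertion results to the Results table of the Markdown report is a simple rendering step. The sketch below assumes each result is a dict with the five column fields shown above; `render_results_table` is a hypothetical helper name. It escapes pipe characters, since regex patterns such as the `leverage` one contain `|` and would otherwise split a cell.

```python
def render_results_table(results):
    """Render per-assertion results as the Markdown table used in the report."""
    lines = [
        "| ID | Type | Description | Result | Details |",
        "|----|------|-------------|--------|---------|",
    ]
    for r in results:
        cells = [r["id"], r["type"], r["description"], r["result"], r["details"]]
        # Escape pipes so free-text details cannot break the table layout.
        escaped = [str(cell).replace("|", "\\|") for cell in cells]
        lines.append("| " + " | ".join(escaped) + " |")
    return "\n".join(lines)
```

An empty result list still yields the header and separator rows, which keeps the report parseable even when validation stops before any assertion runs.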

Non-Goals

This skill must NOT:

- Modify the output file being evaluated
- Modify the source skill whose output is being evaluated
- Invoke any other skill (skills never chain)
- Make recommendations about what to change in the skill or its output
- Generate an evals.yaml file (that is agent work in the Skill Optimize protocol)
- Compare scores across multiple runs (that is agent orchestration)
- Propose a verdict on whether the skill should be updated (that is a human decision)

No silent skips. Every assertion produces an explicit PASS, FAIL, or ERROR result.

Maintainer

Fr-e-d Core maintainer

Source details

Full Name: Fr-e-d/GAAI-framework
Branch: main
Path in repo: .gaai/core/skills/cross/eval-run
License: Other
Topics: claude-code ai-agents ai-coding codex-cli cursor gemini-cli agentic-coding context-engineering vibe-coding opencode autonomous-agents devtools windsurf ai-governance ai-developer-tools ai-memory-system
