Agent skill
eval-run
Evaluate any output file against a structured evals.yaml assertions file and produce a score report with per-assertion pass/fail results. Activate when the Discovery Agent runs the Skill Optimize protocol to measure output quality or detect regressions after skill instruction changes.
Install this agent skill to your Project
npx add-skill https://github.com/Fr-e-d/GAAI-framework/tree/main/.gaai/core/skills/cross/eval-run
Metadata
Additional technical details for this skill
- id: SKILL-CRS-025
- track: cross-cutting
- author: gaai-framework
- status: experimental
- version: 1.0
- category: cross
- updated at: 1773532800
SKILL.md
Eval Run
Purpose / When to Activate
Activate when:
- The Discovery Agent runs the Skill Optimize protocol and needs to score a skill output
- A skill's instructions have been modified and a before/after quality comparison is needed
- A baseline score is being established for a skill that has never been evaluated
This skill is generic: it accepts any output file and any evals.yaml, regardless of skill domain.
It follows the GAAI principle "skills never chain" — it evaluates the output it receives; it does not invoke the skill that produced the output.
Process
Step 1 — Load inputs
- Read the output_file path. Confirm the file exists and is non-empty. If missing: FAIL immediately with error "output_file not found: {path}".
- Read the evals_file path. Confirm the file exists and is valid YAML. If missing: FAIL immediately with error "evals_file not found: {path}".
- Parse the evals.yaml structure. Validate:
  - skill, version, description, and assertions fields are present
  - assertions list is non-empty
  - Each assertion has id, type, and description fields
- If any required field is missing: FAIL with error "evals.yaml validation error: {details}"
For the full evals.yaml format spec, see references/evals-format.md.
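The structural checks in Step 1 can be sketched as follows. This is a minimal illustration that assumes the evals file has already been parsed into a Python dict (for example with a YAML parser such as PyYAML's `yaml.safe_load`). The names `validate_evals`, `check_evals`, and `example`, and the example description text, are hypothetical and not part of the skill.

```python
REQUIRED_TOP_LEVEL = ("skill", "version", "description", "assertions")
REQUIRED_PER_ASSERTION = ("id", "type", "description")


def validate_evals(evals):
    """Return a list of validation problems; an empty list means the structure is valid."""
    if not isinstance(evals, dict):
        return ["top level must be a mapping"]
    problems = [f"missing field: {name}" for name in REQUIRED_TOP_LEVEL if name not in evals]
    if "assertions" in evals:
        assertions = evals["assertions"]
        if not isinstance(assertions, list) or not assertions:
            problems.append("assertions list is empty")
        else:
            for index, assertion in enumerate(assertions):
                if not isinstance(assertion, dict):
                    problems.append(f"assertion #{index}: not a mapping")
                    continue
                label = assertion.get("id", f"#{index}")
                for name in REQUIRED_PER_ASSERTION:
                    if name not in assertion:
                        problems.append(f"assertion {label}: missing field: {name}")
    return problems


def check_evals(evals):
    """Raise ValueError using the Step 1 error wording if validation fails."""
    problems = validate_evals(evals)
    if problems:
        raise ValueError("evals.yaml validation error: " + "; ".join(problems))


# Hypothetical parsed evals.yaml content, for illustration only.
example = {
    "skill": "content-draft",
    "version": "1.0",
    "description": "Checks for a drafted post",
    "assertions": [
        {"id": "A01", "type": "code", "description": "Word count within range",
         "check": "word_count", "params": {"min": 1020, "max": 1380}},
    ],
}
```

Collecting every problem before failing, rather than stopping at the first, is a design choice of this sketch: it lets the `{details}` part of the error message name all missing fields at once.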
Step 2 — Run code assertions
For each assertion where type: code:
- Read the check field. Execute the corresponding mechanical verification:

| check | Verification method |
|-------|---------------------|
| word_count | Count whitespace-separated tokens in the output file. Compare against params.min and params.max. |
| char_count | Count all characters in the output file. Compare against params.min and params.max. |
| regex_match | Apply params.pattern as a regex to the full output text. PASS if at least one match found. |
| regex_not_match | Apply params.pattern as a regex to the full output text. PASS if zero matches found. |
| structure_present | Search the output text for the literal string params.marker. PASS if found. |
| structure_absent | Search the output text for the literal string params.marker. PASS if NOT found. |

- Record the result:
  - PASS: the assertion result is PASS with the measured value (e.g., word count = 1247)
  - FAIL: the assertion result is FAIL with the measured value and the expected condition
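The six mechanical checks could be dispatched along these lines. This is a sketch under stated assumptions: the function name `run_code_assertion` is hypothetical, the min/max bounds are treated as inclusive, a missing bound is treated as unbounded, and regex matching is case-sensitive. None of these details are specified by the skill, and missing `pattern` or `marker` params are not handled here.

```python
import re


def run_code_assertion(assertion, text):
    """Return (result, details) for one code assertion against the output text."""
    check = assertion.get("check")
    params = assertion.get("params", {})

    if check in ("word_count", "char_count"):
        # word_count uses whitespace-separated tokens; char_count uses every character.
        measured = len(text.split()) if check == "word_count" else len(text)
        unit = "words" if check == "word_count" else "characters"
        low, high = params.get("min"), params.get("max")
        # Assumption: bounds are inclusive and an absent bound means "no limit".
        in_range = (low is None or measured >= low) and (high is None or measured <= high)
        return ("PASS" if in_range else "FAIL", f"{measured} {unit} (range: {low}-{high})")

    if check in ("regex_match", "regex_not_match"):
        count = sum(1 for _ in re.finditer(params["pattern"], text))
        passed = count > 0 if check == "regex_match" else count == 0
        return ("PASS" if passed else "FAIL", f"{count} matches found")

    if check in ("structure_present", "structure_absent"):
        found = params["marker"] in text
        passed = found if check == "structure_present" else not found
        return ("PASS" if passed else "FAIL", f"marker {'found' if found else 'not found'}")

    # Unsupported checks are reported explicitly rather than skipped.
    return ("ERROR", f"unsupported check: {check}")
```

Every branch returns the measured value alongside the verdict, so the caller can record details and not just PASS or FAIL.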
Step 3 — Run llm-judge assertions
For each assertion where type: llm-judge:
- Construct the evaluation prompt:

      {assertion.prompt}
      ---
      OUTPUT TO EVALUATE:
      {full content of output_file}

- Submit the prompt. Parse the response for a binary verdict: PASS or FAIL.
- Extract the one-sentence explanation from the response.
- Record the result:
  - PASS: result is PASS with the LLM's explanation
  - FAIL: result is FAIL with the LLM's explanation
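Prompt construction and verdict parsing might look like the sketch below. Several things here are assumptions rather than part of the skill: the exact line breaks in the prompt template, the expectation that the judge's reply contains the verdict word followed by its explanation, and the choice to report ERROR when no verdict can be found. `judge` is a hypothetical callable that takes a prompt string and returns the model's reply; no particular LLM client is implied.

```python
import re


def build_judge_prompt(assertion_prompt, output_text):
    """Join the assertion prompt and the evaluated output using the Step 3 template."""
    return f"{assertion_prompt}\n---\nOUTPUT TO EVALUATE:\n{output_text}"


def parse_verdict(response):
    """Return (result, explanation); ERROR if no PASS/FAIL verdict is present."""
    match = re.search(r"\b(PASS|FAIL)\b", response)
    if match is None:
        return ("ERROR", "no PASS/FAIL verdict found in judge response")
    # Assumption: the explanation follows the verdict; keep its first line only.
    remainder = response[match.end():].lstrip(" \t\r\n:-").strip()
    explanation = remainder.splitlines()[0] if remainder else ""
    return (match.group(1), explanation)


def run_llm_judge(assertion, output_text, judge):
    """Evaluate one llm-judge assertion using the hypothetical `judge` callable."""
    prompt = build_judge_prompt(assertion["prompt"], output_text)
    return parse_verdict(judge(prompt))
```

If the reply mentions both words, the first one wins in this sketch. A stricter response format in the assertion prompt would make the parse less ambiguous.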
Step 4 — Compile score report
After all assertions are evaluated, compile the score report:
- Count total assertions run and total assertions passed.
- List all failed assertions with their IDs, descriptions, and failure details.
- Produce the structured output (see Outputs section).
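The counting in Step 4 can be illustrated with a small helper. This is a sketch: `compile_score` is a hypothetical name, each result is assumed to be a dict with `id`, `description`, `result`, and `details` keys, and ERROR results are assumed to count as not passed and to be listed with the failures.

```python
def compile_score(results):
    """Summarise a list of per-assertion result dicts into a score report structure."""
    total = len(results)
    passed = sum(1 for r in results if r["result"] == "PASS")
    # Assumption: anything that is not PASS (FAIL or ERROR) is listed as failed.
    failed = [r for r in results if r["result"] != "PASS"]
    return {
        "score": {"passed": passed, "total": total, "ratio": f"{passed}/{total}"},
        "results": results,
        "failed_assertions": failed,
    }
```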
Quality Checks
- Every assertion in the evals.yaml is evaluated — no assertion is skipped silently
- Each assertion result records its measured value or LLM rationale, not just PASS/FAIL
- The total score is expressed as N/total (e.g., 4/5)
- Failed assertions are listed with enough detail to understand what was measured and why it failed
- The score report is structured such that an agent can parse it programmatically (not free prose)
- If any assertion has an unsupported check value: report as ERROR, do not skip silently
Outputs
The skill produces a score report in the following structured Markdown format:
# Eval Report: {skill name} — {evals.yaml version}
**Output file:** {output_file path}
**Evals file:** {evals_file path}
**Run date:** {ISO 8601 date}
**Score:** {N}/{total} assertions passed
---
## Results
| ID | Type | Description | Result | Details |
|----|------|-------------|--------|---------|
| A01 | code | Word count within ±15% of target | PASS | 1247 words (range: 1020–1380) |
| A02 | code | Kill list word 'leverage' absent | FAIL | 2 matches found |
| A03 | llm-judge | Post stands alone without prior context | PASS | "The post opens with a clear hook and requires no prior context to understand." |
---
## Failed Assertions
### A02 — Kill list word 'leverage' absent
- **Type:** code
- **Check:** regex_not_match
- **Pattern:** `\bleverag(e|ing|ed)\b`
- **Result:** FAIL — 2 matches found at positions [line 4, line 11]
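A results table in the layout above could be generated from the recorded results roughly as follows. This is a sketch: `render_results_table` is a hypothetical helper, each result is assumed to be a dict with `id`, `type`, `description`, `result`, and `details` keys, and the pipe escaping is an added precaution rather than something the skill specifies.

```python
def render_results_table(results):
    """Render result dicts as a Markdown table matching the report's column order."""
    header = "| ID | Type | Description | Result | Details |"
    divider = "|----|------|-------------|--------|---------|"
    rows = []
    for r in results:
        cells = [r["id"], r["type"], r["description"], r["result"], r["details"]]
        # Escape pipes so a details string cannot break the table layout.
        rows.append("| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |")
    return "\n".join([header, divider, *rows])
```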
The score report may also be emitted as structured YAML if the invoking agent requires machine-readable output:
eval_report:
  skill: content-draft
  evals_version: "1.0"
  output_file: {path}
  evals_file: {path}
  run_date: {ISO 8601}
  score:
    passed: 4
    total: 5
    ratio: "4/5"
  results:
    - id: A01
      type: code
      description: "Word count within ±15% of target"
      result: PASS
      details: "1247 words (range: 1020–1380)"
    - id: A02
      type: code
      description: "Kill list word 'leverage' absent"
      result: FAIL
      details: "2 matches found"
  failed_assertions:
    - id: A02
      description: "Kill list word 'leverage' absent"
      type: code
      check: regex_not_match
      pattern: "\\bleverag(e|ing|ed)\\b"
      details: "2 matches found at positions [line 4, line 11]"
Non-Goals
This skill must NOT:
- Modify the output file being evaluated
- Modify the source skill whose output is being evaluated
- Invoke any other skill (skills never chain)
- Make recommendations about what to change in the skill or its output
- Generate an evals.yaml file (that is agent work in the Skill Optimize protocol)
- Compare scores across multiple runs (that is agent orchestration)
- Propose a verdict on whether the skill should be updated (that is a human decision)
No silent skips. Every assertion produces an explicit PASS, FAIL, or ERROR result.