Skill Evals
Evaluate skill output quality against assertion manifests — detects regressions before users notice
Install this agent skill to your Project
npx add-skill https://github.com/aaronjmars/aeon/tree/main/skills/skill-evals
SKILL.md
${var} — Skill name to evaluate. If empty, evaluates all skills in evals.json.
Today is ${today}. Your task is to evaluate skill output quality by validating recent outputs against assertions defined in skills/skill-evals/evals.json.
Steps
- Read the assertion manifest: `skills/skill-evals/evals.json`.
- Read `aeon.yml` to get the full list of registered skills.
- Read `memory/cron-state.json`, which contains `total_runs`, `success_rate`, and `last_quality_score` per skill.
- Determine which skills to evaluate:
  - If `${var}` is set, evaluate only that skill.
  - Otherwise, evaluate all skills listed in evals.json.
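The selection rule can be sketched in a few lines. This is only an illustration: the field names come from the checks described in this document, but the top-level shape of evals.json (an object keyed by skill name), the example glob patterns, and the helper name `select_skills` are assumptions.

```python
from __future__ import annotations

# Hypothetical manifest content. Field names follow this document; the
# top-level shape (keyed by skill name) and the glob values are assumptions.
SAMPLE_MANIFEST = {
    "heartbeat": {
        "output_pattern": "articles/heartbeat-*.md",
        "min_words": 20,
        "required_patterns": ["status|ok"],
        "forbidden_patterns": ["TODO"],
    },
    "repo-pulse": {
        "output_pattern": "articles/repo-pulse-*.md",
        "min_words": 100,
        "required_patterns": ["stars|forks"],
        "forbidden_patterns": [],
        "numeric_checks": [{"pattern": r"(\d+) stars", "min": 0, "max": 1000000}],
    },
}


def select_skills(manifest: dict, var: str = "") -> list[str]:
    """Return only `var` when it is set, otherwise every skill in the manifest."""
    if var:
        return [var]
    return sorted(manifest)
```

Sorting the skill names is a design choice for stable report ordering, not something the skill requires.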
- For each skill being evaluated, run the following checks:

  a. Find the most recent output file using the glob pattern from evals.json (`output_pattern`).
     - Use Glob to find all matching files.
     - Sort by filename descending to get the most recent.
     - If no matching file exists → mark as NO COVERAGE.

  b. Word count check: count the words in the file. Fail if the count is below `min_words`.

  c. Required pattern check: for each pattern in `required_patterns`, search the file content.
     - Each pattern is a set of pipe-separated alternatives (e.g. `"stars|forks"`: at least one alternative must appear).
     - Fail if any required pattern is not found.

  d. Forbidden pattern check: for each pattern in `forbidden_patterns`, search the file.
     - Fail if any forbidden pattern IS found.

  e. Numeric checks (if `numeric_checks` is defined): for each entry:
     - Extract the first number matching the regex `pattern` from the file.
     - Fail if the extracted value is outside [`min`, `max`].
     - If no match is found and the entry does not say to skip missing values, flag as WARN.

  f. Quality score cross-check: read `memory/skill-health/{skill}.json` if it exists.
     - If `avg_score` < 2.5 → flag as QUALITY_DEGRADED even if assertions pass.
     - If `avg_score` >= 2.5 → note the score in the report.
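Checks a through e above could look roughly like the following. It is a sketch under stated assumptions: patterns are treated as case-sensitive Python regular expressions, words are whitespace-separated tokens, and the helper names and the `skip_if_not_found` key are hypothetical (the source does not name the skip option).

```python
from __future__ import annotations

import re
from pathlib import Path

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def latest_output(root: Path, output_pattern: str) -> Path | None:
    """Check a: newest match by filename, or None when nothing matches."""
    matches = sorted(root.glob(output_pattern), key=lambda p: p.name, reverse=True)
    return matches[0] if matches else None


def run_assertions(text: str, spec: dict) -> tuple[list[str], list[str]]:
    """Checks b to e: return (failures, warnings) for one output file."""
    failures: list[str] = []
    warnings: list[str] = []

    # b. Word count, using whitespace-separated tokens as words.
    words = len(text.split())
    if words < spec.get("min_words", 0):
        failures.append(f"{words} words, below min_words {spec['min_words']}")

    # c. Each required pattern is a regex, so "a|b" passes if either matches.
    for pattern in spec.get("required_patterns", []):
        if not re.search(pattern, text):
            failures.append(f'Missing pattern: "{pattern}"')

    # d. Any forbidden pattern that matches is a failure.
    for pattern in spec.get("forbidden_patterns", []):
        if re.search(pattern, text):
            failures.append(f'Forbidden pattern found: "{pattern}"')

    # e. Numeric checks on the first regex match.
    for check in spec.get("numeric_checks", []):
        match = re.search(check["pattern"], text)
        number = None
        if match:
            # Prefer capture group 1 when the pattern defines one.
            raw = (match.group(1) if match.re.groups else None) or match.group(0)
            number = NUMBER.search(raw.replace(",", ""))
        if number is None:
            # "skip_if_not_found" is a hypothetical key name.
            if not check.get("skip_if_not_found", False):
                warnings.append(f'No number found for "{check["pattern"]}"')
            continue
        value = float(number.group(0))
        if not check["min"] <= value <= check["max"]:
            failures.append(f"{value:g} outside [{check['min']}, {check['max']}]")
    return failures, warnings
```

Sorting by filename only picks the newest file when filenames embed a sortable date such as `2024-01-02`, which appears to be the convention these instructions rely on.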
- Classify each skill:
  - PASS: all assertions pass and the quality score is >= 2.5 (or there is no health data yet)
  - FAIL: one or more assertions failed (word count, required pattern, forbidden pattern, or numeric check)
  - QUALITY_DEGRADED: assertions pass but the average quality score is < 2.5
  - NO COVERAGE: no output file found matching the pattern
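The four statuses imply an order of precedence, which a small function makes explicit. This is a sketch: the helper names are hypothetical, and reading `avg_score` as a top-level key of the health file is an assumption about that file's layout.

```python
from __future__ import annotations

import json
from pathlib import Path

QUALITY_THRESHOLD = 2.5


def read_avg_score(root: Path, skill: str) -> float | None:
    """Return avg_score from memory/skill-health/{skill}.json, or None if absent."""
    path = root / "memory" / "skill-health" / f"{skill}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("avg_score")


def classify(output_found: bool, failures: list[str], avg_score: float | None) -> str:
    """Map check results to PASS, FAIL, QUALITY_DEGRADED, or NO COVERAGE."""
    if not output_found:
        return "NO COVERAGE"  # nothing to assert against
    if failures:
        return "FAIL"  # assertion failures outrank the quality score
    if avg_score is not None and avg_score < QUALITY_THRESHOLD:
        return "QUALITY_DEGRADED"
    return "PASS"  # includes the "no health data yet" case
```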
- Detect coverage gaps: skills that have cron-state entries with `total_runs > 0` but are NOT in evals.json. These skills are running in production without any eval spec.
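Coverage-gap detection amounts to a set difference filtered by run count. The sketch below assumes both `memory/cron-state.json` and evals.json are objects keyed by skill name; neither shape is shown in the source, and `coverage_gaps` is a hypothetical helper name.

```python
from __future__ import annotations


def coverage_gaps(cron_state: dict, manifest: dict) -> list[tuple[str, int, float | None]]:
    """Return (skill, total_runs, success_rate) for skills that ran but have no eval spec."""
    gaps = []
    for skill, state in sorted(cron_state.items()):
        runs = state.get("total_runs", 0)
        if runs > 0 and skill not in manifest:
            gaps.append((skill, runs, state.get("success_rate")))
    return gaps


# Example with made-up data: only token-report has runs and no manifest entry.
example_state = {
    "heartbeat": {"total_runs": 10, "success_rate": 1.0},
    "token-report": {"total_runs": 4, "success_rate": 0.75},
    "idle-skill": {"total_runs": 0},
}
example_gaps = coverage_gaps(example_state, {"heartbeat": {}})
```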
- Write the report to `articles/skill-evals-${today}.md`:

  ```markdown
  # Skill Evals — ${today}

  ## Results

  | Skill | Status | Details |
  |-------|--------|---------|
  | heartbeat | PASS | 52 words, all patterns matched |
  | repo-pulse | FAIL | Missing pattern: "stars" |
  | token-report | NO COVERAGE | No output file found |

  ## Summary
  - Evaluated: N skills (from evals.json)
  - Passing: N
  - Failing: N
  - Quality degraded: N
  - No coverage: N

  ## Coverage Gaps (running in production but not in evals.json)
  - skill-name: N runs (success_rate: X%)

  ## Recommendations
  [List specific fixes for failing skills and skills to add to evals.json]
  ```
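The Results and Summary sections of that template can be generated from a list of per-skill results. This is a sketch only, with `render_results` as a hypothetical helper. It escapes pipe characters in cells, because a detail such as `Missing pattern: "stars|forks"` would otherwise split the table row.

```python
from __future__ import annotations

from collections import Counter


def render_results(results: list[tuple[str, str, str]]) -> str:
    """Render the Results table and Summary list from (skill, status, details) rows."""
    lines = [
        "## Results",
        "",
        "| Skill | Status | Details |",
        "|-------|--------|---------|",
    ]
    for skill, status, details in results:
        safe = details.replace("|", "\\|")  # keep regex alternations inside one cell
        lines.append(f"| {skill} | {status} | {safe} |")

    counts = Counter(status for _, status, _ in results)
    lines += [
        "",
        "## Summary",
        f"- Evaluated: {len(results)} skills (from evals.json)",
        f"- Passing: {counts['PASS']}",
        f"- Failing: {counts['FAIL']}",
        f"- Quality degraded: {counts['QUALITY_DEGRADED']}",
        f"- No coverage: {counts['NO COVERAGE']}",
    ]
    return "\n".join(lines)
```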
- Send a notification via `./notify` if any skills are FAILING or QUALITY_DEGRADED. Include the full results table in the message. If all skills pass or have no coverage issues, send a brief summary: "Skill Evals PASS — N/N skills healthy."
- Log to `memory/logs/${today}.md`:

  ```
  ## Skill Evals — ${today}
  - Evaluated N skills from evals.json
  - PASS: X, FAIL: Y, NO_COVERAGE: Z
  - Coverage gaps: [list skill names]
  ```
Write complete output with no placeholders.