skill-optimize
Run a structured evaluate-analyze-improve cycle on any GAAI skill to measure quality, detect regressions, and propose targeted improvements. Activate when a skill needs baseline evaluation, after SKILL.md modifications, or when friction-retrospective flags a skill.
Install this agent skill to your Project
npx add-skill https://github.com/Fr-e-d/GAAI-framework/tree/main/.gaai/core/skills/cross/skill-optimize
Metadata
Additional technical details for this skill
- id: SKILL-CRS-026
- track: cross-cutting
- author: gaai-framework
- status: experimental
- version: 1.0
- category: cross
- updated at: 1773964800
SKILL.md
Skill Optimize
Purpose / When to Activate
Activate when:
- A skill needs a baseline quality measurement (no ledger.yaml exists yet)
- A SKILL.md has been modified and a before/after regression check is needed
- friction-retrospective flags a skill as a recurring friction source
- An eval cycle is needed after a manual skill update
This skill formalizes the Skill Optimize protocol referenced by eval-run (SKILL-CRS-025). It runs the full evaluate-analyze-improve loop with mandatory human gates at every modification step.
Scope: Any GAAI skill with measurable quality criteria — not limited to content-production skills. The inline Skill Optimize protocol in discovery.agent.md remains the agent's orchestration logic; this skill provides the structured execution procedure.
Process
Step 1 — Eval authoring
If no evals.yaml exists for the target skill:
- Read the target SKILL.md in full.
- Identify measurable quality criteria from the Quality Checks section.
- Author evals.yaml following the evals-format.md spec (see eval-run/references/evals-format.md).
- Include a minimum of 5 assertions with a mix of code and llm-judge types.
- Store the file at {skill-dir}/eval-corpus/evals.yaml.
HUMAN CHECKPOINT: Present the drafted evals.yaml for validation. Do not proceed until approved. If rejected, revise based on feedback and re-present.
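The two structural minimums from Step 1 (at least 5 assertions, both assertion types present) can be checked mechanically before the draft goes to the human. The sketch below is only an illustration: the real schema lives in evals-format.md, which is not reproduced here, so the `id` and `type` field names and the `check_evals` helper are assumptions.

```python
# Minimal pre-checkpoint check for a drafted evals.yaml, already loaded into
# a list of dicts. Field names ("id", "type") are assumed, not taken from
# evals-format.md.
REQUIRED_TYPES = {"code", "llm-judge"}
MIN_ASSERTIONS = 5


def check_evals(assertions):
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if len(assertions) < MIN_ASSERTIONS:
        problems.append(
            f"only {len(assertions)} assertions, need at least {MIN_ASSERTIONS}"
        )
    present = {a.get("type") for a in assertions}
    for missing in sorted(REQUIRED_TYPES - present):
        problems.append(f"no assertion of type {missing!r}")
    return problems


draft = [
    {"id": "A01", "type": "code"},
    {"id": "A02", "type": "code"},
    {"id": "A03", "type": "llm-judge"},
]
for problem in check_evals(draft):
    print(problem)
```

A clean result does not replace the human checkpoint; it only catches drafts that would be rejected on structure alone.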
Step 2 — Corpus generation
If no corpus outputs exist in {skill-dir}/eval-corpus/:
- Identify the skill's expected inputs from its inputs: frontmatter and Process section.
- Produce 2-3 representative outputs by simulating the skill's expected inputs.
- Store each output in {skill-dir}/eval-corpus/ with naming corpus-{N}.md.
If corpus outputs already exist (from prior runs or real production), use those. Prefer real outputs over synthetic when available.
Step 3 — Baseline evaluation
- Invoke eval-run (SKILL-CRS-025) with each corpus output against the evals.yaml.
- Compile per-output scores into {skill-dir}/eval-corpus/score-baseline.yaml.
- Record the aggregate: passed / total and pass_rate.
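As a rough illustration of the aggregate in Step 3, the sketch below folds per-output, per-assertion results into `passed`, `total`, and `pass_rate`. The input shape (a dict of corpus file name to assertion results) and the `aggregate` helper are assumptions; the actual eval-run output format is not shown in this document.

```python
# Assumed input shape: {corpus_file: {assertion_id: passed?}}.
def aggregate(per_output_results):
    """Fold per-output assertion results into the Step 3 aggregate."""
    passed = sum(
        1
        for results in per_output_results.values()
        for ok in results.values()
        if ok
    )
    total = sum(len(results) for results in per_output_results.values())
    # Guard against an empty corpus rather than dividing by zero.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {"passed": passed, "total": total, "pass_rate": pass_rate}


results = {
    "corpus-1.md": {"A01": True, "A02": False},
    "corpus-2.md": {"A01": True, "A02": True},
}
print(aggregate(results))  # {'passed': 3, 'total': 4, 'pass_rate': 0.75}
```

The per-assertion results should still be written to the score file alongside the aggregate, since the Quality Checks require per-assertion tracking.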
Step 4 — Error analysis
For each failed assertion across all corpus outputs:
- Identify the root cause in the target SKILL.md: which step, which instruction.
- Classify the failure:
  - instruction-gap: the skill doesn't instruct what is needed
  - instruction-ambiguity: the skill instructs ambiguously
  - eval-design-error: the assertion is flawed, not the skill
  - model-limitation: the model cannot reliably produce what is asked
- Produce {skill-dir}/eval-corpus/error-analysis.md with per-assertion findings.
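Because every failure must carry one of the four classes, a small completeness check can run before error-analysis.md is written. This is a sketch under assumptions: the finding fields (`assertion`, `classification`) and the `unclassified` helper are hypothetical names, not part of the skill.

```python
# The four failure classes named in Step 4.
FAILURE_CLASSES = {
    "instruction-gap",
    "instruction-ambiguity",
    "eval-design-error",
    "model-limitation",
}


def unclassified(findings):
    """Return findings whose classification is missing or not a known class."""
    return [f for f in findings if f.get("classification") not in FAILURE_CLASSES]


findings = [
    {"assertion": "A03", "classification": "instruction-gap"},
    {"assertion": "A07", "classification": "unclear"},  # not a valid class
    {"assertion": "A11"},                               # never classified
]
print([f["assertion"] for f in unclassified(findings)])  # ['A07', 'A11']
```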
Step 5 — Improvement proposal
Based on the error analysis:
- Propose specific, minimal SKILL.md edits addressing instruction-gap and instruction-ambiguity failures.
- For eval-design-error failures: propose evals.yaml corrections instead.
- For model-limitation failures: document as known limitations, do not propose changes.
- Present the proposal to the human.
HUMAN CHECKPOINT: The human approves, modifies, or rejects the proposal. NEVER auto-apply SKILL.md changes.
If approved:
- Apply the edits to SKILL.md.
- Re-run Steps 3-4 as a new iteration (score file: score-{iteration}.yaml).
- Compare against previous iteration scores.
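The routing in Step 5 (which failure class leads to which kind of proposal) can be pictured as a simple lookup. The sketch below only groups failures by proposed action; it applies nothing, in keeping with the human gate. The action labels, the finding fields, and `build_proposal` are illustrative assumptions.

```python
# Failure class -> kind of proposal, following the Step 5 rules.
ACTIONS = {
    "instruction-gap": "propose SKILL.md edit",
    "instruction-ambiguity": "propose SKILL.md edit",
    "eval-design-error": "propose evals.yaml correction",
    "model-limitation": "document as known limitation",
}


def build_proposal(findings):
    """Group failed assertion ids by the action their class calls for."""
    # dict.fromkeys de-duplicates the action labels while keeping their order.
    proposal = {action: [] for action in dict.fromkeys(ACTIONS.values())}
    for finding in findings:
        proposal[ACTIONS[finding["classification"]]].append(finding["assertion"])
    return proposal


findings = [
    {"assertion": "A03", "classification": "instruction-gap"},
    {"assertion": "A07", "classification": "eval-design-error"},
    {"assertion": "A11", "classification": "model-limitation"},
]
for action, assertions in build_proposal(findings).items():
    print(f"{action}: {assertions}")
```

An unknown classification raises `KeyError` here, which is deliberate in this sketch: unclassified failures are not allowed to reach the proposal stage.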
Step 6 — Ledger update
After each iteration (including baseline), append an entry to {skill-dir}/quality/ledger.yaml:
iterations:
  - id: {N}
    date: {ISO 8601}
    trigger: {trigger input value}
    score:
      passed: N
      total: N
      pass_rate: 0.XX
      delta_vs_previous: +/-0.XX # null for baseline
    failed_assertions: [ANN, ...]
    action_taken: "{description of SKILL.md change, or 'baseline — no action'}"
status:
  current_pass_rate: 0.XX
  trend: improving | stable | degrading
  slo_target: 0.85
  error_budget_remaining: 0.XX
The ledger is append-only — iteration history is never deleted or overwritten.
For ledger format details, see references/ledger-format.md.
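A minimal sketch of the append step, working on the ledger as an in-memory dict rather than a YAML file. The key names come from the template above, but the nesting (in particular, keeping `delta_vs_previous` inside `score`) and the `append_iteration` helper are assumptions; references/ledger-format.md is the authority on the real layout.

```python
import copy


def append_iteration(ledger, date, trigger, passed, total,
                     failed_assertions, action_taken):
    """Return a new ledger with one more iteration; existing entries are untouched."""
    new_ledger = copy.deepcopy(ledger)
    iterations = new_ledger.setdefault("iterations", [])
    pass_rate = round(passed / total, 2)
    # Baseline has no previous entry, so its delta is None (null in YAML).
    previous = iterations[-1]["score"]["pass_rate"] if iterations else None
    delta = None if previous is None else round(pass_rate - previous, 2)
    iterations.append({
        "id": len(iterations) + 1,
        "date": date,
        "trigger": trigger,
        "score": {
            "passed": passed,
            "total": total,
            "pass_rate": pass_rate,
            "delta_vs_previous": delta,
        },
        "failed_assertions": list(failed_assertions),
        "action_taken": action_taken,
    })
    return new_ledger


# Hypothetical dates and trigger values, for illustration only.
ledger = append_iteration({}, "2026-03-20", "baseline", 12, 15,
                          ["A03", "A07", "A11"], "baseline")
ledger = append_iteration(ledger, "2026-03-21", "skill-modified", 14, 15,
                          ["A07"], "clarified Step 2 wording")
print(ledger["iterations"][-1]["score"])
```

Only ever appending to `iterations` is what keeps the history intact; the `status` block is left out of this sketch.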
Step 7 — Trend detection
After updating the ledger:
- If trend: degrading over 3+ consecutive iterations: escalate to human with full history and recommendation.
- If error_budget_remaining < 0 (pass rate below SLO for 3+ iterations): flag the skill as needs-optimization in the ledger status. This blocks new deliveries using this skill until the human resolves it.
- If trend: improving or stable: report status inline and complete.
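One possible reading of the Step 7 rules, sketched over a plain list of pass rates. The skill does not define how `trend` or `error_budget_remaining` are computed, so this sketch assumes that "degrading" means the pass rate dropped in each of the last three transitions and that the budget is exhausted when the last three pass rates are all below the SLO. The `evaluate_trend` helper and its action labels are hypothetical.

```python
def evaluate_trend(pass_rates, slo_target=0.85, window=3):
    """Return the Step 7 actions implied by a history of pass rates (oldest first)."""
    deltas = [later - earlier for earlier, later in zip(pass_rates, pass_rates[1:])]

    # Assumed meaning of "degrading over 3+ consecutive iterations".
    recent_deltas = deltas[-window:]
    degrading = len(recent_deltas) == window and all(d < 0 for d in recent_deltas)

    # Assumed meaning of "pass rate below SLO for 3+ iterations".
    recent_rates = pass_rates[-window:]
    below_slo = len(recent_rates) == window and all(r < slo_target for r in recent_rates)

    actions = []
    if degrading:
        actions.append("escalate-to-human")
    if below_slo:
        actions.append("flag-needs-optimization")
    return actions or ["report-status"]


print(evaluate_trend([0.95, 0.92, 0.90, 0.88]))  # ['escalate-to-human']
print(evaluate_trend([0.90, 0.80, 0.82, 0.84]))  # ['flag-needs-optimization']
print(evaluate_trend([0.80, 0.90]))              # ['report-status']
```

The second call shows why the two checks are kept separate: a skill can be recovering and still sit below its SLO.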
Quality Checks
- Every iteration produces a score report — no silent skips
- Ledger is append-only — iteration history never deleted
- SKILL.md modifications require human approval (SkillsBench finding: self-generated skill edits = -1.3pp without human review)
- Per-assertion tracking in every score report, not just aggregate scores (prevents AP-8: aggregation hiding regressions)
- Mixed assertion types mandatory in evals.yaml: both code and llm-judge (prevents AP-1: self-model bias)
- Error analysis classifies every failure: unclassified failures are not allowed
- Improvement proposals are minimal and targeted — no wholesale rewrites
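The per-assertion tracking check is easiest to see with a small example: two iterations can share the same aggregate while one assertion has quietly regressed. The sketch below assumes each score report can be reduced to a dict of assertion id to pass/fail; `find_regressions` is a hypothetical helper.

```python
def find_regressions(previous, current):
    """Assertions that passed in the previous iteration and fail in the current one."""
    return sorted(
        assertion
        for assertion, ok in previous.items()
        if ok and current.get(assertion) is False
    )


previous = {"A01": True, "A02": False, "A03": True}
current = {"A01": False, "A02": True, "A03": True}

# Both iterations pass 2 of 3 assertions, so the aggregate is unchanged...
print(sum(previous.values()), sum(current.values()))  # 2 2
# ...but A01 has regressed, which only the per-assertion view reveals.
print(find_regressions(previous, current))  # ['A01']
```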
Outputs
| Output | Path | Persistence |
|---|---|---|
| Eval assertions | {skill-dir}/eval-corpus/evals.yaml | Created once, updated on eval-design-error |
| Corpus outputs | {skill-dir}/eval-corpus/corpus-{N}.md | Stable across iterations |
| Score reports | {skill-dir}/eval-corpus/score-{iteration}.yaml | One per iteration |
| Error analysis | {skill-dir}/eval-corpus/error-analysis.md | Overwritten each iteration |
| Quality ledger | {skill-dir}/quality/ledger.yaml | Append-only, never overwritten |
| Improvement proposal | Inline in session | Not persisted |
Non-Goals
This skill must NOT:
- Auto-modify SKILL.md without human approval (human gate is non-negotiable)
- Invoke the target skill to produce outputs (skills never chain — it evaluates existing outputs only)
- Compare quality across different skills (only within-skill across iterations)
- Set or modify SLO targets (human decision — skill only reads and reports against them)
- Generate corpus from production data without explicit human authorization
- Skip the error analysis step (every failure must be classified before proposing changes)
- Propose changes for model-limitation failures (these are documented, not "fixed")
For documented anti-patterns and mitigations, see references/anti-patterns.md.
No silent assumptions. Every evaluation result, every failure classification, every improvement proposal becomes explicit and governed.