Agent skill

skill-optimize

Run a structured evaluate-analyze-improve cycle on any GAAI skill to measure quality, detect regressions, and propose targeted improvements. Activate when a skill needs baseline evaluation, after SKILL.md modifications, or when friction-retrospective flags a skill.

Install this agent skill to your Project

npx add-skill https://github.com/Fr-e-d/GAAI-framework/tree/main/.gaai/core/skills/cross/skill-optimize

Metadata

Additional technical details for this skill

id: SKILL-CRS-026
track: cross-cutting
author: gaai-framework
status: experimental
version: 1.0
category: cross
updated at: 1773964800

SKILL.md

Skill Optimize

Purpose / When to Activate

Activate when:

- A skill needs a baseline quality measurement (no ledger.yaml exists yet)
- A SKILL.md has been modified and a before/after regression check is needed
- friction-retrospective flags a skill as a recurring friction source
- An eval cycle is needed after a manual skill update

This skill formalizes the Skill Optimize protocol referenced by eval-run (SKILL-CRS-025). It runs the full evaluate-analyze-improve loop with mandatory human gates at every modification step.

Scope: Any GAAI skill with measurable quality criteria — not limited to content-production skills. The inline Skill Optimize protocol in discovery.agent.md remains the agent's orchestration logic; this skill provides the structured execution procedure.

Process

Step 1 — Eval authoring

If no evals.yaml exists for the target skill:

1. Read the target SKILL.md in full.
2. Identify measurable quality criteria from the Quality Checks section.
3. Author evals.yaml following the evals-format.md spec (see eval-run/references/evals-format.md).
4. Include a minimum of 5 assertions with a mix of code and llm-judge types.
5. Store the file at {skill-dir}/eval-corpus/evals.yaml.

HUMAN CHECKPOINT: Present the drafted evals.yaml for validation. Do not proceed until approved. If rejected, revise based on feedback and re-present.
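The two numeric rules in this step (at least 5 assertions, both code and llm-judge types present) can be checked mechanically before the draft goes to the human. The following is a minimal sketch, not part of the skill itself. It assumes each assertion is a dict with hypothetical `id` and `type` keys; the real schema is defined by evals-format.md and may differ.

```python
from collections import Counter

# Thresholds taken from Step 1: at least 5 assertions, both types present.
MIN_ASSERTIONS = 5
REQUIRED_TYPES = {"code", "llm-judge"}


def check_eval_draft(assertions):
    """Return a list of problems found in a drafted assertion list.

    `assertions` is a list of dicts with hypothetical keys "id" and "type";
    the real schema lives in evals-format.md and may differ.
    """
    problems = []
    if len(assertions) < MIN_ASSERTIONS:
        problems.append(
            f"only {len(assertions)} assertions; need at least {MIN_ASSERTIONS}"
        )
    type_counts = Counter(a.get("type") for a in assertions)
    for required in sorted(REQUIRED_TYPES):
        if type_counts[required] == 0:
            problems.append(f"no assertion of type '{required}'")
    return problems
```

An empty result means the draft meets the numeric rules only; it does not replace the human checkpoint.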

Step 2 — Corpus generation

If no corpus outputs exist in {skill-dir}/eval-corpus/:

1. Identify the skill's expected inputs from its inputs: frontmatter and Process section.
2. Produce 2-3 representative outputs by simulating the skill's expected inputs.
3. Store each output in {skill-dir}/eval-corpus/ with naming corpus-{N}.md.

If corpus outputs already exist (from prior runs or real production), use those. Prefer real outputs over synthetic when available.
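As a small illustration of the corpus-{N}.md naming rule, the sketch below lists existing corpus files in numeric order and computes the next free name. The helper names are hypothetical, and numbering from 1 is an assumption; the skill text does not state where N starts.

```python
import re
from pathlib import Path

# Matches the corpus-{N}.md naming convention from Step 2.
CORPUS_PATTERN = re.compile(r"^corpus-(\d+)\.md$")


def existing_corpus(corpus_dir):
    """Return existing corpus files sorted by their numeric index."""
    found = []
    for path in Path(corpus_dir).iterdir():
        match = CORPUS_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def next_corpus_path(corpus_dir):
    """Return the path for the next corpus file (assumes numbering from 1)."""
    indices = [
        int(CORPUS_PATTERN.match(p.name).group(1))
        for p in existing_corpus(corpus_dir)
    ]
    return Path(corpus_dir) / f"corpus-{max(indices, default=0) + 1}.md"
```

If `existing_corpus` returns any files, Step 2 reuses them rather than generating new ones.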

Step 3 — Baseline evaluation

1. Invoke eval-run (SKILL-CRS-025) with each corpus output against the evals.yaml.
2. Compile per-output scores into {skill-dir}/eval-corpus/score-baseline.yaml.
3. Record the aggregate: passed / total and pass_rate.
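The aggregate in this step is simple arithmetic over per-assertion results. Here is a sketch of that calculation, assuming the per-output results are available as a mapping from corpus file name to `{assertion_id: bool}`. That shape is an assumption; the actual eval-run output format is not shown here.

```python
def aggregate_scores(per_output_results):
    """Aggregate per-assertion results across corpus outputs.

    `per_output_results` maps a corpus file name to {assertion_id: bool}.
    This shape is assumed for illustration only.
    """
    passed = sum(
        1
        for results in per_output_results.values()
        for ok in results.values()
        if ok
    )
    total = sum(len(results) for results in per_output_results.values())
    pass_rate = round(passed / total, 2) if total else 0.0
    return {"passed": passed, "total": total, "pass_rate": pass_rate}
```

For example, two corpus outputs with two assertions each and one failure give `passed: 3`, `total: 4`, `pass_rate: 0.75`.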

Step 4 — Error analysis

For each failed assertion across all corpus outputs:

1. Identify the root cause in the target SKILL.md: which step, which instruction.
2. Classify the failure:
   - instruction-gap: the skill doesn't instruct what is needed
   - instruction-ambiguity: the skill instructs ambiguously
   - eval-design-error: the assertion is flawed, not the skill
   - model-limitation: the model cannot reliably produce what is asked
3. Produce {skill-dir}/eval-corpus/error-analysis.md with per-assertion findings.
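The four failure classes form a closed set, so the rule that every failure must be classified can be enforced in code. The sketch below assumes findings are held as a mapping from assertion ID to a classification string; that in-memory shape is hypothetical, since error-analysis.md itself is a markdown report.

```python
from enum import Enum


class FailureClass(Enum):
    """The four failure classes named in Step 4."""

    INSTRUCTION_GAP = "instruction-gap"
    INSTRUCTION_AMBIGUITY = "instruction-ambiguity"
    EVAL_DESIGN_ERROR = "eval-design-error"
    MODEL_LIMITATION = "model-limitation"


def unclassified_failures(failed_ids, findings):
    """Return failed assertion IDs that lack a valid classification.

    `findings` maps assertion ID -> classification string (assumed shape).
    """
    valid = {c.value for c in FailureClass}
    return [fid for fid in failed_ids if findings.get(fid) not in valid]
```

A non-empty result means the analysis is incomplete and Step 5 should not start yet.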

Step 5 — Improvement proposal

Based on the error analysis:

1. Propose specific, minimal SKILL.md edits addressing instruction-gap and instruction-ambiguity failures.
2. For eval-design-error failures: propose evals.yaml corrections instead.
3. For model-limitation failures: document as known limitations, do not propose changes.
4. Present the proposal to the human.

HUMAN CHECKPOINT: The human approves, modifies, or rejects the proposal. NEVER auto-apply SKILL.md changes.
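The mapping from failure class to proposed action can be written as a lookup table. This is an illustrative sketch only: the action labels are hypothetical strings, and the function groups assertion IDs for presentation without applying any change, which keeps the human gate intact.

```python
# Step 5 routing: which kind of proposal each failure class leads to.
# The action labels are hypothetical names chosen for this sketch.
ACTIONS = {
    "instruction-gap": "propose SKILL.md edit",
    "instruction-ambiguity": "propose SKILL.md edit",
    "eval-design-error": "propose evals.yaml correction",
    "model-limitation": "document as known limitation",
}


def build_proposal(findings):
    """Group failed assertion IDs by the action Step 5 prescribes.

    `findings` maps assertion ID -> failure class (assumed shape).
    Nothing is applied here; the result is only presented to the human.
    """
    proposal = {}
    for assertion_id, failure_class in sorted(findings.items()):
        action = ACTIONS[failure_class]  # KeyError signals an unclassified failure
        proposal.setdefault(action, []).append(assertion_id)
    return proposal
```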

If approved:

1. Apply the edits to SKILL.md.
2. Re-run Steps 3-4 as a new iteration (score file: score-{iteration}.yaml).
3. Compare against previous iteration scores.
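Comparing iterations per assertion, not only by aggregate, is what exposes a regression that an unchanged pass rate would hide. The sketch below assumes each iteration is summarized as `{assertion_id: bool}`, where an assertion counts as passed only if it passed on every corpus output. That summarization rule is an assumption, not something the skill specifies.

```python
def compare_iterations(previous, current):
    """Compare two {assertion_id: bool} maps from consecutive iterations.

    Only assertions present in both iterations are compared.
    """
    shared = sorted(set(previous) & set(current))
    regressions = [a for a in shared if previous[a] and not current[a]]
    fixes = [a for a in shared if not previous[a] and current[a]]
    return {"regressions": regressions, "fixes": fixes}
```

With `previous = {"A01": True, "A02": False}` and `current = {"A01": False, "A02": True}`, the aggregate stays at 1 of 2, yet the comparison reports A01 as a regression and A02 as a fix.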

Step 6 — Ledger update

After each iteration (including baseline), append an entry to {skill-dir}/quality/ledger.yaml:

```yaml
iterations:
  - id: {N}
    date: {ISO 8601}
    trigger: {trigger input value}
    score:
      passed: N
      total: N
      pass_rate: 0.XX
    delta_vs_previous: +/-0.XX  # null for baseline
    failed_assertions: [ANN, ...]
    action_taken: "{description of SKILL.md change, or 'baseline — no action'}"
status:
  current_pass_rate: 0.XX
  trend: improving | stable | degrading
  slo_target: 0.85
  error_budget_remaining: 0.XX
```

The ledger is append-only — iteration history is never deleted or overwritten.

For ledger format details, see references/ledger-format.md.
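To make the append-only rule concrete, here is a sketch that builds a new ledger from an existing one without touching prior entries. It works on plain dicts and leaves YAML reading and writing out. Several details are assumptions made for illustration: iteration IDs counting from 1, the example trigger strings, and the formula `error_budget_remaining = pass_rate - slo_target`. references/ledger-format.md remains the authority.

```python
import copy
from datetime import date


def append_iteration(ledger, passed, total, failed, action, trigger, today=None):
    """Return a new ledger dict with one more iteration appended.

    Existing iterations are copied unchanged (append-only). The error-budget
    formula (pass_rate - slo_target) is an assumption made for this sketch.
    """
    new_ledger = copy.deepcopy(ledger)
    iterations = new_ledger.setdefault("iterations", [])
    pass_rate = round(passed / total, 2) if total else 0.0
    previous = iterations[-1]["score"]["pass_rate"] if iterations else None
    delta = None if previous is None else round(pass_rate - previous, 2)
    iterations.append({
        "id": len(iterations) + 1,  # assumed: IDs count from 1
        "date": (today or date.today()).isoformat(),
        "trigger": trigger,
        "score": {"passed": passed, "total": total, "pass_rate": pass_rate},
        "delta_vs_previous": delta,  # None for the baseline
        "failed_assertions": list(failed),
        "action_taken": action,
    })
    status = new_ledger.setdefault("status", {})
    slo = status.get("slo_target")  # read-only: the skill never sets the SLO
    status["current_pass_rate"] = pass_rate
    status["error_budget_remaining"] = (
        None if slo is None else round(pass_rate - slo, 2)
    )
    return new_ledger
```

Returning a copy keeps the caller's original ledger intact, so a failed write cannot corrupt history held in memory.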

Step 7 — Trend detection

After updating the ledger:

- If the trend is degrading for 3 or more consecutive iterations, escalate to the human with the full history and a recommendation.
- If error_budget_remaining < 0 (pass rate below the SLO for 3 or more iterations), flag the skill as needs-optimization in the ledger status. This blocks new deliveries that use the skill until the human resolves it.
- If the trend is improving or stable, report the status inline and complete.
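The three rules above can be sketched as one decision function over the ledger history. The skill does not define how a trend is computed, so this sketch makes two labeled assumptions: an iteration counts as degrading when its `delta_vs_previous` is negative, and as below SLO when its `pass_rate` is less than `slo_target`. The returned action labels are hypothetical.

```python
def trend_action(ledger, window=3):
    """Pick the Step 7 actions from ledger history.

    Assumptions: "degrading" means a negative delta_vs_previous, and
    "below SLO" means pass_rate < status.slo_target, each holding for the
    last `window` consecutive iterations.
    """
    iterations = ledger["iterations"]
    slo = ledger["status"]["slo_target"]
    recent = iterations[-window:]
    actions = []
    if len(recent) == window:
        deltas = [it["delta_vs_previous"] for it in recent]
        if all(d is not None and d < 0 for d in deltas):
            actions.append("escalate-to-human")
        if all(it["score"]["pass_rate"] < slo for it in recent):
            actions.append("flag-needs-optimization")
    return actions or ["report-status"]
```

Both conditions can hold at once, which is why the function returns a list instead of a single action.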

Quality Checks

- Every iteration produces a score report; no silent skips
- Ledger is append-only; iteration history is never deleted
- SKILL.md modifications require human approval (SkillsBench finding: self-generated skill edits = -1.3 percentage points without human review)
- Per-assertion tracking in every score report, not just aggregate scores (prevents AP-8: aggregation hiding regressions)
- Mixed assertion types are mandatory in evals.yaml: both code and llm-judge (prevents AP-1: self-model bias)
- Error analysis classifies every failure; unclassified failures are not allowed
- Improvement proposals are minimal and targeted; no wholesale rewrites

Outputs

| Output | Path | Persistence |
|---|---|---|
| Eval assertions | `{skill-dir}/eval-corpus/evals.yaml` | Created once, updated on eval-design-error |
| Corpus outputs | `{skill-dir}/eval-corpus/corpus-{N}.md` | Stable across iterations |
| Score reports | `{skill-dir}/eval-corpus/score-{iteration}.yaml` | One per iteration |
| Error analysis | `{skill-dir}/eval-corpus/error-analysis.md` | Overwritten each iteration |
| Quality ledger | `{skill-dir}/quality/ledger.yaml` | Append-only, never overwritten |
| Improvement proposal | Inline in session | Not persisted |

Non-Goals

This skill must NOT:

- Auto-modify SKILL.md without human approval (human gate is non-negotiable)
- Invoke the target skill to produce outputs (skills never chain; this skill evaluates existing outputs only)
- Compare quality across different skills (only within-skill across iterations)
- Set or modify SLO targets (human decision; the skill only reads and reports against them)
- Generate corpus from production data without explicit human authorization
- Skip the error analysis step (every failure must be classified before proposing changes)
- Propose changes for model-limitation failures (these are documented, not "fixed")

For documented anti-patterns and mitigations, see references/anti-patterns.md.

No silent assumptions. Every evaluation result, every failure classification, every improvement proposal becomes explicit and governed.

Maintainer

Fr-e-d Core maintainer

Source details

Full Name: Fr-e-d/GAAI-framework
Branch: main
Path in repo: .gaai/core/skills/cross/skill-optimize
License: Other
Topics: claude-code ai-agents ai-coding codex-cli cursor gemini-cli agentic-coding context-engineering vibe-coding opencode autonomous-agents devtools windsurf ai-governance ai-developer-tools ai-memory-system
