self-improving-agent-builder

Encodes a continuous improvement loop for goal-seeking agents: EVAL, ANALYZE, RESEARCH (hypothesis + evidence + counter-arguments), IMPROVE, RE-EVAL, DECIDE. Auto-commits improvements (+2% net, no regression >5%) and reverts failures. Works with all 4 SDK implementations. Auto-activates on "improve agent", "self-improving loop", "agent eval loop", "benchmark agents", "run improvement cycle".

Install this agent skill to your Project

npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/self-improving-agent-builder

SKILL.md

Self-Improving Agent Builder

Purpose

Run a closed-loop improvement cycle on any goal-seeking agent implementation:

EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)

Each iteration measures L1-L12 progressive test scores, identifies failures with error_analyzer.py, runs a research step with hypothesis/evidence/counter-arguments, applies targeted fixes, and gates promotion through regression checks.
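The cycle above can be sketched as a plain Python loop. This is an illustrative skeleton, not the runner's actual code: `run_cycle` and the callables passed to it are hypothetical names, and each phase is supplied as a stub so that only the control flow is shown.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # level name -> score in percent


def run_cycle(
    evaluate: Callable[[], Scores],
    analyze: Callable[[Scores], List[str]],
    research: Callable[[List[str]], List[str]],
    improve: Callable[[List[str]], None],
    decide: Callable[[Scores, Scores], bool],
    revert: Callable[[], None],
    iterations: int,
) -> List[str]:
    """Hypothetical skeleton of EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE."""
    log: List[str] = []
    for _ in range(iterations):
        baseline = evaluate()          # Phase 1: EVAL
        failures = analyze(baseline)   # Phase 2: ANALYZE
        approved = research(failures)  # Phase 3: RESEARCH
        improve(approved)              # Phase 4: IMPROVE
        after = evaluate()             # Phase 5: RE-EVAL
        if decide(baseline, after):    # Phase 6: DECIDE
            log.append("commit")
        else:
            revert()
            log.append("revert")
    return log
```

Passing the phases in as callables keeps the sketch independent of any particular SDK; the real runner wires these steps together internally.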

When I Activate

"improve agent" or "self-improving loop"
"agent eval loop" or "run improvement cycle"
"benchmark agents" or "compare SDK implementations"
"iterate on agent scores" or "fix agent regressions"

Quick Start

User: "Run the self-improving loop on the mini-framework agent for 3 iterations"

Skill: Executes 3 iterations of EVAL->ANALYZE->RESEARCH->IMPROVE->RE-EVAL->DECIDE
       Reports per-iteration scores, net improvement, and commits/reverts.

Runner Script

The self-improvement loop is implemented as a Python CLI:

```bash
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3

# Full options
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
```

Source: src/amplihack/eval/self_improve/runner.py
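For readers who want to see how flags like these are typically parsed, here is a minimal `argparse` sketch. It is an assumption-based illustration, not the contents of runner.py: `build_parser` is a hypothetical name, and the defaults are copied from the Configuration table rather than from the source code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown above; not the runner's actual code."""
    parser = argparse.ArgumentParser(prog="self_improve_sketch")
    parser.add_argument("--sdk", default="mini",
                        choices=["mini", "claude", "copilot", "microsoft"])
    parser.add_argument("--iterations", type=int, default=5)
    parser.add_argument("--improvement-threshold", type=float, default=2.0)
    parser.add_argument("--regression-tolerance", type=float, default=5.0)
    parser.add_argument("--levels", nargs="+",
                        default=["L1", "L2", "L3", "L4", "L5", "L6"])
    parser.add_argument("--output-dir", default="./eval_results/self_improve")
    parser.add_argument("--dry-run", action="store_true")
    return parser


# Parse the "basic usage" flags plus --dry-run from an explicit list.
args = build_parser().parse_args(["--sdk", "mini", "--iterations", "3", "--dry-run"])
print(args.sdk, args.iterations, args.levels, args.dry_run)
```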

The Loop (6 Phases per Iteration)

Phase 1: EVAL

Run the L1-L12 progressive test suite on the current agent implementation. Only the levels passed via --levels are executed (L1 through L6 in the example below).

Execution:

```bash
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
```

Output: Per-level scores and overall baseline.

Phase 2: ANALYZE

Classify failures using error_analyzer.py. It maps each failed question to a failure mode from the taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.) and to the specific code component responsible.

```python
from amplihack.eval.self_improve import analyze_eval_results

analyses = analyze_eval_results(level_results, score_threshold=0.6)
# Each ErrorAnalysis maps to:
#   failure_mode -> affected_component -> prompt_template
```
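To illustrate what a `score_threshold` of 0.6 implies, the standalone sketch below selects questions scoring under the threshold and groups them by failure mode. The record shape (`question`, `score`, `failure_mode`) and the function name `group_failures` are assumptions made for illustration; the actual fields of `ErrorAnalysis` are not documented here, and treating "below threshold" as "failed" is likewise an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def group_failures(results: List[dict], score_threshold: float = 0.6) -> Dict[str, List[str]]:
    """Group questions scoring below the threshold by their (assumed) failure_mode label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for item in results:
        # Assumption: a score strictly below the threshold counts as a failure.
        if item["score"] < score_threshold:
            grouped[item["failure_mode"]].append(item["question"])
    return dict(grouped)


# Made-up sample records, not real eval output.
sample = [
    {"question": "q1", "score": 0.9, "failure_mode": "none"},
    {"question": "q2", "score": 0.3, "failure_mode": "retrieval_insufficient"},
    {"question": "q3", "score": 0.5, "failure_mode": "temporal_ordering_wrong"},
    {"question": "q4", "score": 0.1, "failure_mode": "retrieval_insufficient"},
]
print(group_failures(sample))
```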

Phase 3: RESEARCH (New)

The critical thinking step that prevents blind changes. For each proposed improvement:

State hypothesis: What specific change will fix the failure?
Gather evidence: From eval results, failure patterns, baseline scores
Consider counter-arguments: What could go wrong? Risk of regression?
Make decision: Apply, skip, or defer with full reasoning

Decisions are logged in research_decisions.json for auditability.

Decision criteria:

Apply: Clear failure pattern + prompt template available + low score
Skip: Score above 50% (likely stochastic variation)
Defer: Ambiguous evidence, needs more data
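The decision criteria above can be expressed as a small function, with each decision captured as a record suitable for a JSON audit log. This is a sketch under stated assumptions: `ResearchDecision`, its field names, and `decide_action` are hypothetical, the actual schema of research_decisions.json is not shown in this document, and the example values are invented.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ResearchDecision:
    """Hypothetical audit record; field names are assumptions, not the real schema."""
    hypothesis: str
    evidence: List[str]
    counter_arguments: List[str]
    decision: str  # "apply", "skip", or "defer"
    reasoning: str


def decide_action(score: float, clear_pattern: bool, has_template: bool) -> str:
    """Apply the three criteria above; score is a fraction in [0, 1]."""
    if score > 0.5:
        return "skip"   # score above 50%: likely stochastic variation
    if clear_pattern and has_template:
        return "apply"  # clear failure pattern + prompt template available + low score
    return "defer"      # ambiguous evidence, needs more data


# Invented example values, for illustration only.
record = ResearchDecision(
    hypothesis="Adding ordering guidance to the prompt fixes temporal_ordering_wrong",
    evidence=["Several failed questions share this failure mode"],
    counter_arguments=["A longer prompt may regress other levels"],
    decision=decide_action(score=0.3, clear_pattern=True, has_template=True),
    reasoning="Clear pattern, template available, low score",
)
print(json.dumps([asdict(record)], indent=2))
```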

Phase 4: IMPROVE

Apply the improvements approved by the research step. Priority order:

Prompt template improvements (safest, highest impact)
Retrieval strategy adjustments
Code logic fixes (most risky, needs careful review)
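The priority order above amounts to a stable sort on the kind of change. In this sketch the kind labels (`prompt_template`, `retrieval_strategy`, `code_logic`) and the `order_improvements` helper are hypothetical names chosen for illustration.

```python
from typing import Dict, List

# Lower number = applied earlier (safest first). Labels are illustrative assumptions.
PRIORITY: Dict[str, int] = {
    "prompt_template": 0,
    "retrieval_strategy": 1,
    "code_logic": 2,
}


def order_improvements(improvements: List[dict]) -> List[dict]:
    """Sort approved improvements by priority; sorted() is stable, so ties keep their order."""
    return sorted(improvements, key=lambda imp: PRIORITY[imp["kind"]])


approved = [
    {"kind": "code_logic", "name": "a"},
    {"kind": "prompt_template", "name": "b"},
    {"kind": "retrieval_strategy", "name": "c"},
    {"kind": "prompt_template", "name": "d"},
]
print([imp["name"] for imp in order_improvements(approved)])
```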

Phase 5: RE-EVAL

Re-run the same eval suite after applying fixes to measure impact.
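Measuring impact comes down to comparing the two score sets level by level. The sketch below assumes scores are percentages keyed by level name and that the overall score is the unweighted mean of the per-level scores; the document does not say how the overall score is aggregated, so that part is an assumption, as is the `score_deltas` name.

```python
from typing import Dict, Tuple


def score_deltas(baseline: Dict[str, float], after: Dict[str, float]) -> Tuple[Dict[str, float], float]:
    """Return (per-level change, overall change) in percentage points."""
    per_level = {level: after[level] - baseline[level] for level in baseline}
    # Assumption: overall score = unweighted mean of per-level scores.
    overall_before = sum(baseline.values()) / len(baseline)
    overall_after = sum(after[level] for level in baseline) / len(baseline)
    return per_level, overall_after - overall_before


# Made-up scores, for illustration only.
per_level, overall = score_deltas({"L1": 80.0, "L2": 60.0}, {"L1": 84.0, "L2": 58.0})
print(per_level, overall)
```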

Phase 6: DECIDE

Promotion gate:

Any single-level regression > 5%: REVERT all changes (this check takes precedence)
Net improvement >= +2% overall score, with no level regressing by more than 5%: COMMIT the changes
Otherwise: COMMIT with a marginal-improvement note
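The gate can be written as a three-way function. This sketch assumes the regression check runs first, which matches the skill summary ("+2% net, no regression >5%"); `promotion_gate` and its return labels are hypothetical names, and the defaults mirror `improvement_threshold` and `regression_tolerance`.

```python
from typing import Dict


def promotion_gate(
    per_level_delta: Dict[str, float],
    overall_delta: float,
    improvement_threshold: float = 2.0,
    regression_tolerance: float = 5.0,
) -> str:
    """Return "revert", "commit", or "commit_marginal" for one iteration.

    Deltas are in percentage points (re-eval score minus baseline score).
    """
    worst = min(per_level_delta.values())
    if worst < -regression_tolerance:
        return "revert"           # some level regressed by more than the tolerance
    if overall_delta >= improvement_threshold:
        return "commit"           # net improvement meets the threshold
    return "commit_marginal"      # kept, but flagged as a marginal improvement


print(promotion_gate({"L1": 10.0, "L2": -6.0}, overall_delta=2.0))
print(promotion_gate({"L1": 3.0, "L2": 1.0}, overall_delta=2.0))
print(promotion_gate({"L1": 1.0, "L2": -1.0}, overall_delta=0.0))
```

Note that a regression of exactly 5% does not trigger a revert in this sketch, since the rule is stated as "> 5%".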

Configuration

| Parameter | Default | Description |
|---|---|---|
| `sdk_type` | `mini` | Which SDK: mini/claude/copilot/microsoft |
| `max_iterations` | `5` | Maximum improvement iterations |
| `improvement_threshold` | `2.0` | Minimum % improvement to commit |
| `regression_tolerance` | `5.0` | Maximum % regression on any level |
| `levels` | `L1-L6` | Which levels to evaluate |
| `output_dir` | `./eval_results/self_improve` | Results directory |
| `dry_run` | `false` | Evaluate only, don't apply changes |

Programmatic Usage

```python
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)

result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
```

4-Way Benchmark Mode

Compare all SDK implementations side by side:

User: "Run a 4-way benchmark comparing all SDK implementations"

Skill: Runs eval suite on mini, claude, copilot, microsoft
       Generates comparison table with scores, LOC, and coverage.
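A comparison table of this kind could be rendered with plain string formatting, as in the sketch below. The `comparison_table` helper and the column layout are assumptions, and the numbers are placeholders chosen only to exercise the formatting; they are not benchmark results.

```python
from typing import List, Tuple

Row = Tuple[str, float, int, float]  # (sdk, score %, lines of code, coverage %)


def comparison_table(rows: List[Row]) -> str:
    """Render one line per SDK under a fixed-width header."""
    header = f"{'SDK':<10} {'Score %':>8} {'LOC':>6} {'Coverage %':>11}"
    lines = [header, "-" * len(header)]
    for sdk, score, loc, coverage in rows:
        lines.append(f"{sdk:<10} {score:>8.1f} {loc:>6d} {coverage:>11.1f}")
    return "\n".join(lines)


# PLACEHOLDER values only; not measurements of any SDK.
placeholder_rows: List[Row] = [
    ("mini", 1.0, 100, 10.0),
    ("claude", 2.0, 200, 20.0),
    ("copilot", 3.0, 300, 30.0),
    ("microsoft", 4.0, 400, 40.0),
]
table = comparison_table(placeholder_rows)
print(table)
```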

Integration Points

src/amplihack/eval/self_improve/runner.py: Self-improvement loop runner
src/amplihack/eval/self_improve/error_analyzer.py: Failure classification
src/amplihack/eval/progressive_test_suite.py: L1-L12 eval runner
src/amplihack/agents/goal_seeking/sdk_adapters/: All 4 SDK implementations
src/amplihack/eval/metacognition_grader.py: Advanced eval dimensions
src/amplihack/eval/teaching_session.py: L7 teaching quality eval

Maintainer

rysweet Core maintainer

Source details

Full Name: rysweet/amplihack
Branch: main
Path in repo: .claude/skills/self-improving-agent-builder
