Agent skill
self-improving-agent-builder
Encodes a continuous improvement loop for goal-seeking agents: EVAL, ANALYZE, RESEARCH (hypothesis + evidence + counter-arguments), IMPROVE, RE-EVAL, DECIDE. Auto-commits improvements (+2% net, no regression >5%) and reverts failures. Works with all 4 SDK implementations. Auto-activates on "improve agent", "self-improving loop", "agent eval loop", "benchmark agents", "run improvement cycle".
Install this agent skill to your Project
npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/self-improving-agent-builder
Self-Improving Agent Builder
Purpose
Run a closed-loop improvement cycle on any goal-seeking agent implementation:
EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)
Each iteration measures L1-L12 progressive test scores, identifies failures
with error_analyzer.py, runs a research step with hypothesis/evidence/
counter-arguments, applies targeted fixes, and gates promotion through
regression checks.
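The phase ordering above can be sketched as a small driver. This is only an illustration: `PHASES`, `run_iterations`, and the handler dictionary are hypothetical names, not part of the amplihack runner.

```python
from typing import Callable, Dict, List

# Phase names in the order given above.
PHASES = ["EVAL", "ANALYZE", "RESEARCH", "IMPROVE", "RE-EVAL", "DECIDE"]


def run_iterations(
    handlers: Dict[str, Callable[[dict], None]], iterations: int
) -> List[dict]:
    """Call one handler per phase, in order, for each iteration.

    Hypothetical driver: each handler receives a mutable state dict that
    it can read from and write to (scores, analyses, decisions, ...).
    """
    history = []
    for n in range(1, iterations + 1):
        state = {"iteration": n, "trace": []}
        for phase in PHASES:
            handlers[phase](state)
            state["trace"].append(phase)  # record which phases ran
        history.append(state)
    return history


# Usage with no-op handlers, just to show the control flow.
history = run_iterations({p: (lambda state: None) for p in PHASES}, iterations=2)
print(history[-1]["trace"])
```

Passing a shared state dict keeps each phase independent, so a single phase can be swapped or stubbed out when experimenting.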
When I Activate
- "improve agent" or "self-improving loop"
- "agent eval loop" or "run improvement cycle"
- "benchmark agents" or "compare SDK implementations"
- "iterate on agent scores" or "fix agent regressions"
Quick Start
User: "Run the self-improving loop on the mini-framework agent for 3 iterations"
Skill: Executes 3 iterations of EVAL->ANALYZE->RESEARCH->IMPROVE->RE-EVAL->DECIDE
Reports per-iteration scores, net improvement, and commits/reverts.
Runner Script
The self-improvement loop is implemented as a Python CLI:
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3
# Full options
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
Source: src/amplihack/eval/self_improve/runner.py
The Loop (6 Phases per Iteration)
Phase 1: EVAL
Run the L1-L12 progressive test suite on the current agent implementation.
Execution:
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
Output: Per-level scores and overall baseline.
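As a rough illustration of that output, the sketch below summarizes per-level scores into a single baseline. It assumes the overall score is the unweighted mean of the level scores, which the skill does not state; `overall_score` and `weakest_level` are hypothetical helpers.

```python
from statistics import mean
from typing import Dict


def overall_score(level_scores: Dict[str, float]) -> float:
    """Overall baseline, ASSUMED here to be the unweighted mean of levels."""
    if not level_scores:
        raise ValueError("no level scores to summarize")
    return mean(list(level_scores.values()))


def weakest_level(level_scores: Dict[str, float]) -> str:
    """Level with the lowest score, a natural first target for ANALYZE."""
    return min(level_scores, key=level_scores.get)


# Placeholder numbers for illustration only, not real eval results.
baseline = {"L1": 90.0, "L2": 80.0, "L3": 70.0}
print(overall_score(baseline), weakest_level(baseline))
```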
Phase 2: ANALYZE
Classify failures using error_analyzer.py. Maps each failed question to a
failure taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.) and
the specific code component responsible.
from amplihack.eval.self_improve import analyze_eval_results
analyses = analyze_eval_results(level_results, score_threshold=0.6)
# Each ErrorAnalysis maps to:
# failure_mode -> affected_component -> prompt_template
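To make the failure_mode -> affected_component -> prompt_template mapping concrete, here is a minimal stand-alone sketch. It is not the `error_analyzer.py` implementation: the two failure modes are taken from the text above, but the component and template names, the `Analysis` class, and the input format (results already labelled with a failure mode) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical lookup: failure mode -> (affected component, prompt template).
TAXONOMY: Dict[str, Tuple[str, str]] = {
    "retrieval_insufficient": ("retriever", "retrieval_prompt"),
    "temporal_ordering_wrong": ("synthesizer", "temporal_prompt"),
}


@dataclass
class Analysis:
    question_id: str
    failure_mode: str
    affected_component: str
    prompt_template: str


def classify(results: List[dict], score_threshold: float = 0.6) -> List[Analysis]:
    """Keep questions scoring below the threshold and map each to a component."""
    analyses = []
    for r in results:
        if r["score"] >= score_threshold:
            continue  # treated as passing (assumption: threshold is exclusive)
        component, template = TAXONOMY.get(r["failure_mode"], ("unknown", ""))
        analyses.append(
            Analysis(r["question_id"], r["failure_mode"], component, template)
        )
    return analyses


# Placeholder results for illustration only.
sample = [
    {"question_id": "q1", "score": 0.9, "failure_mode": "retrieval_insufficient"},
    {"question_id": "q2", "score": 0.3, "failure_mode": "temporal_ordering_wrong"},
]
for analysis in classify(sample):
    print(analysis)
```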
Phase 3: RESEARCH (New)
This critical-thinking step guards against blind changes. For each proposed improvement:
- State hypothesis: What specific change will fix the failure?
- Gather evidence: From eval results, failure patterns, baseline scores
- Consider counter-arguments: What could go wrong? Risk of regression?
- Make decision: Apply, skip, or defer with full reasoning
Decisions are logged in research_decisions.json for auditability.
Decision criteria:
- Apply: Clear failure pattern + prompt template available + low score
- Skip: Score above 50% (likely stochastic variation)
- Defer: Ambiguous evidence, needs more data
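The decision criteria above could be encoded roughly as follows. This is a sketch under assumptions: scores are fractions in [0, 1] (so 50% is 0.5), "clear failure pattern" and "prompt template available" arrive as booleans, and the `ResearchDecision` fields are hypothetical rather than the actual schema of research_decisions.json.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ResearchDecision:
    hypothesis: str
    evidence: List[str]
    counter_arguments: List[str]
    decision: str  # "apply", "skip", or "defer"


def decide(score: float, clear_pattern: bool, has_template: bool) -> str:
    """Apply / skip / defer, following the criteria listed above."""
    if score > 0.5:
        return "skip"  # likely stochastic variation
    if clear_pattern and has_template:
        return "apply"
    return "defer"  # ambiguous evidence, needs more data


record = ResearchDecision(
    hypothesis="A stricter ordering prompt fixes temporal failures",
    evidence=["placeholder: repeated temporal_ordering_wrong failures"],
    counter_arguments=["prompt change may affect other levels"],
    decision=decide(score=0.3, clear_pattern=True, has_template=True),
)
# The same JSON text could be written to research_decisions.json.
print(json.dumps([asdict(record)], indent=2))
```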
Phase 4: IMPROVE
Apply the improvements approved by the research step. Priority order:
- Prompt template improvements (safest, highest impact)
- Retrieval strategy adjustments
- Code logic fixes (most risky, needs careful review)
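One simple way to honor this priority order is a stable sort over the approved improvements. The `kind` labels and the dictionary shape below are hypothetical; only the ordering itself comes from the list above.

```python
from typing import Dict, List

# Lower number = applied first. Labels are illustrative, not amplihack names.
PRIORITY: Dict[str, int] = {
    "prompt_template": 0,     # safest, highest impact
    "retrieval_strategy": 1,
    "code_logic": 2,          # most risky, needs careful review
}


def order_improvements(improvements: List[dict]) -> List[dict]:
    """Sort approved improvements by kind; ties keep their original order."""
    return sorted(improvements, key=lambda imp: PRIORITY[imp["kind"]])


approved = [
    {"kind": "code_logic", "target": "ranking function"},
    {"kind": "prompt_template", "target": "temporal prompt"},
    {"kind": "retrieval_strategy", "target": "top-k setting"},
]
for imp in order_improvements(approved):
    print(imp["kind"], "->", imp["target"])
```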
Phase 5: RE-EVAL
Re-run the same eval suite after applying fixes to measure impact.
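Measuring impact comes down to comparing the two runs level by level. A minimal sketch, assuming both runs report scores in percent for the same set of levels; `score_deltas` is a hypothetical helper.

```python
from typing import Dict


def score_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-level change (after - before), in percentage points."""
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same levels")
    return {level: after[level] - before[level] for level in before}


# Placeholder numbers for illustration only.
before = {"L1": 70.0, "L2": 90.0}
after = {"L1": 75.0, "L2": 84.0}
print(score_deltas(before, after))
```

Refusing to compare runs with different level sets avoids silently hiding a regression on a level that was dropped from the re-eval.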
Phase 6: DECIDE
Promotion gate:
- Net improvement >= +2% overall score with no single-level regression > 5%: COMMIT the changes
- Any single level regression > 5%: REVERT all changes
- Otherwise: COMMIT with marginal improvement note
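The gate can be written as a small pure function. This sketch assumes that the regression check takes precedence over the improvement check (consistent with the skill summary, "no regression >5%") and that both thresholds are in percentage points; `promotion_decision` and its return labels are hypothetical.

```python
from typing import Dict


def promotion_decision(
    overall_delta: float,
    level_deltas: Dict[str, float],
    improvement_threshold: float = 2.0,
    regression_tolerance: float = 5.0,
) -> str:
    """Return "REVERT", "COMMIT", or "COMMIT_MARGINAL" for one iteration."""
    worst = min(level_deltas.values(), default=0.0)
    # ASSUMPTION: a large single-level regression overrides any net gain.
    if worst < -regression_tolerance:
        return "REVERT"
    if overall_delta >= improvement_threshold:
        return "COMMIT"
    # Everything else is committed with a marginal-improvement note.
    return "COMMIT_MARGINAL"


print(promotion_decision(3.0, {"L1": 5.0, "L2": -6.0}))
print(promotion_decision(3.0, {"L1": 5.0, "L2": -1.0}))
```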
Configuration
| Parameter | Default | Description |
|---|---|---|
| sdk_type | mini | Which SDK: mini/claude/copilot/microsoft |
| max_iterations | 5 | Maximum improvement iterations |
| improvement_threshold | 2.0 | Minimum % improvement to commit |
| regression_tolerance | 5.0 | Maximum % regression on any level |
| levels | L1-L6 | Which levels to evaluate |
| output_dir | ./eval_results/self_improve | Results directory |
| dry_run | false | Evaluate only, don't apply changes |
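For scripting, these parameters can be turned into a runner command line. The sketch below assumes each parameter corresponds to the CLI flag shown in the Runner Script section (for example, `sdk_type` to `--sdk` and `max_iterations` to `--iterations`); that correspondence is inferred from the examples, and `build_command` is a hypothetical helper.

```python
from typing import List, Sequence


def build_command(
    sdk_type: str = "mini",
    max_iterations: int = 5,
    improvement_threshold: float = 2.0,
    regression_tolerance: float = 5.0,
    levels: Sequence[str] = ("L1", "L2", "L3", "L4", "L5", "L6"),
    output_dir: str = "./eval_results/self_improve",
    dry_run: bool = False,
) -> List[str]:
    """Build an argv list for the runner; defaults mirror the table above."""
    cmd = [
        "python", "-m", "amplihack.eval.self_improve.runner",
        "--sdk", sdk_type,
        "--iterations", str(max_iterations),
        "--improvement-threshold", str(improvement_threshold),
        "--regression-tolerance", str(regression_tolerance),
        "--levels", *levels,
        "--output-dir", output_dir,
    ]
    if dry_run:
        cmd.append("--dry-run")  # evaluate only, don't apply changes
    return cmd


# The list can be passed to subprocess.run(cmd, check=True).
print(" ".join(build_command(max_iterations=3, dry_run=True)))
```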
Programmatic Usage
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)

result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
4-Way Benchmark Mode
Compare all SDK implementations side by side:
User: "Run a 4-way benchmark comparing all SDK implementations"
Skill: Runs eval suite on mini, claude, copilot, microsoft
Generates comparison table with scores, LOC, and coverage.
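A comparison table of this kind might be assembled as below. This is a sketch only: it covers scores (not LOC or coverage), assumes the overall column is an unweighted mean, and uses placeholder numbers rather than real benchmark results. `comparison_table` is a hypothetical helper.

```python
from typing import Dict, List


def comparison_table(results: Dict[str, Dict[str, float]], levels: List[str]) -> str:
    """Render per-SDK, per-level scores as a markdown table."""
    header = "| SDK | " + " | ".join(levels) + " | Overall |"
    separator = "|---" * (len(levels) + 2) + "|"
    rows = []
    for sdk, scores in results.items():
        cells = [f"{scores[level]:.1f}" for level in levels]
        # ASSUMPTION: overall is the unweighted mean of the listed levels.
        overall = sum(scores[level] for level in levels) / len(levels)
        rows.append(f"| {sdk} | " + " | ".join(cells) + f" | {overall:.1f} |")
    return "\n".join([header, separator, *rows])


# Placeholder scores for illustration only.
placeholder = {
    "mini": {"L1": 90.0, "L2": 80.0},
    "claude": {"L1": 85.0, "L2": 75.0},
}
print(comparison_table(placeholder, ["L1", "L2"]))
```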
Integration Points
- src/amplihack/eval/self_improve/runner.py: Self-improvement loop runner
- src/amplihack/eval/self_improve/error_analyzer.py: Failure classification
- src/amplihack/eval/progressive_test_suite.py: L1-L12 eval runner
- src/amplihack/agents/goal_seeking/sdk_adapters/: All 4 SDK implementations
- src/amplihack/eval/metacognition_grader.py: Advanced eval dimensions
- src/amplihack/eval/teaching_session.py: L7 teaching quality eval