Agent skill

self-improving-agent-builder

Encodes a continuous improvement loop for goal-seeking agents: EVAL, ANALYZE, RESEARCH (hypothesis + evidence + counter-arguments), IMPROVE, RE-EVAL, DECIDE. Auto-commits improvements (+2% net, no regression >5%) and reverts failures. Works with all 4 SDK implementations. Auto-activates on "improve agent", "self-improving loop", "agent eval loop", "benchmark agents", "run improvement cycle".


Install this agent skill to your Project

npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/self-improving-agent-builder

SKILL.md

Self-Improving Agent Builder

Purpose

Run a closed-loop improvement cycle on any goal-seeking agent implementation:

EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)

Each iteration measures L1-L12 progressive test scores, identifies failures with error_analyzer.py, runs a research step with hypothesis/evidence/counter-arguments, applies targeted fixes, and gates promotion through regression checks.
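The control flow of one cycle can be sketched as follows. This is a minimal illustration, not the code in runner.py: `run_loop` and its four callables are hypothetical names, and ANALYZE, RESEARCH, and IMPROVE are collapsed into a single `improve` callback for brevity.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # level name -> score


def run_loop(
    evaluate: Callable[[], Scores],
    improve: Callable[[Scores], None],
    revert: Callable[[], None],
    decide: Callable[[Scores, Scores], bool],
    iterations: int,
) -> List[bool]:
    """Run the cycle `iterations` times and return the keep/revert history."""
    history: List[bool] = []
    for _ in range(iterations):
        baseline = evaluate()                # EVAL
        improve(baseline)                    # ANALYZE + RESEARCH + IMPROVE (collapsed)
        candidate = evaluate()               # RE-EVAL
        keep = decide(baseline, candidate)   # DECIDE
        if not keep:
            revert()                         # undo the changes made by improve()
        history.append(keep)
    return history
```

Passing the phases in as callables keeps the sketch independent of any particular SDK adapter or eval harness.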

When I Activate

  • "improve agent" or "self-improving loop"
  • "agent eval loop" or "run improvement cycle"
  • "benchmark agents" or "compare SDK implementations"
  • "iterate on agent scores" or "fix agent regressions"

Quick Start

User: "Run the self-improving loop on the mini-framework agent for 3 iterations"

Skill: Executes 3 iterations of EVAL->ANALYZE->RESEARCH->IMPROVE->RE-EVAL->DECIDE
       Reports per-iteration scores, net improvement, and commits/reverts.

Runner Script

The self-improvement loop is implemented as a Python CLI:

```bash
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3

# Full options
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
```

Source: src/amplihack/eval/self_improve/runner.py

The Loop (6 Phases per Iteration)

Phase 1: EVAL

Run the L1-L12 progressive test suite on the current agent implementation.

Execution:

```bash
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
```

Output: Per-level scores and overall baseline.
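How the overall baseline is derived from the per-level scores is not stated here. A minimal sketch, assuming an unweighted mean over the evaluated levels (`overall_score` is a hypothetical helper, not part of the suite):

```python
from statistics import mean
from typing import Dict


def overall_score(level_scores: Dict[str, float]) -> float:
    """Aggregate per-level scores into one number.

    Assumption: every level counts equally. The real suite may weight levels.
    """
    if not level_scores:
        raise ValueError("no level scores to aggregate")
    return mean(level_scores.values())
```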

Phase 2: ANALYZE

Classify failures using error_analyzer.py, which maps each failed question to a failure mode in its taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.) and to the specific code component responsible.

```python
from amplihack.eval.self_improve import analyze_eval_results

# level_results: the per-level results produced by the Phase 1 eval run
analyses = analyze_eval_results(level_results, score_threshold=0.6)
# Each ErrorAnalysis maps to:
#   failure_mode -> affected_component -> prompt_template
```
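To see which failure modes dominate, the classified failures can be tallied. The sketch below uses a hypothetical `FailedQuestion` record as a stand-in; the real ErrorAnalysis fields may differ, and the 0.6 cutoff simply mirrors the `score_threshold` used in the call to `analyze_eval_results`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FailedQuestion:  # hypothetical stand-in, not the real ErrorAnalysis class
    question_id: str
    score: float        # assumed to be a fraction in [0, 1]
    failure_mode: str   # e.g. "retrieval_insufficient"


def failure_histogram(
    questions: Iterable[FailedQuestion], score_threshold: float = 0.6
) -> Counter:
    """Count failure modes among questions scoring below the threshold."""
    return Counter(q.failure_mode for q in questions if q.score < score_threshold)
```

Calling `failure_histogram(...).most_common()` then lists the most frequent failure modes first, which gives the research step a natural starting point.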

Phase 3: RESEARCH (New)

The critical thinking step that prevents blind changes. For each proposed improvement:

  1. State hypothesis: What specific change will fix the failure?
  2. Gather evidence: From eval results, failure patterns, baseline scores
  3. Consider counter-arguments: What could go wrong? Risk of regression?
  4. Make decision: Apply, skip, or defer with full reasoning

Decisions are logged in research_decisions.json for auditability.

Decision criteria:

  • Apply: Clear failure pattern + prompt template available + low score
  • Skip: Score above 50% (likely stochastic variation)
  • Defer: Ambiguous evidence, needs more data
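The decision criteria above can be sketched as a small function plus an audit record. Everything here is illustrative: `ResearchDecision`, `decide_action`, and `to_json` are hypothetical names, the field layout is not the actual schema of research_decisions.json, and scores are assumed to be fractions in [0, 1].

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ResearchDecision:  # hypothetical record; the real JSON schema may differ
    hypothesis: str
    evidence: List[str]
    counter_arguments: List[str]
    decision: str   # "apply", "skip", or "defer"
    reasoning: str


def decide_action(
    score: float,
    has_clear_pattern: bool,
    has_prompt_template: bool,
    skip_above: float = 0.5,
) -> str:
    """Map the three decision criteria to apply / skip / defer."""
    if score > skip_above:
        return "skip"    # likely stochastic variation
    if has_clear_pattern and has_prompt_template:
        return "apply"   # clear pattern + template available + low score
    return "defer"       # ambiguous evidence, needs more data


def to_json(decisions: List[ResearchDecision]) -> str:
    """Serialize decisions for an audit log."""
    return json.dumps([asdict(d) for d in decisions], indent=2)
```

Recording the counter-arguments alongside the decision keeps the reasoning reviewable after the fact, which is the point of the audit log.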

Phase 4: IMPROVE

Apply the improvements approved by the research step. Priority order:

  1. Prompt template improvements (safest, highest impact)
  2. Retrieval strategy adjustments
  3. Code logic fixes (most risky, needs careful review)
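One way to honor this priority order is to sort the approved fixes by category before applying them. A minimal sketch; the category keys and the `kind` field are hypothetical labels chosen for illustration:

```python
from typing import Dict, List

# Lower number = applied first (safest first, riskiest last).
PRIORITY = {
    "prompt_template": 0,
    "retrieval_strategy": 1,
    "code_logic": 2,
}


def order_fixes(fixes: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return fixes sorted by category priority.

    sorted() is stable, so fixes within one category keep their original order.
    """
    return sorted(fixes, key=lambda fix: PRIORITY[fix["kind"]])
```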

Phase 5: RE-EVAL

Re-run the same eval suite after applying fixes to measure impact.

Phase 6: DECIDE

Promotion gate:

  • Any single level regression > 5%: REVERT all changes, even if the overall score improved
  • Net improvement >= +2% overall score with no such regression: COMMIT the changes
  • Otherwise: COMMIT with marginal improvement note
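The gate can be sketched as a pure function over the before and after scores. This is an illustration under stated assumptions, not the runner's actual logic: `promotion_gate` is a hypothetical name, scores are treated as percentages, the overall score is taken as an unweighted mean of the levels, and the regression check is evaluated before the improvement check.

```python
from statistics import mean
from typing import Dict


def promotion_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    improvement_threshold: float = 2.0,
    regression_tolerance: float = 5.0,
) -> str:
    """Return "revert", "commit", or "commit_marginal" for one iteration."""
    # Largest per-level drop in percentage points (negative if every level improved).
    worst_regression = max(baseline[level] - candidate[level] for level in baseline)
    if worst_regression > regression_tolerance:
        return "revert"
    net_improvement = mean(candidate.values()) - mean(baseline.values())
    if net_improvement >= improvement_threshold:
        return "commit"
    return "commit_marginal"
```

Checking regressions first means a large gain on one level cannot mask a serious loss on another.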

Configuration

| Parameter | Default | Description |
|---|---|---|
| sdk_type | mini | Which SDK: mini/claude/copilot/microsoft |
| max_iterations | 5 | Maximum improvement iterations |
| improvement_threshold | 2.0 | Minimum % improvement to commit |
| regression_tolerance | 5.0 | Maximum % regression on any level |
| levels | L1-L6 | Which levels to evaluate |
| output_dir | ./eval_results/self_improve | Results directory |
| dry_run | false | Evaluate only, don't apply changes |

Programmatic Usage

```python
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)

result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
```

4-Way Benchmark Mode

Compare all SDK implementations side by side:

User: "Run a 4-way benchmark comparing all SDK implementations"

Skill: Runs eval suite on mini, claude, copilot, microsoft
       Generates comparison table with scores, LOC, and coverage.
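The score portion of such a comparison table could be rendered along these lines. This is a sketch only: `comparison_table` is a hypothetical helper, it covers scores but not LOC or coverage, and it assumes level names of the form "L<number>".

```python
from typing import Dict


def comparison_table(results: Dict[str, Dict[str, float]]) -> str:
    """Render per-SDK, per-level scores as a fixed-width text table."""
    # Collect every level seen across SDKs, ordered numerically (L2 before L10).
    levels = sorted(
        {level for scores in results.values() for level in scores},
        key=lambda level: int(level[1:]),
    )
    rows = ["SDK".ljust(12) + "".join(level.rjust(8) for level in levels)]
    for sdk, scores in results.items():
        cells = "".join(
            f"{scores[level]:8.1f}" if level in scores else "n/a".rjust(8)
            for level in levels
        )
        rows.append(sdk.ljust(12) + cells)
    return "\n".join(rows)
```

Levels that an SDK was not evaluated on are shown as n/a rather than zero, so a missing run is not mistaken for a failing one.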

Integration Points

  • src/amplihack/eval/self_improve/runner.py: Self-improvement loop runner
  • src/amplihack/eval/self_improve/error_analyzer.py: Failure classification
  • src/amplihack/eval/progressive_test_suite.py: L1-L12 eval runner
  • src/amplihack/agents/goal_seeking/sdk_adapters/: All 4 SDK implementations
  • src/amplihack/eval/metacognition_grader.py: Advanced eval dimensions
  • src/amplihack/eval/teaching_session.py: L7 teaching quality eval
