Agent skill
eval-recipes-runner
Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents. Auto-activates when testing improvements, running evals, or benchmarking changes.
Install this agent skill to your Project
npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/eval-recipes-runner
SKILL.md
eval-recipes Runner Skill
Purpose
Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.
When to Use
- User asks to "test with eval-recipes"
- User says "run the evals" or "benchmark this change"
- User wants to validate improvements against codex/claude_code
- Testing a PR branch to prove it improves scores
Capabilities
I can run eval-recipes benchmarks to:
- Test specific amplihack branches
- Compare against baseline agents (codex, claude_code)
- Run specific tasks (linkedin_drafting, email_drafting, etc.)
- Compare before/after scores for PRs
- Generate reports with score improvements
How It Works
Setup (One-Time)
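# Run from the root of your amplihack checkout so the agent configs can be found later
AMPLIHACK_DIR="$(pwd)"
# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes
cd ~/eval-recipes
# Copy our agent configs from the amplihack checkout
cp -r "$AMPLIHACK_DIR"/.claude/agents/eval-recipes/* data/agents/
# Install dependencies
uv sync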
Running Benchmarks
Test a specific branch:
# Update install.dockerfile to use specific branch
# Then run benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
Compare before/after:
# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting
# Test PR branch (assumes an amplihack_pr1443 agent config whose install.dockerfile checks out the PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting
# Compare scores
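The comparison step above is left open. A minimal sketch of one way to do it, assuming you have read the two scores off the benchmark output by hand; `score_delta` is a hypothetical helper, not part of eval-recipes:

```shell
# Sketch: print the signed difference between a baseline and a candidate score.
# score_delta is a hypothetical helper name, not an eval-recipes command.
score_delta() {
  # $1 = baseline score, $2 = candidate score
  # awk does the floating-point arithmetic that plain shell lacks
  awk -v b="$1" -v c="$2" 'BEGIN { printf "%+.1f\n", c - b }'
}

# Example using the LinkedIn numbers quoted elsewhere in this document
score_delta 6.5 35.2
```

A positive result means the PR branch scored higher than the baseline.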
Available Tasks
Common tasks from eval-recipes:
- linkedin_drafting - Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
- email_drafting - Create CLI tool for emails (scored 26/100 before)
- arxiv_paper_summarizer - Research tool
- github_docs_extractor - Documentation tool
- Many more in ~/eval-recipes/data/tasks/
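To see which tasks a local checkout provides, something like the following may help. It is a sketch that assumes each task lives in its own subdirectory of data/tasks/, which this document does not confirm; `list_tasks` is a hypothetical helper name.

```shell
# Sketch: list task names, assuming one subdirectory per task.
# list_tasks is a hypothetical helper; the default path is the one used in this document.
list_tasks() {
  tasks_dir="${1:-$HOME/eval-recipes/data/tasks}"
  for entry in "$tasks_dir"/*/; do
    # Skip the unexpanded glob when the directory is missing or empty
    [ -d "$entry" ] || continue
    basename "$entry"
  done
}
```

Calling `list_tasks` with no argument reads the default location; pass a different directory as the first argument to inspect another checkout.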
Typical Workflow
When user says "test this change with eval-recipes":
- Identify the branch/PR to test
- Update agent config to use that branch:
    # In .claude/agents/eval-recipes/amplihack/install.dockerfile
    RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
        cd /tmp/amplihack && \
        git checkout BRANCH_NAME && \
        pip install -e .
- Copy to eval-recipes:
    cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
- Run benchmark:
    cd ~/eval-recipes
    uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
- Report scores and compare with baseline
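The "update agent config" step can be scripted rather than edited by hand. The sketch below assumes the dockerfile contains a single `git checkout <name>` token sequence, as in the snippet above, and that the branch name contains no `|` or `&` characters; `set_checkout_branch` is a hypothetical helper name.

```shell
# Sketch: rewrite the "git checkout ..." line of an install.dockerfile.
# set_checkout_branch is a hypothetical helper, not part of amplihack or eval-recipes.
set_checkout_branch() {
  dockerfile="$1"
  branch="$2"
  tmp="$(mktemp)" || return 1
  # "|" is the sed delimiter so branch names containing "/" need no escaping
  sed "s|git checkout [^ ]*|git checkout $branch|" "$dockerfile" > "$tmp" &&
    mv "$tmp" "$dockerfile"
}
```

For example, `set_checkout_branch .claude/agents/eval-recipes/amplihack/install.dockerfile feat/issue-1435-task-classification` would be run from the amplihack checkout before copying the configs to eval-recipes.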
Expected Scores
Baseline (main branch):
- Overall: 40.6/100
- LinkedIn: 6.5/100
- Email: 26/100
With PR #1443 (task classification):
- Overall: 55-60/100 expected (+15-20 points)
- LinkedIn: 30-40/100 (creates actual tool)
- Email: 45/100 (consistent execution)
Example Usage
User says: "Test PR #1443 with eval-recipes on the LinkedIn task"
I do:
- Update install.dockerfile to checkout feat/issue-1435-task-classification
- Copy to eval-recipes:
    cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
- Run:
    cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
- Report results: "Score: 35.2/100 (up from 6.5 baseline)"
Prerequisites
- eval-recipes cloned to ~/eval-recipes
- API key in environment: export ANTHROPIC_API_KEY=sk-ant-...
- Docker installed (for containerized runs)
- uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
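A quick preflight check can catch a missing prerequisite before a long benchmark run. This is a sketch only: `missing_tools` is a hypothetical helper, and the list of commands is inferred from the prerequisites and setup steps in this document.

```shell
# Sketch: report which required commands are missing from PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Prints one line per missing prerequisite; prints nothing when all are present
missing_tools git docker uv
[ -n "${ANTHROPIC_API_KEY:-}" ] || echo "ANTHROPIC_API_KEY is not set"
[ -d "$HOME/eval-recipes" ] || echo "$HOME/eval-recipes not found"
```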
Notes
- Benchmarks take 2-15 minutes per task depending on complexity
- Multiple trials (3-5) give more reliable averages
- Docker builds can be cached for speed
- Results saved to .benchmark_results/ in eval-recipes repo
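Since several trials are averaged, a small helper can compute the mean from per-trial scores. The sketch below assumes one numeric score per input line; the sample values are made up for illustration, and `mean_score` is a hypothetical helper name.

```shell
# Sketch: average per-trial scores read from stdin, one number per line.
# mean_score is a hypothetical helper name.
mean_score() {
  awk 'NF { sum += $1; n++ }
       END { if (n) printf "%.1f\n", sum / n; else print "no scores" }'
}

# Illustrative values only, not real benchmark results
printf '%s\n' 30 35 40 | mean_score
```

Blank lines are skipped, so the helper can take the concatenated contents of several score files if each holds a single number.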
Automation
For fully autonomous testing:
# Test suite for a PR
cd ~/eval-recipes
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
# $tasks is left unquoted on purpose so it splits into one word per task
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task "$task" --trials 3
done
# Compare results
cat .benchmark_results/*/amplihack/*/score.txt