eval-recipes-runner

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents. Auto-activates when testing improvements, running evals, or benchmarking changes.

Install this agent skill to your Project

npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/eval-recipes-runner

SKILL.md

eval-recipes Runner Skill

Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

When to Use

User asks to "test with eval-recipes"
User says "run the evals" or "benchmark this change"
User wants to validate improvements against codex/claude_code
User is testing a PR branch to show that it improves scores

Capabilities

I can run eval-recipes benchmarks to:

Test specific amplihack branches
Compare against baseline agents (codex, claude_code)
Run specific tasks (linkedin_drafting, email_drafting, etc.)
Compare before/after scores for PRs
Generate reports with score improvements

How It Works

Setup (One-Time)

bash

# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

# Copy our agent configs (run this from the amplihack repo root, where .claude/ lives)
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

# Install dependencies
cd ~/eval-recipes
uv sync

Running Benchmarks

Test a specific branch:

bash

# Update install.dockerfile to use specific branch
# Then run benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3

Compare before/after:

bash

# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

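# Test PR branch (assumes a separate data/agents/amplihack_pr1443 config
# whose install.dockerfile checks out the PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare scores (results are saved under .benchmark_results/)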
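The comparison step above can be reduced to a single signed number. The following is a minimal sketch, assuming you have already read the two scores off the benchmark output by hand; `score_delta` is a hypothetical helper and not part of eval-recipes.

```shell
# score_delta BASELINE CANDIDATE  (hypothetical helper, not part of eval-recipes)
# Prints CANDIDATE - BASELINE as a signed number with one decimal place.
score_delta() {
  awk -v base="$1" -v cand="$2" 'BEGIN { printf "%+.1f\n", cand - base }'
}

# Example using the LinkedIn numbers quoted elsewhere in this document.
score_delta 6.5 35.2
```

A positive result means the PR branch scored higher than the baseline on that task.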

Available Tasks

Common tasks from eval-recipes:

linkedin_drafting - Create a tool for LinkedIn posts (scored 6.5/100 before PR #1443)
email_drafting - Create a CLI tool for emails (scored 26/100 before PR #1443)
arxiv_paper_summarizer - Research tool
github_docs_extractor - Documentation tool
Many more in ~/eval-recipes/data/tasks/

Typical Workflow

When the user says "test this change with eval-recipes":

Identify the branch/PR to test

Update agent config to use that branch:

dockerfile

# In .claude/agents/eval-recipes/amplihack/install.dockerfile
RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
    cd /tmp/amplihack && \
    git checkout BRANCH_NAME && \
    pip install -e .

Copy to eval-recipes:

bash

cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

Run benchmark:

bash

cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3

Report scores and compare with baseline
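The branch-switching step in this workflow can be scripted instead of edited by hand. This is a rough sketch, assuming the dockerfile contains a `git checkout <ref>` command like the one shown above; `set_checkout_branch` is a hypothetical helper, and it does not escape branch names that contain `|` or `&`.

```shell
# set_checkout_branch DOCKERFILE BRANCH  (hypothetical helper, not part of eval-recipes)
# Rewrites each "git checkout <ref>" in DOCKERFILE so that it checks out BRANCH.
set_checkout_branch() (
  dockerfile=$1
  branch=$2
  tmp=$(mktemp) || exit 1
  # '|' is the sed delimiter because branch names usually contain '/'.
  sed "s|git checkout [^ ]*|git checkout $branch|" "$dockerfile" > "$tmp" &&
    cat "$tmp" > "$dockerfile"   # overwrite in place, keeping the file's permissions
  status=$?
  rm -f "$tmp"
  exit "$status"
)
```

For example, `set_checkout_branch .claude/agents/eval-recipes/amplihack/install.dockerfile feat/issue-1435-task-classification` would prepare the config for the copy and run steps.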

Expected Scores

Baseline (main branch):

Overall: 40.6/100
LinkedIn: 6.5/100
Email: 26/100

With PR #1443 (task classification):

Overall: 55-60/100 expected (+15-20 points over the 40.6 baseline)
LinkedIn: 30-40/100 (creates actual tool)
Email: 45/100 (consistent execution)
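If you want a pass/fail gate against these expectations, a threshold check is enough. The sketch below assumes you supply the measured score and the lower bound yourself; `meets_target` is a hypothetical helper, and the thresholds are simply the lower ends of the ranges listed above.

```shell
# meets_target SCORE MINIMUM  (hypothetical helper, not part of eval-recipes)
# Succeeds (exit status 0) when SCORE is greater than or equal to MINIMUM.
meets_target() {
  awk -v s="$1" -v min="$2" 'BEGIN { exit !(s + 0 >= min + 0) }'
}

# Example: a LinkedIn score of 35.2 against the lower bound of 30.
if meets_target 35.2 30; then
  echo "LinkedIn score is within the expected range"
fi
```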

Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

Update install.dockerfile to checkout feat/issue-1435-task-classification
Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
Run: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
Report results: "Score: 35.2/100 (up from 6.5 baseline)"

Prerequisites

eval-recipes cloned to ~/eval-recipes
API key in environment: export ANTHROPIC_API_KEY=sk-ant-...
Docker installed (for containerized runs)
uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
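These prerequisites can be verified before a run. The following is a sketch only: `missing_prereqs` and the `EVAL_RECIPES_DIR` override are hypothetical names introduced for illustration, and the check confirms only that each tool is on `PATH`, not that it works.

```shell
# missing_prereqs  (hypothetical helper, not part of eval-recipes)
# Prints one line per unmet prerequisite; prints nothing when all are met.
missing_prereqs() {
  for tool in git docker uv; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing command: $tool"
  done
  [ -n "${ANTHROPIC_API_KEY:-}" ] || echo "ANTHROPIC_API_KEY is not set"
  # EVAL_RECIPES_DIR is an assumed override; the default matches this document.
  [ -d "${EVAL_RECIPES_DIR:-${HOME:-}/eval-recipes}" ] || echo "eval-recipes checkout not found"
  return 0
}
```

Calling `missing_prereqs` with no arguments and getting empty output suggests the environment is ready for a benchmark run.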

Notes

Benchmarks take 2-15 minutes per task depending on complexity
Multiple trials (3-5) give more reliable averages
Docker builds can be cached for speed
Results saved to .benchmark_results/ in eval-recipes repo
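To illustrate why multiple trials help, the per-trial scores can be averaged into one figure. This is a small sketch that assumes you pass the individual trial scores as arguments; `mean_score` is a hypothetical helper, and the scores in the example are made up.

```shell
# mean_score SCORE...  (hypothetical helper, not part of eval-recipes)
# Prints the arithmetic mean of its arguments with one decimal place.
mean_score() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | awk '{ sum += $1; n++ } END { printf "%.1f\n", sum / n }'
}

# Example with three made-up trial scores.
mean_score 30 35.2 40
```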

Automation

For fully autonomous testing:

bash

# Test suite for a PR (run from the eval-recipes checkout)
cd ~/eval-recipes
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task "$task" --trials 3
done

# Compare results
cat .benchmark_results/*/amplihack/*/score.txt
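The final `cat` prints bare numbers with no indication of which file each came from. A possible refinement is sketched below, assuming the directory layout implied by the glob above and that each `score.txt` holds a single value; `summarize_scores` is a hypothetical helper.

```shell
# summarize_scores RESULTS_DIR AGENT  (hypothetical helper, not part of eval-recipes)
# Prints "<path><TAB><score>" for every score.txt found for AGENT.
summarize_scores() {
  for f in "$1"/*/"$2"/*/score.txt; do
    [ -f "$f" ] || continue   # the glob matched nothing
    printf '%s\t%s\n' "$f" "$(cat "$f")"
  done
}
```

Running `summarize_scores .benchmark_results amplihack` from the eval-recipes checkout would list each score next to the path it was read from.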

Maintainer

rysweet Core maintainer

Source details

Full Name: rysweet/amplihack
Branch: main
Path in repo: .claude/skills/eval-recipes-runner
