Agent skill
model-evaluation-benchmark
Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3. Auto-activates for model benchmarking, comparison evaluation, or performance testing between AI models.
Install this agent skill to your Project
npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/model-evaluation-benchmark
SKILL.md
Model Evaluation Benchmark Skill
Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.
Auto-activates when: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.
Skill Description
This skill orchestrates end-to-end model evaluation benchmarks that measure:
- Efficiency: Duration, turns, cost, tool calls
- Quality: Code quality scores via reviewer agents
- Workflow Adherence: Subagent calls, skills used, workflow step compliance
- Artifacts: GitHub issues, PRs, documentation generated
The skill automates the entire benchmark workflow from execution through cleanup, following the v3 reference implementation.
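The four metric groups above could be collected into one record per task × model run. The following is a minimal sketch only: the class name and every field name are hypothetical and are not taken from the suite's actual result schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BenchmarkRecord:
    """One task x model run. All field names are hypothetical, not the suite's schema."""

    model: str
    task_id: int
    # Efficiency
    duration_seconds: float = 0.0
    turns: int = 0
    cost_usd: float = 0.0
    tool_calls: int = 0
    # Quality: reviewer-assigned score on the 1-5 scale (None until reviewed)
    quality_score: Optional[int] = None
    # Workflow adherence
    subagent_calls: int = 0
    skills_used: List[str] = field(default_factory=list)
    workflow_steps_completed: int = 0
    # Artifacts
    issue_urls: List[str] = field(default_factory=list)
    pr_urls: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject scores outside the 1-5 scale used by the reviewer agents.
        if self.quality_score is not None and not 1 <= self.quality_score <= 5:
            raise ValueError("quality_score must be between 1 and 5")
```

Keeping the groups in a single record makes it straightforward to compare two models on the same task side by side.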
When to Use
✅ Use when:
- Comparing AI models (Opus vs Sonnet, etc.)
- Measuring workflow adherence
- Generating comprehensive benchmark reports
- Running benchmarks that must be reproducible
❌ Don't use when:
- Simple code reviews (use reviewer)
- Performance profiling (use optimizer)
- Architecture decisions (use architect)
Execution Instructions
When this skill is invoked, follow these steps:
Phase 1: Setup
- Read tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
- Identify models to benchmark (default: Opus 4.5, Sonnet 4.5)
- Create TodoWrite list with all phases
Phase 2: Execute Benchmarks
For each task × model:
cd tests/benchmarks/benchmark_suite_v3
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
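The per-model invocation above can be scripted. This is a minimal sketch that uses only the --model and --tasks flags shown in the command; the helper names build_command and run_all are hypothetical, and it assumes the runner handles each listed task itself. By default it only prints the commands instead of executing them.

```python
import subprocess
from pathlib import Path
from typing import List

# Directory containing run_benchmarks.py, as given in the command above.
SUITE_DIR = Path("tests/benchmarks/benchmark_suite_v3")


def build_command(model: str, tasks: List[int]) -> List[str]:
    """Build the runner invocation for one model (hypothetical helper)."""
    task_arg = ",".join(str(t) for t in tasks)
    return ["python", "run_benchmarks.py", "--model", model, "--tasks", task_arg]


def run_all(models: List[str], tasks: List[int], dry_run: bool = True) -> List[List[str]]:
    """Print (dry run) or execute one runner invocation per model."""
    commands = [build_command(m, tasks) for m in models]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the runner exits non-zero.
            subprocess.run(cmd, cwd=SUITE_DIR, check=True)
    return commands
```

Calling run_all(["opus", "sonnet"], [1, 2, 3, 4]) prints the two commands; pass dry_run=False to execute them from the suite directory.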
Phase 3: Analyze Results
- Read all result files: ~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json
- Launch parallel Task tool calls with subagent_type="reviewer" to:
  - Analyze trace logs for tool/agent/skill usage
  - Score code quality (1-5 scale)
  - Synthesize findings
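Collecting the result files before handing them to reviewers might look like the sketch below. The glob pattern comes from the path above; the helper names are hypothetical, and because the layout of result.json is not documented here, each file is loaded as a plain dictionary. The mean_quality helper assumes reviewers report integer scores on the 1-5 scale.

```python
import json
from pathlib import Path
from typing import Dict, List

# Results root taken from the path above.
RESULTS_ROOT = Path.home() / ".amplihack/.claude/runtime/benchmarks/suite_v3"


def load_results(root: Path = RESULTS_ROOT) -> Dict[str, dict]:
    """Map each run directory name to its parsed result.json."""
    results = {}
    for path in sorted(root.glob("*/result.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.parent.name] = json.load(handle)
    return results


def mean_quality(scores: List[int]) -> float:
    """Average reviewer scores after checking each is on the 1-5 scale."""
    if not scores:
        raise ValueError("no scores to average")
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 scale")
    return sum(scores) / len(scores)
```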
Phase 4: Generate Report
- Create markdown report following BENCHMARK_REPORT_V3.md structure
- Create GitHub issue with report
- Archive artifacts to GitHub release
- Update issue with release link
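The publishing steps above map onto GitHub CLI calls roughly as sketched below. The function names, the release notes text, and the choice to add the release link as an issue comment are all assumptions; the functions only build argument lists and do not run anything.

```python
from typing import List


def report_commands(report_path: str, title: str, tag: str,
                    artifacts: List[str]) -> List[List[str]]:
    """Build gh invocations that file the report and archive the artifacts."""
    return [
        # Create the issue with the markdown report as its body.
        ["gh", "issue", "create", "--title", title, "--body-file", report_path],
        # Create a release for the tag and upload the artifact files.
        ["gh", "release", "create", tag, *artifacts,
         "--title", title, "--notes", "Benchmark artifacts"],
    ]


def link_release_command(issue_number: int, release_url: str) -> List[str]:
    """Build a gh invocation that posts the release link on the issue."""
    return ["gh", "issue", "comment", str(issue_number),
            "--body", f"Artifacts: {release_url}"]
```

Each list can be passed to subprocess.run once the issue number and release URL are known from the earlier steps.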
Phase 5: Cleanup (MANDATORY)
- Close all benchmark PRs: gh pr close {numbers}
- Close all benchmark issues: gh issue close {numbers}
- Remove worktrees: git worktree remove worktrees/bench-*
- Verify cleanup complete
See tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md for detailed cleanup instructions.
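The cleanup steps can be expanded into one command per target, which is the safe form if the tools accept only a single PR, issue, or worktree per invocation. This is a sketch with hypothetical helper names; it builds the commands without running them and does not replace CLEANUP_PROCESS.md.

```python
from pathlib import Path
from typing import List


def cleanup_commands(pr_numbers: List[int], issue_numbers: List[int],
                     worktree_root: Path = Path("worktrees")) -> List[List[str]]:
    """Build one close/remove command per PR, issue, and bench-* worktree."""
    commands = []
    for number in pr_numbers:
        commands.append(["gh", "pr", "close", str(number)])
    for number in issue_numbers:
        commands.append(["gh", "issue", "close", str(number)])
    # Expand the bench-* pattern here so each worktree gets its own command.
    for worktree in sorted(worktree_root.glob("bench-*")):
        commands.append(["git", "worktree", "remove", str(worktree)])
    return commands


def remaining_worktrees(worktree_root: Path = Path("worktrees")) -> List[str]:
    """List bench-* worktree directories still on disk; empty means clean."""
    return sorted(str(p) for p in worktree_root.glob("bench-*"))
```

After running the commands, an empty list from remaining_worktrees() is one way to confirm the worktree part of the cleanup; open PRs and issues still need to be checked separately.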
Example Usage
User: "Run model evaluation benchmark"
Assistant: I'll run the complete benchmark suite following the v3 reference implementation.
[Executes phases 1-5 above]
Final Report: See GitHub Issue #XXXX
Artifacts: https://github.com/.../releases/tag/benchmark-suite-v3-artifacts
References
- Reference Report: tests/benchmarks/benchmark_suite_v3/BENCHMARK_REPORT_V3.md
- Task Definitions: tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
- Cleanup Guide: tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md
- Runner Script: tests/benchmarks/benchmark_suite_v3/run_benchmarks.py
Last Updated: 2025-11-26
Reference Implementation: Benchmark Suite V3
GitHub Issue Example: #1698