model-evaluation-benchmark

Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3. Auto-activates for model benchmarking, comparison evaluation, or performance testing between AI models.

Install this agent skill to your Project

npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/model-evaluation-benchmark

SKILL.md

Model Evaluation Benchmark Skill

Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.

Auto-activates when: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.

Skill Description

This skill orchestrates end-to-end model evaluation benchmarks that measure:

Efficiency: Duration, turns, cost, tool calls
Quality: Code quality scores via reviewer agents
Workflow Adherence: Subagent calls, skills used, workflow step compliance
Artifacts: GitHub issues, PRs, documentation generated

The skill automates the entire benchmark workflow from execution through cleanup, following the v3 reference implementation.

When to Use

✅ Use when:

Comparing AI models (Opus vs Sonnet, etc.)
Measuring workflow adherence
Generating comprehensive benchmark reports
Running benchmarks that must be reproducible

❌ Don't use when:

Simple code reviews (use reviewer)
Performance profiling (use optimizer)
Architecture decisions (use architect)

Execution Instructions

When this skill is invoked, follow these steps:

Phase 1: Setup

Read tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
Identify models to benchmark (default: Opus 4.5, Sonnet 4.5)
Create TodoWrite list with all phases

Phase 2: Execute Benchmarks

For each model, run every task (one run per task × model pair):

bash

cd tests/benchmarks/benchmark_suite_v3
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
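The per-model invocation above can be scripted. The following is a minimal sketch, assuming the runner accepts the `--model` and `--tasks` flags exactly as shown in the command; the helper names `build_command` and `run_all` are hypothetical and are not part of the repository.

```python
import subprocess
from pathlib import Path

# Directory taken from the command above.
SUITE_DIR = Path("tests/benchmarks/benchmark_suite_v3")


def build_command(model, tasks):
    """Return the argv list for one runner invocation (hypothetical helper)."""
    task_arg = ",".join(str(t) for t in tasks)
    return ["python", "run_benchmarks.py", "--model", model, "--tasks", task_arg]


def run_all(models=("opus", "sonnet"), tasks=(1, 2, 3, 4), dry_run=True):
    """Run the suite once per model; with dry_run=True only return the commands."""
    commands = [build_command(m, tasks) for m in models]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing run instead of continuing silently.
            subprocess.run(cmd, cwd=SUITE_DIR, check=True)
    return commands


if __name__ == "__main__":
    for cmd in run_all():
        print(" ".join(cmd))
```

With the default `dry_run=True`, nothing is executed, so the command lines can be reviewed before starting a long benchmark run.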

Phase 3: Analyze Results

Read all result files: ~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json
Launch parallel Task tool calls with subagent_type="reviewer" to:
- Analyze trace logs for tool/agent/skill usage
- Score code quality (1-5 scale)
Synthesize findings
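Collecting the result files before handing them to reviewer agents might look like the sketch below. It assumes only what the step above states: one `result.json` per run directory under the `suite_v3` folder. The field names `duration`, `turns`, and `cost` are illustrative assumptions, not the suite's documented schema, and `load_results` and `summarize` are hypothetical helpers.

```python
import json
from pathlib import Path

# Result location from the step above; each run directory holds one result.json.
RESULTS_ROOT = Path.home() / ".amplihack/.claude/runtime/benchmarks/suite_v3"


def load_results(root=RESULTS_ROOT):
    """Map each run directory name to its parsed result.json."""
    results = {}
    for path in sorted(Path(root).glob("*/result.json")):
        with path.open(encoding="utf-8") as fh:
            results[path.parent.name] = json.load(fh)
    return results


def summarize(results, keys=("duration", "turns", "cost")):
    """Pick selected fields from each result; absent fields become None.

    The key names are assumptions, not the suite's documented schema.
    """
    return {run: {k: data.get(k) for k in keys} for run, data in results.items()}
```

Using `data.get(k)` rather than indexing keeps the summary from failing when a run did not record one of the assumed fields.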

Phase 4: Generate Report

Create markdown report following BENCHMARK_REPORT_V3.md structure
Create GitHub issue with report
Archive artifacts to GitHub release
Update issue with release link
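As a rough illustration of the report and issue steps, the sketch below renders rows of metrics as a markdown table and builds the `gh issue create` command that would post a report file. The actual layout is defined by BENCHMARK_REPORT_V3.md, which is not reproduced here, so the column names passed in are the caller's choice; `render_table` and `issue_command` are hypothetical helpers.

```python
def render_table(rows, columns):
    """Render a list of dicts as a pipe-delimited markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "n/a")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


def issue_command(title, report_path):
    """Build the argv for posting a report file as a new GitHub issue."""
    return ["gh", "issue", "create", "--title", title, "--body-file", str(report_path)]
```

Passing the report through `--body-file` avoids shell-quoting problems that can arise when a long markdown body is supplied inline.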

Phase 5: Cleanup (MANDATORY)

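Close all benchmark PRs: gh pr close {number} (run once per PR)
Close all benchmark issues: gh issue close {number} (run once per issue)
Remove worktrees: git worktree remove {path} for each worktree matching worktrees/bench-* (the command accepts one worktree per invocation)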
Verify cleanup complete

See tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md for detailed cleanup instructions.
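Because `gh pr close`, `gh issue close`, and `git worktree remove` each take a single target, cleanup is easiest to script as one command per item. The sketch below is illustrative only: `cleanup_commands` and `run_cleanup` are hypothetical helpers, and the PR numbers, issue numbers, and worktree paths must be supplied by the caller. CLEANUP_PROCESS.md remains the authoritative procedure.

```python
import subprocess


def cleanup_commands(pr_numbers, issue_numbers, worktree_paths):
    """Build one command per target; each CLI call here takes a single target."""
    commands = [["gh", "pr", "close", str(n)] for n in pr_numbers]
    commands += [["gh", "issue", "close", str(n)] for n in issue_numbers]
    commands += [["git", "worktree", "remove", str(p)] for p in worktree_paths]
    return commands


def run_cleanup(commands, dry_run=True):
    """Execute the commands in order; return those that failed."""
    failed = []
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
            continue
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd)
    return failed
```

An empty list returned from `run_cleanup(..., dry_run=False)` is one way to cover the "Verify cleanup complete" step; any command left in the list needs manual attention.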

Example Usage

User: "Run model evaluation benchmark"

A: I'll run the complete benchmark suite following the v3 reference implementation.

[Executes phases 1-5 above]

Final Report: See GitHub Issue #XXXX
Artifacts: https://github.com/.../releases/tag/benchmark-suite-v3-artifacts

References

Reference Report: tests/benchmarks/benchmark_suite_v3/BENCHMARK_REPORT_V3.md
Task Definitions: tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
Cleanup Guide: tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md
Runner Script: tests/benchmarks/benchmark_suite_v3/run_benchmarks.py

Last Updated: 2025-11-26
Reference Implementation: Benchmark Suite V3
GitHub Issue Example: #1698

Maintainer

rysweet Core maintainer

Source details

Full Name: rysweet/amplihack
Branch: main
Path in repo: .claude/skills/model-evaluation-benchmark
