Agent skill

swe-bench-lite

Quick-start command to run a SWE-bench Lite evaluation with sensible defaults.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/benchmark-swe-lite

SKILL.md

Instructions

This skill provides a streamlined way to run the SWE-bench Lite benchmark with pre-configured defaults.

What This Skill Does

This skill runs a quick SWE-bench Lite evaluation with:

  • 5 sample tasks (configurable)
  • Verbose output for visibility
  • Results saved to results.json
  • Report saved to report.md

Prerequisites Check

Before running, verify the following (a combined pre-flight check is sketched after this list):

  1. Docker is running:

    bash
    docker ps
    
  2. API key is set:

    bash
    echo $ANTHROPIC_API_KEY
    
  3. Config file exists:

    • Check for mcpbr.yaml in the current directory
    • If missing, run mcpbr init to generate it
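
For convenience, all three checks can be rolled into one shell snippet. This is a minimal sketch that assumes mcpbr.yaml lives in the current directory; adjust the path if yours is elsewhere:

bash
# Pre-flight check: print a message for anything that is missing.
docker ps > /dev/null 2>&1 || echo "Docker is not running"
[ -n "$ANTHROPIC_API_KEY" ] || echo "ANTHROPIC_API_KEY is not set"
[ -f mcpbr.yaml ] || echo "mcpbr.yaml not found (run: mcpbr init)"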

Default Command

The default command for SWE-bench Lite:

bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md

Customization Options

Users can customize the run by modifying the following (a combined example follows the list):

  • Sample size: Change -n 5 to any number (or remove the flag to run the full dataset)
  • Config file: Change -c mcpbr.yaml to point to a different config
  • Verbosity: Use -vv for very verbose output
  • Output files: Change results.json and report.md to different paths
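
Putting several of these together, a 10-task run with very verbose output and custom paths might look like this (the flags are the ones listed above; the file names are placeholders):

bash
mcpbr run -c my-config.yaml --dataset SWE-bench/SWE-bench_Lite -n 10 -vv -o my-results.json -r my-report.md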

Example Variations

Minimal quick test (1 task)

bash
mcpbr run -c mcpbr.yaml -n 1 -v

Full evaluation (all 300 tasks)

bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -v -o results.json

MCP-only (skip baseline)

bash
mcpbr run -c mcpbr.yaml -n 5 -M -v -o results.json

Specific tasks

bash
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099 -v

Expected Runtime & Cost

For 5 tasks with default settings:

  • Runtime: 15-30 minutes (depends on task complexity)
  • Cost: $2-5 (depends on task complexity and model used)

What to Do If It Fails

  1. Docker not running: Start Docker Desktop
  2. API key missing: Set with export ANTHROPIC_API_KEY="sk-ant-..."
  3. Config missing: Run mcpbr init to generate default config
  4. Config invalid: Check that {workdir} placeholder is in the args array
  5. MCP server fails: Test the server command independently (see the sketch after this list)
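
For step 5, one way to smoke-test a stdio MCP server is to pipe it a JSON-RPC initialize request and check that it answers. The server command below is a hypothetical stand-in; substitute the command and args from your mcpbr.yaml:

bash
# Hypothetical server command; replace with the one from mcpbr.yaml.
# A healthy stdio MCP server should print a JSON-RPC response.
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke-test","version":"0.0.0"}}}' \
  | npx -y @modelcontextprotocol/server-filesystem /tmp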

After the Run

Once complete, you'll have:

  • results.json: Full evaluation data with metrics, token usage, and per-task results
  • report.md: Human-readable summary with resolution rates and comparisons
  • Console output: Real-time progress and summary table

Review the results to see how your MCP server performed compared to the baseline!
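
To get a quick look at the raw data, you can pretty-print the JSON from the shell. The exact schema depends on the mcpbr version, so treat the structure as an unknown to explore rather than a fixed contract:

bash
# Pretty-print the results and (if jq is installed and the top level
# is an object) list its top-level keys.
python3 -m json.tool results.json | head -n 40
jq 'keys' results.json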

Pro Tips

  • Start with -n 1 to verify everything works before running larger evaluations
  • Use --log-dir logs/ to save detailed per-task logs for debugging
  • Compare multiple runs by changing the MCP server config between runs
  • Use --baseline-results baseline.json to detect regressions between versions
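
For instance, a cautious workflow that exercises most of these tips might look like the following (flag names are taken from the tips above; file names are placeholders):

bash
# Smoke-test run with per-task logs saved for debugging.
mcpbr run -c mcpbr.yaml -n 1 -v --log-dir logs/ -o results.json -r report.md
# Later, compare a larger run against a saved baseline to catch regressions.
mcpbr run -c mcpbr.yaml -n 5 -v -o new-results.json --baseline-results baseline.json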
