Agent skill
add-benchmark
Add a new SWE benchmark task from a real GitHub bug-fix. Use when the user provides a GitHub issue or PR URL and wants to add it to the bench-swe pipeline.
Install this agent skill in your project:
npx add-skill https://github.com/ory/lumen/tree/main/.claude/skills/add-benchmark
SKILL.md
Add SWE Benchmark
Add a new benchmark task to the bench-swe pipeline from a real GitHub bug-fix. The human provides the GitHub issue or PR URL; the agent handles extraction, validation, and file creation.
Arguments
- url (required): GitHub issue or PR URL (e.g. https://github.com/gorilla/mux/issues/534 or https://github.com/gorilla/mux/pull/585)
- language (required): One of: go, python, typescript, javascript, rust, ruby, java, c, cpp, php, csharp
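As an illustration of what validating these two arguments could look like, here is a minimal Python sketch. The helper name `validate_arguments` and the URL pattern are assumptions made for this example, not the task-curator agent's actual implementation; only the language list comes from the skill itself.

```python
import re

# Languages listed in the Arguments section.
SUPPORTED_LANGUAGES = {
    "go", "python", "typescript", "javascript", "rust", "ruby",
    "java", "c", "cpp", "php", "csharp",
}

# Assumed URL shape: https://github.com/<owner>/<repo>/issues/<n> or .../pull/<n>
_GITHUB_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/\s]+)/(?P<repo>[^/\s]+)/"
    r"(?P<kind>issues|pull)/(?P<number>\d+)/?$"
)


def validate_arguments(url: str, language: str) -> dict:
    """Hypothetical helper: return the parsed URL parts or raise ValueError."""
    match = _GITHUB_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not a GitHub issue or PR URL: {url!r}")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    parts["language"] = language
    return parts
```

Calling `validate_arguments("https://github.com/gorilla/mux/issues/534", "go")` returns the owner, repo, kind, and issue number as a dictionary, while an unknown language or a non-GitHub URL raises `ValueError`.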
Repository selection criteria
Good benchmark repos are focused libraries with a clear bug, not large applications. Before submitting a URL, check that the repo meets these criteria:
- Size: < 50 MB and < 800 source files (excluding vendor/ and node_modules/)
- Dependencies: < 50 direct dependencies (go.mod, package.json, etc.)
- Scope: a library or small service, not a monorepo or full application
The agent will reject repos that exceed these limits.
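To pre-check a local clone against the size limits before submitting a URL, something like the following sketch may help. It is not the agent's own check. It makes several assumptions: every file outside the excluded directories counts as a "source file", `.git` is excluded as well, the excluded directories are left out of the byte total too, and "50 MB" is read as 50 MiB. Dependency counting is not covered here.

```python
import os

# Limits taken from the criteria above.
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB
MAX_SOURCE_FILES = 800
# vendor/ and node_modules/ come from the criteria; .git is an added assumption.
EXCLUDED_DIRS = {"vendor", "node_modules", ".git"}


def measure_repo(root: str) -> tuple[int, int]:
    """Return (total_bytes, file_count) for files outside excluded directories."""
    total_bytes = 0
    file_count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks to avoid broken links and double counting
            total_bytes += os.path.getsize(path)
            file_count += 1
    return total_bytes, file_count


def within_limits(root: str) -> bool:
    """True if the clone at `root` is under both limits."""
    total_bytes, file_count = measure_repo(root)
    return total_bytes < MAX_BYTES and file_count < MAX_SOURCE_FILES
```

Running `within_limits(".")` from the root of a clone gives a quick yes/no answer; the agent's own thresholds and file classification remain authoritative.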
Steps
- Dispatch the task-curator agent with the provided arguments. The agent will:
  - Validate inputs (URL, language)
  - Check repository size and dependency count (rejects oversized repos)
  - Resolve the fix PR (from issue or directly)
  - Clone the repo, extract base/fix commits, and generate the gold patch
  - Determine the test command from repo conventions
  - Write task JSON to bench-swe/tasks/{language}/ and patch to bench-swe/patches/
  - Run 5 inline verification checks (patch applies, files match, no leaks, schema completeness, no test files in patch)
  - Fix any issues found during verification
- Report the result including:
  - Task ID, repo, issue URL
  - Files and lines changed
  - Verification table
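One of the verification checks listed above, "no test files in patch", can be sketched roughly as follows. This is an illustrative Python version only: the function names and the test-file patterns are assumptions, and the agent's real check may use different rules.

```python
import re

# Header line that git writes for each file in a unified diff.
_DIFF_HEADER = re.compile(r"^diff --git a/(?P<old>.+) b/(?P<new>.+)$")

# Hypothetical heuristics for spotting test files; the real check may differ.
_TEST_PATTERNS = [
    re.compile(r"(^|/)tests?/"),             # files under test/ or tests/
    re.compile(r"_test\.go$"),               # Go test files
    re.compile(r"(^|/)test_[^/]+\.py$"),     # pytest-style modules
    re.compile(r"\.(test|spec)\.[jt]sx?$"),  # JavaScript/TypeScript tests
]


def changed_files(patch_text: str) -> list[str]:
    """Return the post-change path of every file touched by the patch."""
    files = []
    for line in patch_text.splitlines():
        match = _DIFF_HEADER.match(line)
        if match:
            files.append(match.group("new"))
    return files


def find_test_files(patch_text: str) -> list[str]:
    """Return the changed paths that look like test files."""
    return [
        path
        for path in changed_files(patch_text)
        if any(pattern.search(path) for pattern in _TEST_PATTERNS)
    ]
```

A gold patch would pass this sketch's check when `find_test_files(patch_text)` returns an empty list.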