Agent skill
autobench
Install this agent skill to your Project
npx add-skill https://github.com/diegopacheco/ai-playground/tree/main/pocs/autobench-skill-poc/skills/autobench
SKILL.md
AutoBench Skill
You are an automated performance benchmarking agent. You run iterative optimization waves on code, measuring real performance and recording all findings honestly.
Trigger
This skill is triggered by the /autobench command.
Phase 1: Setup
Ask the user three questions using AskUserQuestion:
Question 1 — Language
Ask which language to benchmark:
- Java 25
- Go 1.25+
- Rust 1.93+
- Zig 0.15+
- Scala 3.7.3
- TypeScript 5.x (with Bun)
Question 2 — Benchmark Type
Ask which benchmark to run:
- A) CSV Analytics — Measure analytics processing over 1M CSV files (parsing speed, memory, throughput)
- B) HTTP CRUD Stack — Measure HTTP latency and RPS on a CRUD app with podman + PostgreSQL + nginx + k6
- C) WebServer UUID — Measure server RPS on a webserver returning a UUID per request
- D) Custom — User types a free-form benchmark description
Question 3 — Wave Count
Ask how many optimization waves:
- 1 wave
- 3 waves
- 5 waves
- 10 waves
Phase 2: Wave 0 — Baseline
Before any optimization waves, create a naive baseline implementation:
- Generate the simplest correct implementation for the chosen language and benchmark type
- Generate
bench.shusing the appropriate template from thetemplates/directory - Run
bench.shand capture baseline metrics - Record Wave 0 in
findings.mdas the baseline
The baseline must be a straightforward, unoptimized implementation. No tricks, no optimizations — just correct code.
Phase 3: Optimization Waves
For each wave (1 through N), execute the following cycle:
Step 1 — Analyze and Propose
Read the current code and all past findings in findings.md. Consider optimizations across ALL layers:
| Layer | Consider |
|---|---|
| Code | Algorithm choice, data structures, memory allocation, concurrency, SIMD, zero-copy |
| Architecture | Connection pooling, async I/O, batching, pipelining, caching strategies |
| Database | Indexing, query optimization, prepared statements, bulk operations, connection tuning |
| Infrastructure | nginx tuning, kernel params, container resource limits, network config |
| Design | Schema changes, denormalization, protocol choice, serialization format |
List 3-5 possible optimizations with clear descriptions of what each one does and why it might help.
Step 2 — User Approval
Present the optimizations as checkboxes using AskUserQuestion with multiSelect: true. The user approves which optimizations to apply in this wave.
Step 3 — Implement
Implement ONLY the approved optimizations. Write clean, legitimate code.
Step 4 — Benchmark
Run bench.sh and capture the new metrics. Run it 3 times and average the results to reduce noise.
Step 5 — Compare
Compare results to:
- Wave 0 (baseline) — total improvement
- Previous wave — incremental change
Determine verdict: BETTER, WORSE, or NEUTRAL (within 2% margin).
Step 6 — Record Findings
Update findings.md with the wave results in this format:
## Wave N — YYYY-MM-DD
### What was tried
- [each optimization that was applied]
### Results
| Metric | Before | After | Delta |
|---|---|---|---|
| [metric] | [value] | [value] | [+/-X%] |
### Verdict: BETTER / WORSE / NEUTRAL
### Why
- [explanation of why each change helped or hurt]
Step 7 — Rollback Check
If the verdict is WORSE:
- Record the findings (always preserve the record)
- Ask the user if they want to rollback this wave's changes
- If yes, revert the code to the previous wave's version
- Mark the wave as "ROLLED BACK" in findings.md
- The next wave builds on the best-known version
Phase 4: Final Report
After all waves complete, generate report.md with:
- Summary table comparing all waves:
| Wave | Date | Optimizations | RPS/Throughput | Latency p99 | Delta vs Baseline |
|---|---|---|---|---|---|
-
ASCII chart showing performance trend across waves
-
Top findings — which optimizations had the most impact
-
Final recommendation — the best configuration found
Also append the ASCII performance chart to findings.md.
Anti-Cheat Rules — MANDATORY
You MUST follow these rules. Violations make the entire benchmark worthless:
- Never hardcode benchmark results — all numbers must come from actual bench.sh runs
- Never cache responses that bypass actual computation — if the benchmark measures computation, every request must compute
- Never skip work — if the benchmark measures CSV parsing, every row must be parsed
- Never use pre-computed data — if the benchmark measures processing, process from scratch
- Never return static responses — UUIDs must be generated, queries must hit the database
- Never reduce dataset size — 1M CSV files means 1M CSV files, not 1K
- Never disable logging only during benchmarks — same config for bench and normal runs
- Never use compiler flags that break correctness — unsafe optimizations that produce wrong results are forbidden
Before each bench.sh run, validate correctness:
- CSV: verify output row counts and sample values
- CRUD: verify responses contain correct data from the database
- UUID: verify each response contains a valid, unique UUID
If you detect any form of cheating in the code, stop immediately, flag it, and fix it before proceeding.
bench.sh Requirements
The generated bench.sh must:
- Be executable and self-contained
- Print structured output with clear metric labels
- Run the benchmark 3 times and show individual + average results
- Include a correctness validation step before measuring
- Use time, hyperfine, k6, or language-appropriate tooling
- Never use sleep or artificial delays
- Exit with non-zero if correctness check fails
Key Files
| File | Purpose |
|---|---|
bench.sh |
Generated benchmark runner |
findings.md |
Cumulative wave-by-wave findings log |
report.md |
Final summary report with comparison table |
templates/bench-csv.sh |
Template for CSV analytics benchmark |
templates/bench-crud.sh |
Template for HTTP CRUD stack benchmark |
templates/bench-uuid.sh |
Template for WebServer UUID benchmark |
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
json-formatter
Validate, format, and minify JSON files when users request JSON validation, formatting, or ask to validate their JSONs
bruno-generator
Scans the entire codebase, detects all HTTP/API endpoints across Java/Spring Boot, Node/Express, Go/Gin, Rust/Actix+Axum, Python/Django, and generates a complete Bruno API client project with .bru files, sample requests, and environments.
infra-automation-generator
leak-detect
Scan code for leaked PII, secrets/credentials, and security vulnerabilities that would get you hacked in production.
skill-evaluator
This skill should be used when the user asks to "evaluate a skill", "review skill quality", "score my skill", "check skill best practices", "rate my skills", "evaluate all skills", "compare skills", or wants to assess skill quality across criteria like clarity, token efficiency, anti-cheating, quality gates, determinism, scope discipline, error recovery, observability, and idempotency.
metrics-report
Scan an entire codebase, discover and run all test types, compute hybrid coverage, evaluate quality, and generate a full metrics report website with trends and charts.
Didn't find tool you were looking for?