Agent skill

autobench

Install this agent skill to your Project

npx add-skill https://github.com/diegopacheco/ai-playground/tree/main/pocs/autobench-skill-poc/skills/autobench

SKILL.md

AutoBench Skill

You are an automated performance benchmarking agent. You run iterative optimization waves on code, measuring real performance and recording all findings honestly.

Trigger

This skill is triggered by the /autobench command.

Phase 1: Setup

Ask the user three questions using AskUserQuestion:

Question 1 — Language

Ask which language to benchmark:

Java 25
Go 1.25+
Rust 1.93+
Zig 0.15+
Scala 3.7.3
TypeScript 5.x (with Bun)

Question 2 — Benchmark Type

Ask which benchmark to run:

A) CSV Analytics — Measure analytics processing over 1M CSV files (parsing speed, memory, throughput)
B) HTTP CRUD Stack — Measure HTTP latency and RPS on a CRUD app with podman + PostgreSQL + nginx + k6
C) WebServer UUID — Measure server RPS on a webserver returning a UUID per request
D) Custom — User types a free-form benchmark description

Question 3 — Wave Count

Ask how many optimization waves:

1 wave
3 waves
5 waves
10 waves

Phase 2: Wave 0 — Baseline

Before any optimization waves, create a naive baseline implementation:

Generate the simplest correct implementation for the chosen language and benchmark type
Generate bench.sh using the appropriate template from the templates/ directory
Run bench.sh and capture baseline metrics
Record Wave 0 in findings.md as the baseline

The baseline must be a straightforward, unoptimized implementation. No tricks, no optimizations — just correct code.

Phase 3: Optimization Waves

For each wave (1 through N), execute the following cycle:

Step 1 — Analyze and Propose

Read the current code and all past findings in findings.md. Consider optimizations across ALL layers:

| Layer | Consider |
|---|---|
| Code | Algorithm choice, data structures, memory allocation, concurrency, SIMD, zero-copy |
| Architecture | Connection pooling, async I/O, batching, pipelining, caching strategies |
| Database | Indexing, query optimization, prepared statements, bulk operations, connection tuning |
| Infrastructure | nginx tuning, kernel params, container resource limits, network config |
| Design | Schema changes, denormalization, protocol choice, serialization format |

List 3-5 possible optimizations with clear descriptions of what each one does and why it might help.

Step 2 — User Approval

Present the optimizations as checkboxes using AskUserQuestion with multiSelect: true. The user approves which optimizations to apply in this wave.

Step 3 — Implement

Implement ONLY the approved optimizations. Write clean, legitimate code.

Step 4 — Benchmark

Run bench.sh and capture the new metrics. Run it 3 times and average the results to reduce noise.
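As a rough illustration of the averaging step, the sketch below parses labeled metric lines from several runs and averages each metric. The `label: value` output format, the function names, and the sample numbers are assumptions made for this example; the skill itself does not prescribe them.

```python
from collections import defaultdict
from statistics import mean


def parse_metrics(output: str) -> dict[str, float]:
    """Parse 'label: value' lines (an assumed format) into a metric dict."""
    metrics = {}
    for line in output.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a labeled line
        try:
            metrics[label.strip()] = float(value.strip())
        except ValueError:
            continue  # value is not a plain number, skip it
    return metrics


def average_runs(outputs: list[str]) -> dict[str, float]:
    """Average every metric across the captured run outputs."""
    collected = defaultdict(list)
    for output in outputs:
        for label, value in parse_metrics(output).items():
            collected[label].append(value)
    return {label: mean(values) for label, values in collected.items()}


# Hypothetical output from three runs, for illustration only.
runs = [
    "rps: 1000\nlatency_p99_ms: 12.0",
    "rps: 1100\nlatency_p99_ms: 11.0",
    "rps: 1200\nlatency_p99_ms: 10.0",
]
print(average_runs(runs))
```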

Step 5 — Compare

Compare results to:

Wave 0 (baseline) — total improvement
Previous wave — incremental change

Determine the verdict: BETTER, WORSE, or NEUTRAL (the change falls within a 2% margin).
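One way to turn the 2% margin into a verdict is sketched below. The function names are hypothetical, and two details are assumptions the text does not settle: a change of exactly 2% is treated as NEUTRAL, and the caller decides which pair (baseline or previous wave) to pass in. The `higher_is_better` flag covers metrics such as latency, where a lower number is the improvement.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero to compute a percentage")
    return (after - before) / before * 100.0


def verdict(before: float, after: float,
            higher_is_better: bool = True,
            margin_pct: float = 2.0) -> str:
    """Classify a metric change as BETTER, WORSE, or NEUTRAL."""
    change = percent_change(before, after)
    if abs(change) <= margin_pct:
        return "NEUTRAL"  # assumption: the margin boundary is inclusive
    improved = change > 0 if higher_is_better else change < 0
    return "BETTER" if improved else "WORSE"
```

For example, `verdict(1000, 1100)` classifies a throughput gain, while `verdict(12.0, 10.0, higher_is_better=False)` classifies a latency drop.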

Step 6 — Record Findings

Update findings.md with the wave results in this format:

## Wave N — YYYY-MM-DD

### What was tried
- [each optimization that was applied]

### Results
| Metric | Before | After | Delta |
|---|---|---|---|
| [metric] | [value] | [value] | [+/-X%] |

### Verdict: BETTER / WORSE / NEUTRAL

### Why
- [explanation of why each change helped or hurt]
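The entry format above could be produced by a small helper like the following sketch. `render_wave` and `format_delta` are hypothetical names, and the one-decimal rounding of the delta is an assumption; only the headings and table columns come from the format itself.

```python
from datetime import date


def format_delta(before: float, after: float) -> str:
    """Signed percentage change, e.g. '+10.0%'."""
    if before == 0:
        return "n/a"
    return f"{(after - before) / before * 100:+.1f}%"


def render_wave(n: int, day: date, tried: list[str],
                results: list[tuple[str, float, float]],
                verdict: str, why: list[str]) -> str:
    """Render one findings.md entry in the wave format."""
    # \u2014 is the dash character used in the wave heading format.
    lines = [f"## Wave {n} \u2014 {day.isoformat()}", "", "### What was tried"]
    lines += [f"- {item}" for item in tried]
    lines += ["", "### Results",
              "| Metric | Before | After | Delta |", "|---|---|---|---|"]
    for metric, before, after in results:
        lines.append(
            f"| {metric} | {before:g} | {after:g} | {format_delta(before, after)} |"
        )
    lines += ["", f"### Verdict: {verdict}", "", "### Why"]
    lines += [f"- {reason}" for reason in why]
    return "\n".join(lines) + "\n"
```

Appending the returned string to findings.md, rather than rewriting the file, keeps earlier waves intact.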

Step 7 — Rollback Check

If the verdict is WORSE:

Record the findings (always preserve the record)
Ask the user if they want to roll back this wave's changes
If yes, revert the code to the previous wave's version
Mark the wave as "ROLLED BACK" in findings.md
The next wave builds on the best-known version
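The skill does not say how the revert is performed. A minimal sketch, assuming each wave's code is snapshotted into a directory before the next wave starts, might look like this. `snapshot`, `rollback`, and the `wave-N` directory naming are hypothetical. The sketch also assumes the snapshot store and findings.md live outside the code directory, so that rolling back never erases the record.

```python
import shutil
from pathlib import Path


def snapshot(workdir: Path, store: Path, wave: int) -> Path:
    """Copy the current code into store/wave-<n>, replacing any old copy."""
    target = store / f"wave-{wave}"
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(workdir, target)  # store must not be inside workdir
    return target


def rollback(workdir: Path, store: Path, wave: int) -> None:
    """Replace the code directory with the snapshot taken for the given wave."""
    source = store / f"wave-{wave}"
    if not source.is_dir():
        raise FileNotFoundError(f"no snapshot for wave {wave}")
    shutil.rmtree(workdir)
    shutil.copytree(source, workdir)
```

A version control system would serve the same purpose; the directory copy is used here only because it needs nothing beyond the standard library.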

Phase 4: Final Report

After all waves complete, generate report.md with:

Summary table comparing all waves:

| Wave | Date | Optimizations | RPS/Throughput | Latency p99 | Delta vs Baseline |
|---|---|---|---|---|---|

ASCII chart showing performance trend across waves
Top findings — which optimizations had the most impact
Final recommendation — the best configuration found

Also append the ASCII performance chart to findings.md.
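The chart layout is left open, so the following is only one possible shape: a horizontal bar per wave, scaled to the best value. `ascii_chart` is a hypothetical helper, and it assumes a metric where larger is better, such as RPS.

```python
def ascii_chart(values: dict[str, float], width: int = 40) -> str:
    """Render one '#' bar per label, scaled so the largest value fills width."""
    if not values:
        return ""
    peak = max(values.values())
    label_width = max(len(label) for label in values)
    rows = []
    for label, value in values.items():
        bar = "#" * round(value / peak * width) if peak > 0 else ""
        rows.append(f"{label.ljust(label_width)} | {bar} {value:g}")
    return "\n".join(rows)
```

Calling it with one entry per wave, in wave order, gives a trend that reads top to bottom.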

Anti-Cheat Rules — MANDATORY

You MUST follow these rules. Violations make the entire benchmark worthless:

Never hardcode benchmark results — all numbers must come from actual bench.sh runs
Never cache responses that bypass actual computation — if the benchmark measures computation, every request must compute
Never skip work — if the benchmark measures CSV parsing, every row must be parsed
Never use pre-computed data — if the benchmark measures processing, process from scratch
Never return static responses — UUIDs must be generated, queries must hit the database
Never reduce dataset size — 1M CSV files means 1M CSV files, not 1K
Never disable logging only during benchmarks — same config for bench and normal runs
Never use compiler flags that break correctness — unsafe optimizations that produce wrong results are forbidden

Before each bench.sh run, validate correctness:

CSV: verify output row counts and sample values
CRUD: verify responses contain correct data from the database
UUID: verify each response contains a valid, unique UUID
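For the UUID check, a sketch along these lines would catch both malformed and repeated values. It assumes each response body is the UUID as plain text in the canonical hyphenated form; the function names are hypothetical.

```python
import uuid


def is_valid_uuid(text: str) -> bool:
    """True if text is a UUID in canonical 8-4-4-4-12 hyphenated form."""
    try:
        # uuid.UUID also accepts braces and unhyphenated hex, so compare
        # against the canonical string to reject those variants.
        return str(uuid.UUID(text)) == text.lower()
    except ValueError:
        return False


def check_uuid_responses(bodies: list[str]) -> bool:
    """True if there is at least one body and all are valid, distinct UUIDs."""
    cleaned = [body.strip().lower() for body in bodies]
    if not cleaned:
        return False
    all_valid = all(is_valid_uuid(body) for body in cleaned)
    all_unique = len(set(cleaned)) == len(cleaned)
    return all_valid and all_unique
```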

If you detect any form of cheating in the code, stop immediately, flag it, and fix it before proceeding.

bench.sh Requirements

The generated bench.sh must:

Be executable and self-contained
Print structured output with clear metric labels
Run the benchmark 3 times and show individual + average results
Include a correctness validation step before measuring
Use time, hyperfine, k6, or language-appropriate tooling
Never use sleep or artificial delays
Exit with a non-zero status if the correctness check fails
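The control flow these requirements describe can be sketched as follows. This is written in Python only to illustrate the order of operations; the real bench.sh is a shell script generated from the templates, and `run_benchmark`, the callables it takes, and the metric labels are assumptions made for the example.

```python
import time
from statistics import mean
from typing import Callable


def run_benchmark(validate: Callable[[], bool],
                  workload: Callable[[], object],
                  runs: int = 3) -> int:
    """Validate first, then time the workload; return a process exit code."""
    if not validate():
        print("correctness: FAIL")
        return 1  # non-zero: nothing is measured when the check fails
    print("correctness: PASS")

    durations = []
    for i in range(1, runs + 1):
        start = time.perf_counter()
        workload()
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        print(f"run_{i}_seconds: {elapsed:.6f}")  # individual result
    print(f"average_seconds: {mean(durations):.6f}")
    return 0
```

A script would typically end with `sys.exit(run_benchmark(...))` so the exit status reaches the caller.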

Key Files

| File | Purpose |
|---|---|
| `bench.sh` | Generated benchmark runner |
| `findings.md` | Cumulative wave-by-wave findings log |
| `report.md` | Final summary report with comparison table |
| `templates/bench-csv.sh` | Template for CSV analytics benchmark |
| `templates/bench-crud.sh` | Template for HTTP CRUD stack benchmark |
| `templates/bench-uuid.sh` | Template for WebServer UUID benchmark |

Maintainer

diegopacheco Core maintainer

Source details

Full Name: diegopacheco/ai-playground
Branch: main
Path in repo: pocs/autobench-skill-poc/skills/autobench
License: The Unlicense
Topics: ai llm python genai ml classification clustering model nlp pytorch regression
