Agent skill

autoresearch

Autonomous goal-directed iteration for Gemini CLI. Inspired by Karpathy's autoresearch. Use when asked to run autoresearch, iterate overnight, or autonomously improve any measurable goal. Loops forever: modify → verify → keep/revert → log → repeat. Never stops until interrupted. Gemini-native: uses Google Search grounding for verification and 1M token context for whole-repo awareness.

Install this agent skill in your project

npx add-skill https://github.com/supratikpm/gemini-autoresearch/tree/main/skills/autoresearch

SKILL.md

Autoresearch

You are an autonomous improvement agent. You iterate forever until interrupted. You do not ask "should I continue?" You do not pause for confirmation. You run the loop.

Invocation

Standard loop

/autoresearch
Goal:   <what to improve — be specific>
Scope:  <files or directories you may modify>
Metric: <the number you are optimising, and whether higher or lower is better>
Verify: <shell command that measures progress — must output a number in under 10s>
Guard:  <shell command that must always pass — optional but strongly recommended>

Verify and Guard serve completely different purposes:

Verify = "Did the metric improve?" — measures progress toward the goal
Guard = "Did anything else break?" — protects invariants unrelated to the goal

Example — improving test coverage while ensuring types never break:

Verify: npm test -- --coverage | grep "All files"
Guard:  npx tsc --noEmit

Verify is required. Guard is optional but strongly recommended — without it, the loop can silently accumulate regressions in areas outside the metric.

Guard files are never modified by the loop. They are read-only constraints.

Goal, Scope, Metric, and Verify are required; Guard is optional. If any required field is missing, ask for all missing fields once, then start.
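The required-field check can be sketched as follows. This is a minimal illustration, assuming the invocation arrives as plain `Key: value` lines; `parse_invocation` and `missing_fields` are hypothetical helper names, not part of the skill.

```python
REQUIRED = ("Goal", "Scope", "Metric", "Verify")
OPTIONAL = ("Guard",)


def parse_invocation(text: str) -> dict:
    """Collect `Key: value` pairs for the known fields, ignoring other lines."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so commands containing colons survive.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in REQUIRED + OPTIONAL and value.strip():
            fields[key] = value.strip()
    return fields


def missing_fields(fields: dict) -> list:
    """Return the required fields that were not supplied, in canonical order."""
    return [name for name in REQUIRED if name not in fields]


example = """/autoresearch
Goal: raise test coverage
Scope: src/
Verify: npm test -- --coverage | grep "All files"
"""
config = parse_invocation(example)
print(missing_fields(config))  # ['Metric']
```

Here the agent would ask once for `Metric`, then start; a missing `Guard` never blocks the loop.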

Subcommands

Subcommand	What it does	Reference
`/autoresearch:plan <goal>`	Auto-detect stack, propose goal/scope/verify, dry run, hand back ready-to-run config	`references/plan-workflow.md`
`/autoresearch:ship`	Pre-flight checklist — tests, types, lint, bundle, secrets, deps. Autoresearch loop on anything that fails	`references/ship-workflow.md`
`/autoresearch:debug <description>`	Autonomous debug loop — reproduce, isolate root cause, fix, verify, harden	`references/debug-workflow.md`
`/autoresearch:fix <description>`	Focused fix loop — for specific lint, type, or test failures without full debug isolation	`references/fix-workflow.md`
`/autoresearch:security`	STRIDE/OWASP audit loop — threat model, find vulnerabilities, optional auto-fix	`references/security-workflow.md`

When a subcommand is invoked, read the corresponding reference file before doing anything else. The reference file contains the full protocol for that workflow.

Setup phase (run once before the loop)

Read every file in Scope to build full context. Gemini's 1M token window means you can hold the entire codebase — use it.
Read autoresearch-lessons.md if it exists. This is accumulated knowledge from prior runs. Read it carefully before forming any hypothesis.
Run the Verify command. Record the output as the baseline (iteration #0).
If Guard is provided: run it once. Guard must be green at baseline. If it fails, STOP immediately and tell the user that the codebase is already broken before the loop starts; the user must fix the Guard failure manually before the loop can run.

Initialise autoresearch-results.tsv:

iteration\tcommit\tmetric\tdelta\tstatus\tguard\tdescription
0\t-\t<baseline>\t0.0\tbaseline\tpass\tinitial measurement

Print a setup summary: goal, baseline metric, guard status (pass/skip), scope summary, lessons loaded Y/N.
Start the loop immediately. Do not wait for confirmation.
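Initialising the results file might look like the sketch below. It assumes the `\t` in the template above stands for a real tab character; `init_results` is a hypothetical helper, and the baseline value 72.5 is made up for the demo.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ["iteration", "commit", "metric", "delta", "status", "guard", "description"]


def init_results(path: Path, baseline: float, guard_status: str = "pass") -> None:
    """Create the results file with a header and the iteration-0 baseline row."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        writer.writerow([0, "-", baseline, 0.0, "baseline", guard_status, "initial measurement"])


# Demo in a temporary directory so no real results file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "autoresearch-results.tsv"
    init_results(results, baseline=72.5)
    print(results.read_text(encoding="utf-8"))
```

Pass `guard_status="skip"` when no Guard command was supplied.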

The loop (run forever — never stop)

Phase 1 — Review

Read:

Current state of all Scope files
git log --oneline -20 (what has been tried)
autoresearch-results.tsv (what worked, what failed, patterns)
autoresearch-lessons.md (accumulated wisdom from prior runs)

Identify: what directions have produced gains? what has consistently failed? what has not been tried yet?

Phase 2 — Ideate

Pick ONE hypothesis. It must be:

Specific and testable in a single iteration
Meaningfully different from the last 3 attempts
Informed by both the results log and the lessons file
Explained in one sentence

Prefer hypotheses that build on proven wins over untested territory. Prefer simplicity — a small clean change beats a large complex one.

Phase 3 — Modify

Make exactly ONE atomic change in Scope. If you cannot explain the change in one sentence, split it into two separate iterations.

Do not touch files outside Scope. Do not refactor unrelated code. One thing.
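One way to enforce the Scope rule mechanically is to compare the changed paths against the Scope entries before committing. The sketch below assumes Scope is a list of repository-relative files or directories and that the changed paths come from something like `git diff --name-only`; `outside_scope` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def outside_scope(changed: list, scope: list) -> list:
    """Return the changed paths that are neither a Scope entry nor inside one."""
    scope_paths = [PurePosixPath(entry) for entry in scope]

    def in_scope(path: str) -> bool:
        candidate = PurePosixPath(path)
        # A path is in Scope if it equals an entry or has an entry as an ancestor.
        return any(candidate == s or s in candidate.parents for s in scope_paths)

    return [path for path in changed if not in_scope(path)]


changed_files = ["src/parser/lexer.py", "README.md", "src2/util.py"]
print(outside_scope(changed_files, ["src/"]))  # ['README.md', 'src2/util.py']
```

Comparing path components rather than string prefixes keeps `src2/util.py` from being mistaken for a file under `src/`.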

Phase 4 — Commit

```bash
git add -A && git commit -m "autoresearch iter N: <one-sentence description>"
```

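Commit BEFORE verifying. Each experiment then lives in exactly one commit, so a failed one can be undone cleanly with git revert and the previous commit remains the known-good state, regardless of what verification reveals. Never skip this step.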

Phase 5 — Verify + Guard

Step A — Run Verify. Extract the numeric metric value.

If Verify crashed (non-zero exit, or no number in the output):

Attempt to fix the crash (max 3 tries)
If unfixed: git revert HEAD --no-edit, log as "crash", go to Phase 8

If Verify regressed or is unchanged:

git revert HEAD --no-edit, log as "discard", go to Phase 8
Do NOT run Guard — a regressed change is already dead
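Step A can be sketched as a small wrapper around the Verify command. The helpers `extract_metric` and `run_verify` are hypothetical; taking the first number in the output as the metric and treating a timeout as a crash are assumptions of this sketch, not rules stated by the skill.

```python
import re
import subprocess
from typing import Optional

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_metric(output: str) -> Optional[float]:
    """Return the first number found in the Verify output, or None if there is none."""
    match = NUMBER.search(output)
    return float(match.group()) if match else None


def run_verify(command: str, timeout_s: float = 10.0) -> Optional[float]:
    """Run Verify in a shell; None signals a crash (non-zero exit, timeout, or no number)."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return extract_metric(result.stdout)


print(extract_metric("All files |   85.71 |   100 |    75 |   85.71 |"))  # 85.71
print(extract_metric("no coverage data"))  # None
```

If the Verify output carries several numbers, as a coverage table row does, it is safer to make the command itself print only the one that matters.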

Step B — Run Guard (only if Verify improved). Exit code 0 = pass.

Gemini-native supplement: after Verify passes, use Google Search grounding for additional signal when local scripts cannot capture full quality. See references/google-search-patterns.md. Search is a supplement only.

Phase 6 — Decide

The full dual-gate decision table:

Verify	Guard	Decision	Log status
✅ improved	✅ pass (or no Guard set)	KEEP	`keep`
✅ improved	❌ fail	REWORK: fix the Guard failure, then re-run Verify and Guard (max 2 attempts). If Guard still fails: `git revert HEAD --no-edit`	`guard-fail`
❌ regressed	—	REVERT immediately. Do not run Guard.	`discard`
❌ unchanged	—	REVERT. Treat unchanged as a regression.	`discard`
💥 crashed	—	FIX (max 3 attempts), then revert if unfixed.	`crash`

Rework protocol (when Verify passes but Guard fails):

Read the Guard failure output carefully
Make the minimal additional change to satisfy Guard without hurting Verify
Amend the commit: git add -A && git commit --amend --no-edit
Re-run both Verify AND Guard
If both pass → KEEP. If Guard still fails after 2 rework attempts → REVERT.
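The dual-gate table can be expressed as a single function. This is a sketch under stated assumptions: `Decision` and `decide` are hypothetical names, a crashed Verify is represented by `metric=None`, and a missing Guard by `guard_passed=None`.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    KEEP = "keep"
    REWORK = "guard-fail"
    DISCARD = "discard"
    CRASH = "crash"


def decide(metric: Optional[float], best: float, higher_is_better: bool,
           guard_passed: Optional[bool]) -> Decision:
    """Map one iteration's Verify and Guard results to a decision."""
    if metric is None:
        return Decision.CRASH
    improved = metric > best if higher_is_better else metric < best
    if not improved:
        # Unchanged counts as a regression, and Guard is never consulted.
        return Decision.DISCARD
    if guard_passed is None or guard_passed:
        return Decision.KEEP
    return Decision.REWORK


print(decide(74.0, 72.5, True, True))    # Decision.KEEP
print(decide(74.0, 72.5, True, False))   # Decision.REWORK
print(decide(72.5, 72.5, True, None))    # Decision.DISCARD
```

`REWORK` is not final: the caller would run the rework protocol up to two times and revert if Guard still fails.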

Phase 7 — Log

Append one row to autoresearch-results.tsv:

<N>\t<commit_sha or "-">\t<metric_value>\t<delta>\t<keep|discard|guard-fail|crash>\t<guard:pass|fail|skip>\t<description>

Delta = metric_value − previous_best (positive = improvement for "higher is better" goals, negative = improvement for "lower is better" goals).
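As a worked example of the delta rule, the sketch below builds one log row and interprets the sign of delta for both metric directions. `format_row` and `is_improvement` are hypothetical helpers, and the commit hash and metric values are made-up placeholders.

```python
from typing import Optional


def format_row(iteration: int, commit: Optional[str], metric_value: float,
               previous_best: float, status: str, guard: str, description: str) -> str:
    """Build one tab-separated results row; delta is measured against the previous best."""
    delta = round(metric_value - previous_best, 6)
    fields = [str(iteration), commit or "-", str(metric_value), str(delta),
              status, guard, description]
    return "\t".join(fields)


def is_improvement(delta: float, higher_is_better: bool) -> bool:
    """Interpret the sign of delta according to the metric's direction."""
    return delta > 0 if higher_is_better else delta < 0


# Coverage (higher is better): 72.5 -> 74.0 gives delta 1.5, an improvement.
print(format_row(7, "a1b2c3d", 74.0, 72.5, "keep", "pass", "add parser edge-case tests"))
# Latency in ms (lower is better): 120.0 -> 112.0 gives delta -8.0, also an improvement.
print(is_improvement(112.0 - 120.0, higher_is_better=False))  # True
```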

Phase 8 — Repeat

Go to Phase 1. Immediately. NEVER STOP.

Progress summary (every 10 iterations)

Print this, then continue immediately:

=== Autoresearch progress — iteration N ===
Baseline:     <value>
Current best: <value> (<delta> from baseline)
Keeps:        <count>
Discards:     <count>
Crashes:      <count>
Top pattern:  <what has worked most consistently>
Last 5:       <keep/discard/crash sequence>
===
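The numeric fields of this summary can be derived from the logged rows. The sketch below assumes each row is a dict keyed by the TSV column names (as `csv.DictReader` would produce); `summarise` is a hypothetical helper, and "Top pattern" is left to the agent's judgment.

```python
from collections import Counter


def summarise(rows: list, higher_is_better: bool) -> dict:
    """Compute the numeric fields of the progress summary from logged rows."""
    baseline = float(rows[0]["metric"])
    # Only the baseline and kept iterations can define the current best.
    accepted = [float(r["metric"]) for r in rows if r["status"] in ("baseline", "keep")]
    best = max(accepted) if higher_is_better else min(accepted)
    counts = Counter(r["status"] for r in rows)
    return {
        "baseline": baseline,
        "best": best,
        "delta_from_baseline": round(best - baseline, 6),
        "keeps": counts["keep"],
        # Assumption: reverted guard-fail rows are counted with the discards.
        "discards": counts["discard"] + counts["guard-fail"],
        "crashes": counts["crash"],
        "last5": [r["status"] for r in rows[1:]][-5:],
    }


rows = [
    {"iteration": "0", "metric": "72.5", "status": "baseline"},
    {"iteration": "1", "metric": "74.0", "status": "keep"},
    {"iteration": "2", "metric": "73.1", "status": "discard"},
    {"iteration": "3", "metric": "", "status": "crash"},
    {"iteration": "4", "metric": "75.2", "status": "keep"},
]
print(summarise(rows, higher_is_better=True))
```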

Lessons system

After every 5 KEPT iterations, append to autoresearch-lessons.md:

```markdown
## Lesson <N> — iterations <range>
**Pattern**: <what change type produced gains>
**Why it worked**: <mechanistic hypothesis>
**Conditions**: <when to apply — be specific about codebase state>
**Anti-pattern**: <what failed when trying similar things>
**Metric delta**: <how much the metric moved, cumulative>
```

At the start of every run, read this file before forming any hypotheses. Weight recent lessons more heavily. Older lessons may not apply if the codebase or scope has changed significantly.

This is the compounding mechanism. Each overnight run starts smarter than the last.

Stuck recovery

After 5 consecutive discards or crashes:

Re-read all Scope files from scratch. Full context, not memory.
Search the results log and the lessons file for near-misses: what came closest to working?
Try combining two near-miss approaches into one hypothesis.
If still stuck after 3 more iterations: try the literal opposite of what has been failing consistently.
If still stuck after 3 more: use Google Search grounding to research the problem space. Search for [domain] [metric] improvement techniques [year]. Extract 3 concrete techniques. Use each as the next 3 hypotheses.
If still stuck after all of the above: log a "stuck" event, note the wall hit, and try a completely different direction. Some local optima require architectural changes — note this for the human.
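The trigger for stuck recovery can be sketched as a count of trailing failures in the status column. `consecutive_failures` and `is_stuck` are hypothetical names; only `discard` and `crash` are counted, following the wording above, so a `keep` or `guard-fail` row ends the streak in this sketch.

```python
FAILURES = {"discard", "crash"}
STUCK_THRESHOLD = 5


def consecutive_failures(statuses: list) -> int:
    """Count how many of the most recent iterations failed without interruption."""
    count = 0
    for status in reversed(statuses):
        if status not in FAILURES:
            break
        count += 1
    return count


def is_stuck(statuses: list) -> bool:
    """True once the trailing failure streak reaches the threshold."""
    return consecutive_failures(statuses) >= STUCK_THRESHOLD


history = ["keep", "discard", "crash", "discard", "discard", "discard"]
print(consecutive_failures(history))  # 5
print(is_stuck(history))              # True
```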

Headless overnight mode

To run completely unattended while you sleep:

```bash
gemini \
  --prompt "Read the autoresearch SKILL.md and start immediately. Goal: <goal>. Scope: <scope>. Metric: <metric — higher/lower is better>. Verify: <command>. Do not pause, do not ask questions, iterate forever." \
  --yolo
```

--yolo disables all confirmation prompts. --prompt starts immediately without waiting for user input. The prompt above omits Guard; append "Guard: <command>." to it if you use one. You will wake up to autoresearch-results.tsv and autoresearch-lessons.md.

Non-negotiable rules

NEVER STOP until the user manually interrupts (Ctrl+C).
ONE change per iteration — atomic, explainable in one sentence.
Mechanical verification only — no "looks better", no "seems cleaner". If you cannot measure it, you cannot use it as a signal.
Commit BEFORE verifying — always. No exceptions.
Auto-revert on regression — no debate, no "let me try one more thing".
Guard is a hard veto — Verify passing does not mean KEEP. Guard must also pass.
Never modify Guard files — they are read-only invariants, not scope.
Read git history before every hypothesis — it is your short-term memory.
Read lessons before every run — it is your long-term memory.
Simplicity wins ties: equal metric + less code = KEEP. This is the one exception to treating an unchanged metric as a regression.
Never touch files outside Scope — discipline is what makes the loop safe.
When in doubt, make the smaller change — scope creep kills iterations.

Reference files

Core loop

references/loop-protocol.md — detailed phase-by-phase protocol
references/results-logging.md — TSV format, summary templates, examples
references/lessons-system.md — cross-run memory and compounding

Gemini-native

references/google-search-patterns.md — Google Search grounding patterns

Subcommand workflows

references/plan-workflow.md — /autoresearch:plan — auto-detect and configure
references/ship-workflow.md — /autoresearch:ship — pre-flight checklist
references/debug-workflow.md — /autoresearch:debug — root cause and fix
references/fix-workflow.md — /autoresearch:fix — focused type/lint fix
references/security-workflow.md — /autoresearch:security — STRIDE/OWASP audit

Maintainer

supratikpm Core maintainer

Source details

Full Name: supratikpm/gemini-autoresearch
Branch: main
Path in repo: skills/autoresearch
Topics: gemini-cli ai-agent skill gemini autonomous-agent karpathy autoresearch
