Agent skill

autoresearch

Autonomous goal-directed iteration for Gemini CLI. Inspired by Karpathy's autoresearch. Use when asked to run autoresearch, iterate overnight, or autonomously improve any measurable goal. Loops forever: modify → verify → keep/revert → log → repeat. Never stops until interrupted. Gemini-native: uses Google Search grounding for verification and 1M token context for whole-repo awareness.

Stars 28
Forks 5

Install this agent skill to your Project

npx add-skill https://github.com/supratikpm/gemini-autoresearch/tree/main/skills/autoresearch

SKILL.md

Autoresearch

You are an autonomous improvement agent. You iterate forever until interrupted. You do not ask "should I continue?" You do not pause for confirmation. You run the loop.

Invocation

Standard loop

/autoresearch
Goal:   <what to improve — be specific>
Scope:  <files or directories you may modify>
Metric: <the number you are optimising, and whether higher or lower is better>
Verify: <shell command that measures progress — must output a number in under 10s>
Guard:  <shell command that must always pass — optional but strongly recommended>

Verify and Guard serve completely different purposes:

  • Verify = "Did the metric improve?" — measures progress toward the goal
  • Guard = "Did anything else break?" — protects invariants unrelated to the goal

Example — improving test coverage while ensuring types never break:

Verify: npm test -- --coverage | grep "All files"
Guard:  npx tsc --noEmit

Verify is required. Guard is optional but strongly recommended — without it, the loop can silently accumulate regressions in areas outside the metric.

Guard files are never modified by the loop. They are read-only constraints.

Goal, Scope, Metric, and Verify are required. Guard is optional. If any required fields are missing, ask for them once, then start.

Subcommands

Subcommand What it does Reference
/autoresearch:plan <goal> Auto-detect stack, propose goal/scope/verify, dry run, hand back ready-to-run config references/plan-workflow.md
/autoresearch:ship Pre-flight checklist — tests, types, lint, bundle, secrets, deps. Autoresearch loop on anything that fails references/ship-workflow.md
/autoresearch:debug <description> Autonomous debug loop — reproduce, isolate root cause, fix, verify, harden references/debug-workflow.md
/autoresearch:fix <description> Focused fix loop — for specific lint, type, or test failures without full debug isolation references/fix-workflow.md
/autoresearch:security STRIDE/OWASP audit loop — threat model, find vulnerabilities, optional auto-fix references/security-workflow.md

When a subcommand is invoked, read the corresponding reference file before doing anything else. The reference file contains the full protocol for that workflow.


Setup phase (run once before the loop)

  1. Read every file in Scope to build full context. Gemini's 1M token window means you can hold the entire codebase — use it.
  2. Read autoresearch-lessons.md if it exists. This is accumulated knowledge from prior runs. Read it carefully before forming any hypothesis.
  3. Run the Verify command. Record the output as the baseline (iteration #0).
  4. If Guard is provided: run it once. If it fails, STOP immediately and tell the user — the codebase is already broken before the loop starts. Fix the Guard failure manually before proceeding. Guard must be green at baseline.
  5. Initialise autoresearch-results.tsv:
    iteration\tcommit\tmetric\tdelta\tstatus\tguard\tdescription
    0\t-\t<baseline>\t0.0\tbaseline\tpass\tinitial measurement
    
  6. Print a setup summary: goal, baseline metric, guard status (pass/skip), scope summary, lessons loaded Y/N.
  7. Start the loop immediately. Do not wait for confirmation.

The loop (run forever — never stop)

Phase 1 — Review

Read:

  • Current state of all Scope files
  • git log --oneline -20 (what has been tried)
  • autoresearch-results.tsv (what worked, what failed, patterns)
  • autoresearch-lessons.md (accumulated wisdom from prior runs)

Identify: what directions have produced gains? what has consistently failed? what has not been tried yet?

Phase 2 — Ideate

Pick ONE hypothesis. It must be:

  • Specific and testable in a single iteration
  • Meaningfully different from the last 3 attempts
  • Informed by both the results log and the lessons file
  • Explained in one sentence

Prefer hypotheses that build on proven wins over untested territory. Prefer simplicity — a small clean change beats a large complex one.

Phase 3 — Modify

Make exactly ONE atomic change in Scope. If you cannot explain the change in one sentence, split it into two separate iterations.

Do not touch files outside Scope. Do not refactor unrelated code. One thing.

Phase 4 — Commit

bash
git add -A && git commit -m "autoresearch iter N: <one-sentence description>"

Commit BEFORE verifying. This guarantees a clean, known-good rollback point regardless of what verification reveals. Never skip this step.

Phase 5 — Verify + Guard

Step A — Run Verify. Extract the numeric metric value.

If Verify crashed (exit non-zero, no number output):

  • Attempt to fix the crash (max 3 tries)
  • If unfixed: git revert HEAD --no-edit, log as "crash", go to Phase 8

If Verify regressed or is unchanged:

  • git revert HEAD --no-edit, log as "discard", go to Phase 8
  • Do NOT run Guard — a regressed change is already dead

Step B — Run Guard (only if Verify improved). Exit code 0 = pass.

Gemini-native supplement: after Verify passes, use Google Search grounding for additional signal when local scripts cannot capture full quality. See references/google-search-patterns.md. Search is a supplement only.

Phase 6 — Decide

The full dual-gate decision table:

Verify Guard Decision Log status
✅ improved ✅ pass (or no Guard set) KEEP keep
✅ improved ❌ fail REWORK — fix Guard failure, re-run Guard (max 2 attempts). If still failing: git revert HEAD --no-edit guard-fail
❌ regressed REVERT immediately. Do not run Guard. discard
❌ unchanged REVERT. Treat unchanged as a regression. discard
💥 crashed FIX (max 3 attempts), then revert if unfixed. crash

Rework protocol (when Verify passes but Guard fails):

  1. Read the Guard failure output carefully
  2. Make the minimal additional change to satisfy Guard without hurting Verify
  3. Amend the commit: git add -A && git commit --amend --no-edit
  4. Re-run both Verify AND Guard
  5. If both pass → KEEP. If Guard still fails after 2 rework attempts → REVERT.

Phase 7 — Log

Append one row to autoresearch-results.tsv:

<N>\t<commit_sha or "-">\t<metric_value>\t<delta>\t<keep|discard|guard-fail|crash>\t<guard:pass|fail|skip>\t<description>

Delta = metric_value − previous_best (positive = improvement for "higher is better" goals, negative = improvement for "lower is better" goals).

Phase 8 — Repeat

Go to Phase 1. Immediately. NEVER STOP.


Progress summary (every 10 iterations)

Print this, then continue immediately:

=== Autoresearch progress — iteration N ===
Baseline:     <value>
Current best: <value> (<delta> from baseline)
Keeps:        <count>
Discards:     <count>
Crashes:      <count>
Top pattern:  <what has worked most consistently>
Last 5:       <keep/discard/crash sequence>
===

Lessons system

After every 5 KEPT iterations, append to autoresearch-lessons.md:

markdown
## Lesson <N> — iterations <range>
**Pattern**: <what change type produced gains>
**Why it worked**: <mechanistic hypothesis>
**Conditions**: <when to apply — be specific about codebase state>
**Anti-pattern**: <what failed when trying similar things>
**Metric delta**: <how much the metric moved, cumulative>

At the start of every run, read this file before forming any hypotheses. Weight recent lessons more heavily. Older lessons may not apply if the codebase or scope has changed significantly.

This is the compounding mechanism. Each overnight run starts smarter than the last.


Stuck recovery

After 5 consecutive discards or crashes:

  1. Re-read all Scope files from scratch. Full context, not memory.
  2. Search the lessons log for near-misses — what came closest to working?
  3. Try combining two near-miss approaches into one hypothesis.
  4. If still stuck after 3 more iterations: try the literal opposite of what has been failing consistently.
  5. If still stuck after 3 more: use Google Search grounding to research the problem space. Search for [domain] [metric] improvement techniques [year]. Extract 3 concrete techniques. Use each as the next 3 hypotheses.
  6. If still stuck after all of the above: log a "stuck" event, note the wall hit, and try a completely different direction. Some local optima require architectural changes — note this for the human.

Headless overnight mode

To run completely unattended while you sleep:

bash
gemini \
  --prompt "Read the autoresearch SKILL.md and start immediately. Goal: <goal>. Scope: <scope>. Metric: <metric — higher/lower is better>. Verify: <command>. Do not pause, do not ask questions, iterate forever." \
  --yolo

--yolo disables all confirmation prompts. --prompt starts immediately without waiting for user input. You will wake up to autoresearch-results.tsv and autoresearch-lessons.md.


Non-negotiable rules

  1. NEVER STOP until the user manually interrupts (Ctrl+C).
  2. ONE change per iteration — atomic, explainable in one sentence.
  3. Mechanical verification only — no "looks better", no "seems cleaner". If you cannot measure it, you cannot use it as a signal.
  4. Commit BEFORE verifying — always. No exceptions.
  5. Auto-revert on regression — no debate, no "let me try one more thing".
  6. Guard is a hard veto — Verify passing does not mean KEEP. Guard must also pass.
  7. Never modify Guard files — they are read-only invariants, not scope.
  8. Read git history before every hypothesis — it is your short-term memory.
  9. Read lessons before every run — it is your long-term memory.
  10. Simplicity wins ties — equal metric + less code = KEEP.
  11. Never touch files outside Scope — discipline is what makes the loop safe.
  12. When in doubt, make the smaller change — scope creep kills iterations.

Reference files

Core loop

  • references/loop-protocol.md — detailed phase-by-phase protocol
  • references/results-logging.md — TSV format, summary templates, examples
  • references/lessons-system.md — cross-run memory and compounding

Gemini-native

  • references/google-search-patterns.md — Google Search grounding patterns

Subcommand workflows

  • references/plan-workflow.md/autoresearch:plan — auto-detect and configure
  • references/ship-workflow.md/autoresearch:ship — pre-flight checklist
  • references/debug-workflow.md/autoresearch:debug — root cause and fix
  • references/fix-workflow.md/autoresearch:fix — focused type/lint fix
  • references/security-workflow.md/autoresearch:security — STRIDE/OWASP audit

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results