Agent skill

spec-verify

Spec verification phase - tests, execution, rules audit, code review

View SKILL.md on GitHub Repository

Stars 1,637

Forks 137

Install this agent skill to your Project

npx add-skill https://github.com/maxritter/pilot-shell/tree/main/pilot/skills/spec-verify

SKILL.md

/spec-verify - Verification Phase

Phase 3 of the /spec workflow (features). Runs comprehensive verification: automated checks, code review, program execution, and E2E tests. For bugfix plans, use spec-bugfix-verify instead.

Input: Plan file with Status: COMPLETE Output: Plan status → VERIFIED (success) or loop back to implementation (failure)

⛔ KEY CONSTRAINTS

Run code review when enabled — Step 3.1 launches changes-review via Task(subagent_type="pilot:changes-review") when PILOT_CHANGES_REVIEW_ENABLED is not "false" (read in Step 0). To disable, use Console Settings → Reviewers → Changes Review toggle.
Only changes-review — NEVER spec-review — Do NOT launch spec-review during verification. Do NOT read or reference findings-spec-review-*.json files — they are stale artifacts from the planning phase that were already addressed during implementation. If you encounter a spec-review findings file, ignore it completely.
NO stopping — Everything automatic. Never ask "Should I fix these?"
Fix ALL findings — must_fix AND should_fix. No permission needed.
Code changes finish BEFORE runtime testing — Phase A then Phase B.
Plan file is source of truth — re-read it after auto-compaction, don't rely on conversation memory.
Re-verification after fixes is MANDATORY — fixes can introduce new bugs.
Quality over speed — never rush due to context pressure.

Step 0: Read Toggle Configuration

⛔ Run FIRST, before any other step. Read the reviewer toggle env vars:

bash

echo "REVIEWER=$PILOT_CHANGES_REVIEW_ENABLED CODEX_CHG=$PILOT_CODEX_CHANGES_REVIEW_ENABLED"

Codex reviewers are controlled entirely by Console Settings — the env vars are authoritative.

Reference these values in Steps 3.1, 3.4, and 3.5.

The Process

Phase A — Finalize the code:
  Launch Reviewer → Automated Checks (tests + lint + verify commands) → Feature Parity (if migration) → Collect Review Results → Fix

Phase B — Verify the running program (depth depends on runtime profile):
  Build → Program Execution → Per-Task DoD Audit → E2E

Final:
  Regression check → Worktree sync → Post-merge verification → Update status

Step 3.0: Classify Runtime Profile

Determine verification depth based on what changed:

Profile	Criteria	Phase B Scope
Minimal	No server, no UI, no built artifacts (libraries, CLI tools, hooks, scripts)	Build check only
API	Server/API but no frontend changes	Build + program execution + DoD audit. Skip E2E.
Full	Frontend/UI changes or complex deployment	All Phase B steps

Read the plan's Runtime Environment section (if present) and the changed file types to classify.

Phase A: Finalize the Code

Step 3.0b: Clean Up Stale Spec-Review Findings

⛔ ALWAYS run this step — regardless of whether changes-review is enabled. Spec-review findings are stale artifacts from the planning phase that were already addressed during implementation.

bash

rm -f ~/.pilot/sessions/$PILOT_SESSION_ID/findings-spec-review-*.json

Step 3.1: Launch Code Review Agent (Early)

⛔ If PILOT_CHANGES_REVIEW_ENABLED is "false" (from Step 0), skip this step entirely and proceed to Step 3.2. (Automated checks in Step 3.2 still run; only the agent-based review is skipped.)

When enabled: Launch the reviewer IMMEDIATELY — it works in the background while you run automated checks.

3.1a: Gather Context

bash

git status --short  # Changed files
echo $PILOT_SESSION_ID

Validate session ID: If $PILOT_SESSION_ID is empty, fall back to "default" to avoid writing to ~/.pilot/sessions//.

Collect: changed files list, test framework constraints, runtime environment info, plan risks section.

Derive plan slug from the plan filename: strip the date prefix (YYYY-MM-DD-) and .md extension. Example: 2026-03-02-sku-builder-modal-cleanup.md → sku-builder-modal-cleanup.

Output path: ~/.pilot/sessions/<session-id>/findings-changes-review-<plan-slug>.json

3.1b: Launch

⛔ Delete stale changes-review findings before launching (previous run may have left a file):

bash

rm -f ~/.pilot/sessions/$PILOT_SESSION_ID/findings-changes-review-*.json

Task(
  subagent_type="pilot:changes-review",
  run_in_background=true,
  prompt="""
  **Plan file:** <plan-path>
  **User request:** <original task description that invoked /spec>
  **Changed files:** [file list]
  **Output path:** <absolute path to findings JSON>
  **Runtime environment:** [how to start, port, deploy path]
  **Test framework constraints:** [what it can/cannot test]

  Review implementation: compliance (plan match + user request match), quality (security, bugs, tests, performance), goal (achievement, artifacts, wiring).
  Performance: check for expensive uncached work on hot paths, heavy dependency imports with lighter alternatives, and repeated invocations that redo work when input hasn't changed.
  Write findings JSON to output_path using Write tool.
  IMPORTANT: Include the plan file path in your output JSON as the "plan_file" field.
  """
)

Do NOT wait. Proceed to Codex launch (if enabled) or Step 3.2 immediately.

Codex Adversarial Review (Optional — launch immediately after Claude reviewer)

If PILOT_CODEX_CHANGES_REVIEW_ENABLED is "true" (from Step 0):

Launch Codex review NOW — it runs in parallel with the Claude reviewer above.

Detect companion path:

bash

CODEX_COMPANION=$(ls ~/.claude/plugins/cache/openai-codex/codex/*/scripts/codex-companion.mjs 2>/dev/null | head -1)

Launch adversarial review in background using --scope working-tree (reviews all uncommitted changes regardless of staging state — works in both worktree and non-worktree mode). Include the plan's goal as focus text:

bash

node "$CODEX_COMPANION" adversarial-review --background --scope working-tree "Challenge this implementation: <plan summary/goal>. Plan: <plan-path>. Focus on: wrong approach, missing edge cases, security gaps, untested paths, and design choices that could fail under load."

Capture the job ID from stdout. Do NOT wait — proceed to Step 3.2 immediately.

Step 3.2: Automated Checks

Run all mechanical checks in sequence. Fix any failures before proceeding.

Full test suite — uv run pytest -q / bun test / npm test. Fix failures immediately.
Type checker — basedpyright / tsc --noEmit. Zero errors required.
Linter — ruff check / eslint. Errors are blockers, warnings acceptable.
Coverage — Verify ≥ 80%.
Build — Clean build, zero errors.
File length — Changed production files (non-test): >800 lines consider splitting, >1000 flag for review.
Plan verify commands — For each task's Verify: section, run each command wrapped in timeout 30 <cmd> || echo 'TIMEOUT'. Defer server-dependent commands (containing curl, localhost, http://, browser automation) to Phase B.
Performance audit — For each changed file on a hot path (UI render, request handler, polling loop, CLI inner loop): is expensive work (parsing, serialization, I/O, dependency loading) cached/memoized? Are heavy dependencies imported fully when lighter alternatives exist? Does repeated invocation redo work when input hasn't changed? This is a static code review — no running program needed. Performance issues from missing caching are structural and visible in the source.

Step 3.3: Feature Parity Check (migration/refactoring only)

Skip unless the plan has a Feature Inventory section.

Compare old vs new implementation
Verify each feature exists in new code
Run new code and verify same behavior

If features are MISSING: Add tasks with [MISSING] prefix, set Status: PENDING, increment Iterations, register status change, invoke Skill(skill='spec-implement', args='<plan-path>').

Step 3.4: Collect Review Results

⛔ If PILOT_CHANGES_REVIEW_ENABLED is "false" (from Step 0 — Step 3.1 was skipped), skip this step entirely and proceed to Step 3.6 (Phase B). There are no findings to collect.

When enabled — mandatory. Never skip — even if you're confident, context is high, or tests pass.

⛔ NEVER use TaskOutput to retrieve results — it dumps the full agent transcript into context, wasting thousands of tokens.

Wait for Claude reviewer results (bash polling — NOT Read loop):

bash

OUTPUT_PATH="<findings-path>"
for i in $(seq 1 250); do [ -f "$OUTPUT_PATH" ] && echo "READY" && break; sleep 2; done

Then Read the file once. If not READY after ~8 min, re-launch synchronously.

⛔ Validate findings: After reading the JSON, verify that the plan_file field matches the current plan path. If it doesn't match, the findings are stale from a previous /spec — delete the file, re-launch the reviewer, and wait again.

Fix Claude Reviewer Findings

Fix automatically — no user permission needed.

must_fix → Fix immediately (security, crashes, TDD violations)
should_fix → Fix immediately (spec deviations, missing tests, error handling)
suggestions → Implement if quick

For each fix: implement → run relevant tests → log "Fixed: [title]"

Collect Codex Results (if launched)

If Codex was launched in Step 3.1, collect its results now:

Wait for completion:

bash

node "$CODEX_COMPANION" status <jobId> --wait --timeout-ms 120000 --json

Handle status:
- waitTimedOut: true → Log "Codex review timed out — skipping" and continue.
- job.status is "cancelled" or exit code non-zero → Log "Codex review failed: <failureMessage>" and continue.
- job.status is "completed" → fetch the full result:
Get review findings:

bash

node "$CODEX_COMPANION" result <jobId> --json

Parse the result JSON — look for verdict, findings, details. Map severities: critical/high → must_fix, medium/low → should_fix. Fix all must_fix/should_fix.

Report:

## Code Verification Complete
**Issues Found:** X
### Goal Achievement: N/M truths verified
### Must Fix (N) | Should Fix (N) | Suggestions (N)

Step 3.5: Re-verification (Only for Structural Fixes)

⛔ If PILOT_CHANGES_REVIEW_ENABLED is "false" (from Step 0 — Steps 3.1/3.4 were skipped), skip this step entirely and proceed to Phase B.

When enabled: Skip when fixes were localized (terminology, error handling, test updates, minor bugs). Run tests + lint to confirm, proceed to Phase B.

Re-verify when fixes required new functionality, changed APIs, or significant new code paths: re-launch changes-review, fix new findings. Max 2 iterations before adding remaining issues to plan.

Phase B: Verify the Running Program

All code is finalized. No more code changes except critical bugs found during execution.

If runtime profile is Minimal: Run build check (Step 3.6a), then skip to Final section.

Step 3.6: Build, Deploy, and Verify Code Identity

3.6a: Build

Build/compile the project. Verify zero errors.

3.6b: Deploy (if applicable)

If project builds artifacts deployed separately from source: copy to install location, restart services. Check ps aux | grep <service> before restarting shared services.

3.6c: Code Identity Verification

⛔ Prove the running instance uses your new code before testing it.

Identify a behavioral change unique to this implementation
Craft a request only new code handles correctly (e.g., query with new parameter — new code returns filtered results, old code ignores parameter)
If response matches OLD behavior → redeploy, restart, re-verify
Do NOT proceed to execution testing until code identity is confirmed

Step 3.7: Program Execution Verification

If runtime profile is Minimal: Skip.

⚠️ Parallel spec warning: Before starting a server, check port availability: lsof -i :<port>. If another /spec session occupies it, wait or use a different port.

Program starts without errors
Inspect logs for errors/warnings/stack traces
Verify output correctness — fetch source data independently, compare against program output. If mismatch → BUG.
Test with real/sample data
Performance check (UI changes): Open the page, monitor for lag or high CPU. Watch for: components rendering expensive operations without useMemo/useCallback, eager loading of all data on mount (lazy-load instead), missing virtualization for large lists, network request storms (N+1 fetches). If page feels sluggish → profile and fix before proceeding.

Bugs: Minor → fix, re-run, continue. Major → add task to plan, set PENDING, loop back to implementation.

Step 3.8: Per-Task DoD Audit

If runtime profile is Minimal: Skip.

For EACH task, verify its Definition of Done criteria against the running program with evidence (command output, API response, screenshot).

If any criterion unmet: fix inline if possible, or add task and loop back.

Step 3.8b: Not Verified Acknowledgment

List what was NOT verified and why. Include in the verification report (Step 3.13). Every gap must have a reason:

Not Verified	Reason
[criterion or scenario]	No test environment / Out of scope / Untestable statically / Deferred

"None — all criteria have automated verification" is a valid answer if true. Do not omit this section: absence of acknowledged gaps ≠ absence of real gaps.

Step 3.9: E2E Verification (Full profile only)

If runtime profile is not Full: Skip.

3.9a: Resolve Browser Tool

Detect Claude Code Chrome: Check if mcp__claude-in-chrome__* tools are in your available/deferred tools list.

If available: Use Claude Code Chrome for all E2E steps below. Load tools via ToolSearch(query="select:mcp__claude-in-chrome__<tool>"). No session isolation needed.
If NOT available: Output the fallback warning (see browser-automation.md), then use agent-browser:

bash

AB_SESSION="${PILOT_SESSION_ID:-default}"
# ALL agent-browser commands use: --session "$AB_SESSION"

3.9b: Check for Structured Scenarios

Read the plan's ## E2E Test Scenarios section (if it exists).

If structured scenarios exist (TS-NNN format): Follow 3.9c below.

If no structured scenarios: Fall back to ad-hoc verification — test the primary user workflow (every view, interaction, state transition), then cover edge cases:

Category	What to test
Empty state	No data, no results
Invalid input	Bad params, wrong types, injection
Stale state	References to deleted data
Error state	Backend unreachable
Boundary	Max values, zero, single item

Then skip to 3.9e (close browser + write results).

3.9c: Execute Structured Scenarios

Create one task per scenario for tracking. Execute Critical first, then High, then Medium.

TaskCreate(subject="TS-NNN: [name]", description="[priority] | [preconditions]")

For each scenario:

TaskUpdate → in_progress
Execute each step using the resolved browser tool:
- Chrome: navigate to open pages, read_page after interactions, computer/form_input per the step's action
- agent-browser (fallback): open/goto to navigate, snapshot -i after interactions, click/fill/press per the step's action
- Verify the expected result by reading the page output
PASS: All steps match expected results → TaskUpdate → completed, note TS-NNN: PASS
FAIL: Step result doesn't match expected:
- Analyze root cause, implement minimal fix, re-run relevant tests (stay in Phase B — no code changes that need re-review)
- Re-execute the scenario (counts as fix attempt 1)
- If still failing: implement second fix, re-execute (fix attempt 2)
- After 2 failed fix attempts: TaskUpdate → completed, note TS-NNN: KNOWN_ISSUE — [description]
Critical KNOWN_ISSUE → set Status: PENDING, increment Iterations, register status change, invoke Skill(skill='spec-implement', args='<plan-path>') — do not proceed to VERIFIED
High/Medium KNOWN_ISSUE → document and continue (non-blocking)

3.9d: Write E2E Results to Plan

After all scenarios are executed, append to the plan file:

markdown

## E2E Results

| Scenario | Priority | Result | Fix Attempts | Notes |
|----------|----------|--------|--------------|-------|
| TS-001   | Critical | PASS   | 0            |       |
| TS-002   | High     | PASS   | 1            | Fixed: missing validation on empty submit |
| TS-003   | Medium   | KNOWN_ISSUE | 2       | Tooltip misaligned on narrow viewport |

3.9e: Close Browser

bash

# Chrome: no explicit close needed (tab remains open)
# agent-browser: agent-browser --session "$AB_SESSION" close

Final

Step 3.10: Final Regression Check

Re-run full test suite + type checker + build one final time. If code changed during Phase B this catches regressions. If no code changed, it confirms Phase A's green state — cheap insurance.

Step 3.11: Worktree Sync (if worktree active)

Extract plan slug from path (strip date prefix and .md)
Check: ~/.pilot/bin/pilot worktree detect --json <plan_slug>
If no worktree: Skip to Step 3.13.
Save plan to project root (only if gitignored):
bash
```
git -C <project_root> check-ignore -q docs/plans/<plan_filename>
```
If exit 0 (ignored): cp <worktree_plan_path> <project_root>/docs/plans/<plan_filename> If exit 1 (tracked): skip — the squash merge will bring the updated plan.
Show diff: ~/.pilot/bin/pilot worktree diff --json <plan_slug>

Notify and ask:

bash

~/.pilot/bin/pilot notify plan_approval "Worktree Sync" "<plan_name> — approve merge" --plan-path "<plan_path>" 2>/dev/null || true

AskUserQuestion: "Yes, squash merge" (Recommended) | "No, keep worktree" | "Discard all changes"

Handle choice:

Squash merge:
bash
```
# ⛔ ALL THREE operations MUST be in ONE Bash call chained with &&
# If sync fails, cleanup MUST NOT run — otherwise work is lost.
~/.pilot/bin/pilot worktree sync --json <plan_slug> && PROJECT_ROOT=$(~/.pilot/bin/pilot worktree cleanup --force --json <plan_slug> | jq -r '.project_root') && cd "$PROJECT_ROOT"
```
⛔ NEVER split sync, cleanup, or cd into separate Bash calls — compaction between them can cause work loss. ⛔ The && chain ensures cleanup only runs after a successful sync.

Keep worktree: Report path, user can sync later. Discard: cleanup --discard + cd in same bash call (no sync needed — --discard explicitly allows deleting unmerged work).

Step 3.12: Post-Merge Verification (after squash merge only)

Mandatory after successful worktree sync. The squash merge can introduce breakage from base branch divergence.

Run full test suite
Run type checker / linter
Build verification
Program launch smoke test

If any fails: fix on base branch, re-run, commit fix separately (e.g., fix: resolve post-merge regression from spec/<slug>).

⛔ Do NOT proceed to Step 3.13 until all post-merge checks pass.

Step 3.12b: Check for Code Review Feedback

Run BEFORE marking VERIFIED. Check if the user has left code review annotations in the Console's Changes tab. Annotations auto-save to the unified JSON — no "Send Feedback" button needed.

Derive the annotation file path: docs/plans/.annotations/<plan-filename>.json (same basename as the plan, .json extension).

Read the annotation file with the Read tool. If the file doesn't exist, treat as NO_FEEDBACK. If it exists, check whether codeReviewAnnotations has any entries (FEEDBACK_EXISTS) or is empty/missing (NO_FEEDBACK).

If FEEDBACK_EXISTS:

Each annotation in codeReviewAnnotations has filePath, lineStart, lineEnd, side, and text (user's annotation)
Fix all issues raised (each annotation = a required fix at the indicated file/line)
Clear code review annotations: curl -s -X DELETE "http://localhost:41777/api/annotations/code-review?path=<encoded-plan-path>" > /dev/null 2>&1 || true
Re-run tests and typecheck
Continue to Step 3.13

If NO_FEEDBACK: continue to Step 3.13.

Step 3.13: Code Review Gate (User Confirmation)

⛔ MANDATORY before marking VERIFIED. All automated checks pass — but the user should review the actual code changes.

⛔ MUST use AskUserQuestion — the stop guard only allows stopping when it detects this tool in the transcript. Plain text output will cause the stop guard to block session exit while waiting for user feedback.

Notify:

bash

~/.pilot/bin/pilot notify plan_approval "Verification Complete — Review Changes" "<plan_name> — please review code in Changes tab" --plan-path "<plan_path>" 2>/dev/null || true

Summarize what was done (brief: changes made, tests passed, issues fixed), then ask:

AskUserQuestion(
  question="All automated checks passed. Please review the code changes in the Console's **Changes** tab.\n\nYou can leave inline annotations on specific lines using the **Review** mode toggle — annotations save automatically.\n\n[brief summary of changes]\n\nChoose an option when done:",
  options=["Approve — mark spec as verified", "Fix — address my annotations from the Console"]
)

Handle response:
- "Approve" / "approve" / "lgtm" / "looks good" → proceed to Step 3.14
- "Fix" / any feedback → re-run Step 3.12b (check for code review annotations in JSON), fix issues, re-run tests, return to Step 3.13
- If user skips ("continue", "proceed") → treat as approval, proceed to Step 3.14

Step 3.14: Update Plan Status

When ALL passes AND user approves:

Set Status: VERIFIED in plan
Register: ~/.pilot/bin/pilot register-plan "<plan_path>" "VERIFIED" 2>/dev/null || true

Report completion with summary:

## Verification Complete
**Issues Found:** X
### Goal Achievement: N/M truths verified
### Must Fix (N) | Should Fix (N) | Suggestions (N)
### Not Verified: [list items from Step 3.8b, or "None"]

Instruct the user: Include in your completion message:

Run /clear before starting new work — this resets context while keeping project rules loaded.

When verification FAILS (missing features, serious bugs — before reaching Step 3.13):

Add fix tasks to plan
Set Status: PENDING, increment Iterations
Register: ~/.pilot/bin/pilot register-plan "<plan_path>" "PENDING" 2>/dev/null || true
Write ## Verification Gaps table to plan (overwrite if exists):
markdown
```
| Gap | Type | Severity | Affected Files | Fix Description |
```
Invoke Skill(skill='spec-implement', args='<plan-path>')

ARGUMENTS: $ARGUMENTS

Maintainer

maxritter Core maintainer

Source details

Full Name: maxritter/pilot-shell
Branch: main
Path in repo: pilot/skills/spec-verify
License: Other
Topics: claude-code anthropic anthropic-claude claude ai-agents ai-coding claudecode model-context-protocol claude-skills claude-ai ai-tools ai-assistant ai-engineering spec-driven-development ai-coding-tools claude-context

Featured Tools

Join Our Newsletter

Spec planning phase - explore codebase, design plan, get approval

1,637 137

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

/spec-verify - Verification Phase

⛔ KEY CONSTRAINTS

Step 0: Read Toggle Configuration

The Process

Step 3.0: Classify Runtime Profile

Phase A: Finalize the Code

Step 3.0b: Clean Up Stale Spec-Review Findings

Step 3.1: Launch Code Review Agent (Early)

3.1a: Gather Context

3.1b: Launch

Codex Adversarial Review (Optional — launch immediately after Claude reviewer)

Step 3.2: Automated Checks

Step 3.3: Feature Parity Check (migration/refactoring only)

Step 3.4: Collect Review Results

Fix Claude Reviewer Findings

Collect Codex Results (if launched)

Step 3.5: Re-verification (Only for Structural Fixes)

Phase B: Verify the Running Program

Step 3.6: Build, Deploy, and Verify Code Identity

3.6a: Build

3.6b: Deploy (if applicable)

3.6c: Code Identity Verification

Step 3.7: Program Execution Verification

Step 3.8: Per-Task DoD Audit

Step 3.8b: Not Verified Acknowledgment

Step 3.9: E2E Verification (Full profile only)

3.9a: Resolve Browser Tool

3.9b: Check for Structured Scenarios

3.9c: Execute Structured Scenarios

3.9d: Write E2E Results to Plan

3.9e: Close Browser

Final

Step 3.10: Final Regression Check

Step 3.11: Worktree Sync (if worktree active)

Step 3.12: Post-Merge Verification (after squash merge only)

Step 3.12b: Check for Code Review Feedback

Step 3.13: Code Review Gate (User Confirmation)

Step 3.14: Update Plan Status

Recommended Agent Skills

setup-rules

spec-bugfix-verify

create-skill

spec-bugfix-plan

spec-implement

spec-plan