Agent skill

dev-verify

This skill should be used when the user asks to 'verify completion', 'check that tests pass', or 'confirm feature works', or as the REQUIRED final phase (Phase 7) of the /dev workflow. It enforces fresh runtime evidence before any completion claim.

Install this agent skill to your Project

npx add-skill https://github.com/edwinhu/workflows/tree/main/skills/dev-verify

SKILL.md

Announce: "Using dev-verify (Phase 7) to confirm completion with fresh evidence."

Load shared enforcement:

Read ${CLAUDE_SKILL_DIR}/../../references/constraints/dev-common-constraints.md.

Verification Gate

The automated test IS your deliverable. The implementation just makes the test pass.

Reframe your task:

❌ "Implement feature X, and test it"
✅ "Write an automated test that proves feature X works. Then make it pass."

The test proves value. The implementation is a means to an end.

Without a REAL automated test (executes code, verifies behavior), you have delivered NOTHING.

NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE. This is not negotiable.

Before claiming ANYTHING is complete, you MUST:

IDENTIFY - Which command proves your assertion?
RUN - Execute the command fresh (not from cache/memory)
READ - Review full output and exit codes
VERIFY - Confirm output supports your claim
Only THEN make the claim

This applies even when:

"I just ran it a moment ago"
"The agent said it passed"
"It should work"
"I'm confident it's fine"

If you catch yourself about to claim completion without fresh evidence, STOP.

Red Flags - STOP Immediately If You Catch Yourself Thinking:

Thought	Why It's Wrong	Do Instead
"It should work"	"Should" isn't evidence	Run the command
"I'm pretty sure it passes"	Confidence isn't verification	Run the command
"The agent reported success"	Agent reports need confirmation	Run it yourself
"I ran it earlier"	Earlier isn't fresh	Run it again
"The code exists"	Existing ≠ working	Run and check output
"Grep shows the function"	Pattern match ≠ runtime test	Run the function

The Gate Function

Checkpoint type: decision (user confirms requirements met — cannot auto-advance)

Before making ANY status claim:

1. IDENTIFY → Which command proves your assertion?
2. RUN     → Execute the command fresh
3. READ    → Review full output and exit codes
4. VERIFY  → Confirm output supports your claim
5. CLAIM   → Only after steps 1-4

Skipping any step is not verification — it's shipping unverified work the user will have to debug.
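The five steps can be sketched as one shell helper. This is a minimal illustration, not part of the skill itself: the function name `verify_claim` and its argument order are hypothetical, and the expected text is whatever your own command prints on success.

```bash
# Hypothetical helper: run a command fresh, keep its full output, and
# report the claim only when the exit code is 0 AND the expected text appears.
# Usage: verify_claim "<claim>" "<expected text>" <command> [args...]
verify_claim() {
  local claim="$1" expected="$2"
  shift 2
  local log status
  log="$(mktemp)"

  # RUN: execute the command now; capture stdout and stderr together.
  "$@" >"$log" 2>&1
  status=$?

  # READ: show the full output and the exit code.
  cat "$log"
  echo "exit code: $status"

  # VERIFY: both the exit code and the output must support the claim.
  if [ "$status" -eq 0 ] && grep -qF -- "$expected" "$log"; then
    rm -f "$log"
    echo "CLAIM: $claim"   # CLAIM only after steps 1-4
    return 0
  fi
  rm -f "$log"
  echo "NOT VERIFIED: $claim"
  return 1
}
```

Checking the output as well as the exit code is deliberate: a command can exit 0 without printing the result you are about to claim.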

Claims Requiring Evidence

Claim	Required Evidence
"Tests pass"	Test output showing 0 failures
"Build succeeds"	Exit code 0 from build command
"Linter clean"	Linter output showing 0 errors
"Bug fixed"	Test that failed now passes
"Feature complete"	All acceptance criteria verified
"User-facing feature works"	E2E test output showing PASS

USER-FACING CLAIMS REQUIRE E2E EVIDENCE. Unit tests are insufficient.

Claim	Unit Test Evidence	E2E Evidence Required
"API works"	❌ Insufficient	✅ Full request/response test
"UI renders"	❌ Insufficient	✅ Playwright snapshot/interaction
"Feature complete"	❌ Insufficient	✅ User flow simulation
"No regressions"	❌ Insufficient	✅ E2E suite passes

Fake E2E Patterns - STOP

These are NOT E2E tests. They are observability, not verification.

❌ Fake E2E	✅ Real E2E
"Log shows function was called"	"Screenshot shows correct UI rendered"
"grep papirus in logs"	"grim screenshot + visual diff confirms icon changed"
"Console output contains 'success'"	"Playwright assertion: element.textContent === 'Success'"
"File was created"	"E2E test opens file and verifies contents"
"Process exited 0"	"Functional test verifies actual output matches spec"
"Mock returned expected value"	"Real integration returns expected value"

Red Flag: If you catch yourself thinking "logs prove it works" - STOP, you're about to claim false verification. Logs prove code executed, not that it produced correct results. E2E means verifying the actual output users see.
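To make the "Process exited 0" row concrete, here is a small sketch that compares what a command actually printed against an expected-output file. The function name `output_matches_spec` and the idea of keeping the spec's expected output in a file are assumptions for illustration.

```bash
# Hypothetical check: compare the program's real output with the expected
# output from the spec, instead of trusting the exit code or a log line.
# Usage: output_matches_spec <expected-file> <command> [args...]
output_matches_spec() {
  local expected_file="$1"
  shift
  local actual
  actual="$(mktemp)"

  # Capture only what the command prints on stdout.
  "$@" >"$actual" 2>/dev/null

  # diff exits 0 only when the two files are identical.
  if diff -u "$expected_file" "$actual"; then
    rm -f "$actual"
    return 0
  fi
  rm -f "$actual"
  return 1
}
```

A command that prints the wrong text still exits 0, so an exit-code check would pass it. This check fails it and shows the difference.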

Rationalization Prevention (E2E)

Thought	Reality
"Unit tests cover it"	Unit tests don't simulate users. Where's YOUR E2E?
"E2E would be redundant"	YOU'LL catch bugs with redundancy. Write E2E.
"No time for E2E"	YOU don't have time to fix production bugs? Write E2E.
"Feature is internal"	Does it affect user output? Then YOU need E2E.
"I manually tested"	YOU provided no evidence. Automate it.
"Log checking verifies it works"	YOUR log checking only verifies code executed, not results. Not E2E.
"E2E with screenshots is too complex"	If YOU can't verify it simply, your feature isn't done. Complexity = bugs hiding.
"Implementation is done, testing is just verification"	Testing IS YOUR implementation. Untested code is unfinished code.

The E2E Gate Function

For user-facing changes, add to verification:

1. IDENTIFY → Which E2E test proves user-facing behavior?
2. RUN     → Execute E2E test fresh
3. READ    → Review full output (screenshots if visual)
4. VERIFY  → User flow works as specified
5. CLAIM   → Only after E2E evidence exists

"Unit tests pass" is not "feature complete" for user-facing changes.

GUI Application Gate (CRITICAL)

GATE 1: BUILD
GATE 2: LAUNCH (with file-based logging)
GATE 3: WAIT
GATE 4: CHECK PROCESS
GATE 5: READ LOGS ← MANDATORY, CANNOT SKIP
GATE 6: VERIFY LOGS
THEN AND ONLY THEN: E2E tests/screenshots

You cannot skip GATE 5 (READ LOGS). If you catch yourself about to take screenshots without reading logs first, STOP.

For the full gate sequence with examples, discover and read skills/dev-tdd/SKILL.md via cache lookup.
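Gates 2 through 6 might look roughly like this in shell. It is a sketch under assumptions, not the sequence from dev-tdd: the function name `launch_and_check`, the `APP_PID` variable, the wait time, and the error pattern are all placeholders to adapt to your application.

```bash
# Hypothetical sketch of gates 2-6. The app stays running afterwards so that
# E2E tests or screenshots can follow; its PID is left in APP_PID.
# Usage: launch_and_check <log-file> <wait-seconds> <command> [args...]
launch_and_check() {
  local log_file="$1" wait_secs="$2"
  shift 2

  # GATE 2: LAUNCH in the background with file-based logging.
  "$@" >"$log_file" 2>&1 &
  APP_PID=$!

  # GATE 3: WAIT for startup.
  sleep "$wait_secs"

  # GATE 4: CHECK PROCESS is still alive (signal 0 only tests existence).
  if ! kill -0 "$APP_PID" 2>/dev/null; then
    echo "process exited during startup"
    cat "$log_file"
    return 1
  fi

  # GATE 5: READ LOGS, always, before any screenshot.
  cat "$log_file"

  # GATE 6: VERIFY LOGS contain no error lines (pattern is an assumption).
  if grep -qiE 'error|panic|traceback' "$log_file"; then
    echo "errors found in log"
    kill "$APP_PID" 2>/dev/null
    return 1
  fi

  echo "gates 2-6 passed, pid $APP_PID"
}
```

Only after this returns 0 would the E2E tests or screenshots run against the live process.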

If verification discovers stale or fabricated evidence in LEARNINGS.md, DELETE the contaminated entries. Do not amend false claims — remove them entirely and re-run the verification from scratch.

Insufficient Evidence

These do NOT count as verification:

Previous runs (must be fresh)
Assumptions ("it should work")
Partial checks (ran some tests, not all)
Agent reports without independent confirmation
"I think..." / "It seems..." / "Probably..."

Drive-Aligned Framing

When you say "Feature complete", you are asserting:

You ran the verification commands yourself (fresh)
You saw the output with your own tokens
The output confirms the claim

Saying "complete" based on stale data or agent reports is not "summarizing" - it is creating a false sense of completion that wastes the user's time.

"Still verifying" protects the user. "Complete" without evidence creates rework.

Rationalization Prevention

These thoughts mean STOP—you're about to claim falsely:

Thought	Reality
"I just ran it"	"Just" = stale. YOU must run it AGAIN.
"The agent said it passed"	Agent reports need YOUR confirmation. YOU run it.
"It should work"	"Should" is hope. YOU run and see output.
"I'm confident"	YOUR confidence ≠ verification. YOU run the command.
"We already verified earlier"	Earlier ≠ now. YOU need fresh evidence only.
"User will verify it"	NO. YOU verify before claiming. User trusts YOUR claim.
"Close enough"	Close ≠ complete. YOU verify fully.
"Time to move on"	YOU only move on after FRESH verification.

STRUCTURAL VERIFICATION IS NOT RUNTIME VERIFICATION:

❌ NOT Verification	✅ IS Verification
"Code exists in file"	"Code ran and produced output X"
"Function is defined"	"Function was called and returned Y"
"Grep found the pattern"	"Program output shows expected behavior"
"ast-grep found the code"	"Test executed and passed with output"
"Diff shows the change"	"Change tested with actual input/output"
"Implementation looks correct"	"Ran test, saw PASS in logs"

The key difference:

Structural: "The code IS THERE" (useless)
Runtime: "The code WORKS" (valid)

If you find yourself saying "the code exists" or "I verified the implementation" without running it, STOP - you're doing structural analysis, not verification.
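A small illustration of the difference, using a deliberately broken function. The `add` function and the temporary script are invented for this example.

```bash
# Hypothetical script under test: the function exists but is wrong.
script="$(mktemp)"
cat >"$script" <<'EOF'
add() { echo $(( $1 - $2 )); }   # bug: subtracts instead of adding
EOF

# Structural check: passes, because the definition is there.
grep -q '^add()' "$script" && echo "structural: add is defined"

# Runtime check: load the file, call the function, compare the result.
. "$script"
result="$(add 2 3)"
if [ "$result" = "5" ]; then
  echo "runtime: add works"
else
  echo "runtime: add 2 3 returned $result, expected 5"
fi
rm -f "$script"
```

The grep succeeds while the call returns -1, which is exactly the gap between "the code IS THERE" and "the code WORKS".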

Why Skipping Hurts the Thing You Care About Most

Your Drive	Why You Skip	What Actually Happens	The Drive You Failed
Helpfulness	"Report 'verified' to unblock the user"	The user discovers the failure — your shortcut created rework	Anti-helpful
Competence	"I'm confident it works without running it"	Confidence without evidence is delusion, not competence	Incompetent
Efficiency	"Prior tests still pass, skip fresh evidence"	They don't — your assumption is the bug the user discovers	Inefficient

The protocol is not overhead you pay. It is the service you provide.

Verification Patterns

Tests

bash

# Run tests (e.g., npm test, pytest, cargo test)
npm test

# Check results: "34/34 pass" = can claim tests pass
# "33/34 pass" = cannot claim success (partial fail)

Tool description: Run automated test suite to verify all tests pass
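The pass-count rule above can be checked mechanically. This sketch assumes a summary line of the form "N/M pass", which is the illustrative format from the comment above and not the output of any particular test runner. The function name `all_tests_passed` is hypothetical.

```bash
# Hypothetical parser for a summary line such as "34/34 pass".
# Real runners print different summaries; adapt the pattern to yours.
all_tests_passed() {
  local summary="$1" counts passed total

  # Pull out the first "N/M pass" fragment, if any.
  counts="$(printf '%s\n' "$summary" | grep -oE '[0-9]+/[0-9]+ pass' | head -n 1)"
  [ -n "$counts" ] || return 1   # no recognizable summary: unverified

  passed="${counts%%/*}"         # text before the slash
  total="${counts#*/}"           # text after the slash ...
  total="${total%% *}"           # ... up to the first space

  # Succeed only when at least one test ran and every test passed.
  [ "$total" -gt 0 ] && [ "$passed" -eq "$total" ]
}
```

For example, `all_tests_passed "34/34 pass"` succeeds, while "33/34 pass", "0/0 pass", and output with no summary at all are rejected.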

Regression Test

bash

# 1. Write test → run (should fail initially)
# 2. Apply fix → run (should pass)
# 3. Revert fix → run (must fail again to confirm fix)
# 4. Restore fix → run (must pass to confirm success)

Tool description: Execute regression test cycle to validate bug fix reproducibility
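The four steps can be written as a single function that stops at the first step that does not behave as expected. This is a sketch: `regression_cycle` and the three callbacks it takes are hypothetical names, and you supply the commands that run the test, apply the fix, and revert the fix.

```bash
# Hypothetical driver for the four-step cycle. Arguments are the names of
# shell functions or commands you provide:
#   $1 runs the regression test, $2 applies the fix, $3 reverts the fix.
regression_cycle() {
  local run_test="$1" apply_fix="$2" revert_fix="$3"

  # 1. Before the fix, the test must fail (it reproduces the bug).
  if "$run_test"; then echo "step 1: test passed without the fix"; return 1; fi

  # 2. With the fix applied, the test must pass.
  "$apply_fix"
  if ! "$run_test"; then echo "step 2: test failed with the fix"; return 1; fi

  # 3. With the fix reverted, the test must fail again.
  "$revert_fix"
  if "$run_test"; then echo "step 3: test passed without the fix"; return 1; fi

  # 4. With the fix restored, the test must pass.
  "$apply_fix"
  if ! "$run_test"; then echo "step 4: test failed after restoring the fix"; return 1; fi

  echo "regression cycle confirmed"
}
```

A test that passes in step 1 or step 3 does not depend on the fix, so it cannot serve as evidence that the bug is fixed.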

Build

bash

# Use ";" rather than "&&" so the exit code prints even when the build fails
npm run build; echo "Exit code: $?"
# Must see "Exit code: 0" to claim success

Tool description: Build application and verify exit code is 0

Goal-Backward Verification (Subagent)

After technical tests pass, spawn the dev-verifier agent to check that phase GOALS were achieved, not just tasks completed:

Tool Restrictions: The verifier is READ-ONLY. It runs tests via Bash and reads output but MUST NOT use Write or Edit.

Agent(subagent_type="workflows:dev-verifier",
      allowed_tools=["Read", "Glob", "Grep", "Bash(read-only)"],
      prompt="""
Verify that the dev workflow goals have been achieved for this feature.

**Tool Restrictions:** You are READ-ONLY. You MUST NOT use Write or Edit tools. You run tests and read output to verify goals — you do NOT modify code. If you find gaps, you report them — the main chat fixes them.

Read these files:
- .planning/SPEC.md (requirements and success criteria)
- .planning/PLAN.md (implementation plan)
- .planning/STATE.md (workflow state)

For each success criterion in SPEC.md, verify with FRESH runtime evidence that the goal was met.
Task completion ≠ goal achievement. A file existing ≠ feature working.

**Trace to Requirements:** For each success criterion, reference its requirement ID (e.g., "AUTH-01: Login returns JWT token — VERIFIED with test output showing..."). This creates end-to-end traceability from SPEC.md through PLAN.md through VALIDATION.md through verification.

Report:
- GOAL: [from SPEC.md success criteria]
- REQUIREMENT: [REQ-ID from SPEC.md]
- STATUS: MET | NOT_MET | PARTIAL
- EVIDENCE: [fresh runtime output proving it]

If ANY goal is NOT_MET, list the specific gaps.
""")

If dev-verifier finds gaps: Return to dev-implement to address them before proceeding to user acceptance.
If all goals MET: Proceed to user acceptance below.

User Acceptance (Final Step)

Checkpoint type: decision (user confirms completion — cannot auto-advance)

After technical verification and goal-backward verification pass, confirm with user. Use the AskUserQuestion pattern:

Tool description: Request user confirmation that implementation meets specified requirements

yaml

question: "Does this implementation meet your requirements?"
options:
  - label: "Yes, requirements met"
    description: "Feature works as designed, ready to merge"
  - label: "Partially"
    description: "Core works but missing some requirements"
  - label: "No"
    description: "Does not meet requirements, needs more work"

Reference .planning/SPEC.md when asking—remind user of the success criteria they defined.

If user responds "Partially" or "No":

Ask which specific requirement is not met
Return to /dev-implement to address gaps
MANDATORY: Re-invoke dev-verify after fixes — do not skip re-verification

NO COMPLETION CLAIMS WITHOUT RE-VERIFICATION AFTER USER FEEDBACK. This is not negotiable.

If the user says "Partially" or "No":

Track iteration in .planning/VERIFY_STATE.md:

yaml

iteration: 1
max_iterations: 3
user_feedback: "Partially - missing X"

Return to dev-implement for targeted fixes
Re-invoke dev-verify (re-read this skill from scratch)
Re-run ALL verification gates (not just the failed one)
Re-ask user acceptance question
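The iteration counter in the state file above can be advanced with a small helper that also reports when the limit has been reached. This is a sketch under assumptions: the function name `next_iteration` is hypothetical, the file is assumed to contain plain `iteration: N` and `max_iterations: N` lines, and the convention that reaching `max_iterations` triggers escalation is one reading of the rule.

```bash
# Hypothetical helper, called each time the user answers "Partially" or "No".
# Prints "continue" after bumping the counter, or "escalate" at the limit.
next_iteration() {
  local state_file="$1" iteration max tmp

  # Read the two counters from their "key: value" lines.
  iteration="$(sed -n 's/^iteration: *\([0-9][0-9]*\).*/\1/p' "$state_file" | head -n 1)"
  max="$(sed -n 's/^max_iterations: *\([0-9][0-9]*\).*/\1/p' "$state_file" | head -n 1)"
  [ -n "$iteration" ] && [ -n "$max" ] || return 2   # malformed state file

  if [ "$iteration" -ge "$max" ]; then
    echo "escalate"   # limit reached without "Yes": ask the user how to proceed
    return 0
  fi

  # Rewrite the iteration line through a temp file (portable across sed versions).
  tmp="$(mktemp)"
  sed "s/^iteration: .*/iteration: $((iteration + 1))/" "$state_file" >"$tmp" &&
    mv "$tmp" "$state_file"
  echo "continue"
}
```

Called as `next_iteration .planning/VERIFY_STATE.md`, it leaves the `user_feedback` line untouched, so that line still needs updating with the latest response.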

Escalation: After 3 iterations without "Yes", escalate to user:

"We've iterated 3 times without full acceptance. Should we continue, descope, or take a different approach?"

Claiming 'verified' after user said 'Partially' without re-running verification is NOT HELPFUL: you're telling the user their problem is solved when it isn't.

Only claim COMPLETE when:

All technical tests pass (automated)
User confirms requirements met (manual)
If re-verification: iteration tracked and all gates re-run

Bottom Line

Two types of verification required:

Technical - Run commands, see output, confirm no errors
Requirements - Ask user if it does what they wanted

Both must pass. No shortcuts exist.

Workflow Complete

When user confirms "Yes, requirements met":

Announce: "Dev workflow complete. All 7 phases passed."

The /dev workflow is now finished. Offer to:

Commit the changes
Clean up .planning/ files
Start a new feature with /dev

Key Principles

Fresh Evidence Always: Every claim requires proof from a fresh command execution, not cached results or agent reports.

Runtime Over Structural: Verify code works by running it, not by checking if code exists. Structural analysis cannot prove behavior.

E2E for User-Facing: User-visible features require end-to-end evidence (screenshots, user flow tests), not unit tests alone.

Drive-Aligned Framing: Claiming completion without fresh evidence creates rework for the user. Only advance when fully verified.

Maintainer

edwinhu Core maintainer

Source details

Full Name: edwinhu/workflows
Branch: main
Path in repo: skills/dev-verify
