Agent skill
ds-verify
This skill should be used when the user asks to 'verify analysis results', 'check reproducibility', 'validate data science output', 'confirm completion', or as Phase 5 of the /ds workflow.
Install this agent skill to your Project
npx add-skill https://github.com/edwinhu/workflows/tree/main/skills/ds-verify
Announce: "Using ds-verify (Phase 5) to confirm reproducibility and completion."
Contents
- The Iron Law of DS Verification
- Red Flags - STOP Immediately If You Think
- The Verification Gate
- Verification Checklist
- Reproducibility Demonstration
- Claims Requiring Evidence
- Insufficient Evidence
- Required Output Structure
- Completion Criteria
Context Monitoring
| Level | Remaining Context | Action |
|---|---|---|
| Normal | >35% | Proceed normally |
| Warning | 25-35% | Complete current review cycle, then trigger ds-handoff |
| Critical | ≤25% | Immediately trigger ds-handoff — do not start new review cycles |
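The thresholds in the table can be read as a simple lookup. The following is a minimal sketch, assuming the remaining context is available as a percentage; `context_level` and its return strings are hypothetical names, not part of the skill. It treats exactly 25% as Critical, following the ≤25% row.

```python
def context_level(remaining_pct: float) -> str:
    """Map remaining context (percent) to the monitoring level in the table."""
    if not 0 <= remaining_pct <= 100:
        raise ValueError("remaining_pct must be between 0 and 100")
    if remaining_pct > 35:
        return "normal"    # proceed normally
    if remaining_pct > 25:
        return "warning"   # finish the current review cycle, then trigger ds-handoff
    return "critical"      # trigger ds-handoff now; start no new review cycles
```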
Verification Gate
Final verification with reproducibility checks and user acceptance interview.
NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION. This is not negotiable.
Load shared enforcement first. Read ${CLAUDE_SKILL_DIR}/../../references/constraints/ds-common-constraints.md for the full constraint index.
For verification phase, load these specific constraints:
Read ${CLAUDE_SKILL_DIR}/../../references/constraints/ds-post-subagent-boundary.md
Load conventions for verification phase:
Read ${CLAUDE_SKILL_DIR}/../../references/constraints/ds-common-conventions.md for the full convention index.
Read ${CLAUDE_SKILL_DIR}/../../references/conventions/ds-assumption-over-evidence.md
Read ${CLAUDE_SKILL_DIR}/../../references/conventions/ds-deferred-verification.md
Before claiming analysis is complete, you MUST:
- RE-RUN - Execute analysis fresh (not cached results)
- CHECK - Verify outputs match expectations
- REPRODUCE - Confirm results are reproducible
- ASK - Interview user about constraints and acceptance
- Only THEN claim completion
This applies even when:
- "I just ran it"
- "Results look the same"
- "It should reproduce"
- "User seemed happy earlier"
If you catch yourself thinking "I can skip verification," STOP: you're about to deliver unverified results that waste the user's time.
Drive-Aligned Framing
| Drive | Shortcut | Consequence |
|---|---|---|
| Helpfulness | Skipping fresh re-run | You assumed prior results still hold. They don't reproduce — the user publishes irreproducible work. Your assumption is the error they discover. Anti-helpful. |
| Competence | Verifying your own work | You ran the reproducibility check yourself instead of dispatching a fresh agent. You share the implementer's biases. A fresh agent would have caught the issue. Incompetent verification. |
| Efficiency | Not running reproducibility check | You skipped the 10-minute check. The irreproducible results take 10 days to debug when someone else tries to run them. Anti-efficient. |
| Approval | Skipping user acceptance interview | You declared completion without asking the user. They discover the results don't answer their question. They now require manual review of all analysis. Lost approval. |
| Honesty | Rubber-stamping verification | You reported 'verified' without re-executing. The analysis fails on fresh data — your unverified claim wastes the user's time. |
Rationalization Table
| Excuse | Reality | Do Instead |
|---|---|---|
| "The results matched before" | Prior results don't prove current reproducibility. Code, data, or environment may have changed. | Re-run fresh and compare outputs |
| "I just need to check the numbers" | Reproducibility means re-running, not re-reading. Reading cached output proves nothing. | Execute the analysis fresh and verify outputs match |
| "The reviewer already verified this" | Review checks methodology; verification checks reproducibility. They are different gates. | Dispatch a fresh Task agent to run the reproducibility demonstration |
| "Fresh re-run will give same results" | If you're sure, running it costs nothing. If you're wrong, skipping it costs everything. | Run it. Proof is cheap, assumptions are expensive. |
| "The user is waiting" | Publishing irreproducible results wastes more time than verification. A 10-minute check prevents a 10-day retraction. | Run verification now — the user wants correct results, not fast wrong ones |
Red Flags - STOP Immediately If You Think:
| Thought | Why It's Wrong | Do Instead |
|---|---|---|
| "Results should be the same" | Your "should" isn't verification | Re-run and compare |
| "I ran it earlier" | Your earlier run isn't fresh | Run it again now |
| "It's reproducible" | Your claim requires evidence | Demonstrate reproducibility |
| "User will be happy" | Your assumption isn't their acceptance | Ask explicitly |
| "Outputs look right" | Your visual inspection isn't verified | Check against criteria |
Static Analysis (Constraint Check Scripts)
Before running runtime DQ checks, run the static analysis constraint check suite:
bash "${CLAUDE_SKILL_DIR}/../../scripts/check-all-ds.sh" "$(pwd)"
This runs all DS constraint check scripts (determinism, join audits, idempotency, error handling, schema contracts, standard errors, visualization integrity).
If any check FAILS: Report the failures in LEARNINGS.md. These are code quality issues in the analysis scripts that must be fixed before proceeding. Dispatch a fix subagent if needed.
If all checks PASS: Proceed to runtime DQ checks.
The Verification Gate
Checkpoint type: decision (user confirms results — cannot auto-advance)
Before making ANY completion claim:
1. RE-RUN → Execute fresh, not from cache
2. CHECK → Compare outputs to success criteria
3. REPRODUCE → Same inputs → same outputs
4. ASK → User acceptance interview
5. CLAIM → Only after steps 1-4
Skipping any step is not verification.
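The ordering of the gate can be made explicit in code. This is an illustrative sketch only: `GATE_STEPS`, `GateResult`, and `verification_gate` are hypothetical names, and the booleans are assumed to come from the actual re-run, checks, and interview.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Steps 1-4 of the gate, in the order they must pass.
GATE_STEPS = ("RE-RUN", "CHECK", "REPRODUCE", "ASK")


@dataclass
class GateResult:
    can_claim: bool
    blocked_on: Optional[str]  # first step that has not passed, if any


def verification_gate(passed: Dict[str, bool]) -> GateResult:
    """Allow a completion claim only when every gate step has passed."""
    for step in GATE_STEPS:
        if not passed.get(step, False):  # a missing step counts as not passed
            return GateResult(can_claim=False, blocked_on=step)
    return GateResult(can_claim=True, blocked_on=None)
```

A step that was never recorded is treated the same as a failed one, so an incomplete record can never produce a completion claim.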
Visual Diagnostics for Verification
When presenting verification results to the user in the acceptance interview, generate diagnostic plots to support the decision:
| Verification Check | Diagnostic to Generate |
|---|---|
| Reproducibility comparison | Overlay plot of Run 1 vs Run 2 key outputs |
| Data integrity | Pipeline waterfall chart (input rows → cleaning → joins → final) |
| Distribution sanity | Histogram/density plots of key variables with expected ranges annotated |
| Model performance | ROC curve, residual plot, or coefficient comparison (as appropriate) |
Format: Inline plots in notebooks, or saved to scratch/diagnostics/ for script-based workflows. Present alongside the acceptance interview questions.
Verification Checklist
Technical Verification
Outputs Match Expectations
- All required outputs generated
- Output formats correct (files, figures, tables)
- Numbers are reasonable (sanity checks)
- Visualizations render correctly
Reproducibility Confirmed
- Ran analysis twice, got same results
- Random seeds produce consistent output
- No dependency on execution order
- Environment documented (packages, versions)
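One way to document the environment is to record the interpreter and package versions at verification time. This is a sketch under assumptions: `capture_environment` is a hypothetical helper, and the package names passed to it are whatever the analysis actually depends on.

```python
import platform
from importlib import metadata


def capture_environment(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {"python": platform.python_version(), "packages": versions}
```

For example, `capture_environment(["numpy", "pandas"])` returns a dictionary that can be pasted into the Environment section of the verification report.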
Data Integrity
- Input data unchanged
- Row counts traceable through pipeline
- No silent data loss or corruption
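The data integrity items can be checked mechanically. The sketch below assumes input data lives in files and that row counts are recorded at each pipeline stage; `file_sha256` and `trace_row_counts` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's bytes; equal digests before and after a run mean the input is unchanged."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def trace_row_counts(stages):
    """Format (stage_name, row_count) pairs, showing the change at each step."""
    lines = []
    previous = None
    for name, count in stages:
        delta = "" if previous is None else f" ({count - previous:+d})"
        lines.append(f"{name}: {count}{delta}")
        previous = count
    return lines
```

Recording the digest of each input file before the run and comparing it afterwards covers "input data unchanged"; the row-count trace makes any drop between stages visible instead of silent.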
Trace to Requirements: For each success criterion, reference its requirement ID (e.g., "DATA-01: Panel has 50K+ firm-years — VERIFIED with df.shape output"). End-to-end traceability from SPEC.md through PLAN.md through VALIDATION.md through verification.
User Acceptance Interview
CRITICAL: Before claiming completion, conduct user interview.
Step 1: Replication Constraints
AskUserQuestion:
  question: "Were there specific methodology requirements I should have followed?"
  options:
    - label: "Yes, replicating existing analysis"
      description: "Results should match a reference"
    - label: "Yes, required methodology"
      description: "Specific methods were mandated"
    - label: "No constraints"
      description: "Methodology was flexible"
If replicating:
- Ask for reference to compare against
- Verify results match within tolerance
- Document any deviations and reasons
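A tolerance comparison against a reference might look like the following. This is a hedged sketch: `compare_to_reference` is a hypothetical helper, results are assumed to be scalar numbers keyed by name, and the tolerance values are placeholders that the user should set.

```python
import math


def compare_to_reference(results, reference, rel_tol=1e-6, abs_tol=0.0):
    """Return {name: (actual, expected)} for values that are missing or outside tolerance."""
    deviations = {}
    for name, expected in reference.items():
        actual = results.get(name)
        if actual is None or not math.isclose(
            actual, expected, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            deviations[name] = (actual, expected)
    return deviations
```

An empty result means every reference value was matched within tolerance; anything returned is a deviation to document along with its reason.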
Step 2: Results Verification
AskUserQuestion:
  question: "Do these results answer your original question?"
  options:
    - label: "Yes, fully"
      description: "Analysis addresses the core question"
    - label: "Partially"
      description: "Some aspects addressed, others missing"
    - label: "No"
      description: "Does not answer the question"
If "Partially" or "No":
- Ask which aspects are missing
- Return to /ds-implement to address gaps
- Re-run verification
Step 3: Output Format
AskUserQuestion:
  question: "Are the outputs in the format you need?"
  options:
    - label: "Yes"
      description: "Format is correct"
    - label: "Need adjustments"
      description: "Format needs modification"
Step 4: Confidence in Results
AskUserQuestion:
  question: "Do you have any concerns about the methodology or results?"
  options:
    - label: "No concerns"
      description: "Comfortable with approach and results"
    - label: "Minor concerns"
      description: "Would like clarification on some points"
    - label: "Major concerns"
      description: "Significant issues need addressing"
Reproducibility Demonstration
MANDATORY: Demonstrate reproducibility before completion.
You MUST NOT verify your own work. Spawn a fresh Task agent for reproducibility.
The implementer shares biases and sunk-cost attachment. A fresh subagent sees only the spec and outputs — it verifies without context pollution.
If you're about to re-run the analysis yourself, STOP. Dispatch a Task agent.
Dispatch a fresh Task agent to run the reproducibility check:
All paths below are relative to this skill's base directory.
Agent(subagent_type="general-purpose",
allowed_tools=["Read", "Glob", "Grep", "Bash(read-only)"],
prompt="""
# Reproducibility Verification
**Tool Restrictions:** The verifier is READ-ONLY. It re-runs analyses and checks output but MUST NOT modify notebooks, scripts, or code. It MUST NOT use Write or Edit.
Verify this analysis produces consistent results from a fresh run.
## Context
- Read .planning/SPEC.md for objectives and success criteria
- Read .planning/PLAN.md for expected outputs
- Read .planning/LEARNINGS.md for pipeline documentation
## Shared Checks
Read the shared check definitions:
Read `${CLAUDE_SKILL_DIR}/../../skills/ds-implement/references/ds-checks.md` and follow its instructions.
Run checks: DQ1-DQ4, DQ6, M1, R1
## Reproducibility Protocol
### For scripts:
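```python
import hashlib


def digest(result) -> str:
    # Built-in hash() of a str is salted per process, so it cannot be compared
    # across separate runs; a SHA-256 of the serialized result is stable.
    return hashlib.sha256(repr(result).encode("utf-8")).hexdigest()


# run_analysis() is a placeholder for the project's own entry point.
# Run 1
result1 = run_analysis(seed=42)
hash1 = digest(result1)
# Run 2
result2 = run_analysis(seed=42)
hash2 = digest(result2)
# Verify
assert hash1 == hash2, "Results not reproducible!"
print(f"Reproducibility confirmed: {hash1} == {hash2}")
```
### For notebooks:
jupyter nbconvert --to notebook --execute notebook.ipynb --output executed.ipynb
papermill notebook.ipynb output.ipynb -p seed 42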
## Required Checks
- RE-RUN: Execute analysis fresh (not cached results)
- CHECK: Verify outputs match SPEC.md success criteria
- REPRODUCE: Same inputs → same outputs (run twice, compare hashes)
- DATA INTEGRITY: Input data unchanged, row counts traceable
## Output
Report:
- Reproducibility: PASS/FAIL (with hash comparison)
- Data quality checks: DQ1-DQ4, DQ6 results
- Spec compliance: M1 result
- Any discrepancies found
""")
**Post-subagent boundary (C5):** After the verification agent returns, read its report only. Do NOT read source code, notebooks, or data files yourself.
**If the Task agent reports FAIL:** Dispatch a fresh Task agent to investigate the discrepancy. Do NOT investigate yourself; that violates the post-subagent boundary (C5 from ds-common-constraints.md).
## Claims Requiring Evidence
| Claim | Required Evidence |
|-------|-------------------|
| "Analysis complete" | All success criteria verified |
| "Results reproducible" | Same output from fresh run |
| "Matches reference" | Comparison showing match |
| "Data quality handled" | Documented cleaning steps |
| "Methodology appropriate" | Assumptions checked |
## Insufficient Evidence
These do NOT count as verification:
- Previous run results (must be fresh)
- "Should be reproducible" (demonstrate it)
- Visual inspection only (quantify where possible)
- Single run (need reproducibility check)
- Skipped user acceptance (must ask)
## Required Output Structure
```markdown
## Verification Report: [Analysis Name]
### Technical Verification
#### Outputs Generated
- [ ] Output 1: [location] - verified [date/time]
- [ ] Output 2: [location] - verified [date/time]
#### Reproducibility Check
- Run 1 hash: [value]
- Run 2 hash: [value]
- Match: YES/NO
#### Environment
- Python: [version]
- Key packages: [list with versions]
- Random seed: [value]
### User Acceptance
#### Replication Check
- Constraint: [none/replicating/required methodology]
- Reference: [if applicable]
- Match status: [if applicable]
#### User Responses
- Results address question: [yes/partial/no]
- Output format acceptable: [yes/needs adjustment]
- Methodology concerns: [none/minor/major]
### Verdict
**COMPLETE** or **NEEDS WORK**
[If COMPLETE]
- All technical checks passed
- User accepted results
- Reproducibility demonstrated
[If NEEDS WORK]
- [List items requiring attention]
- Recommended next steps
```
Workflow Loops (If NEEDS WORK)
- Identify which item(s) need fixing
- Return to ds-implement with specific task(s) to fix
- Re-run those tasks with output-first verification
- Update LEARNINGS.md with fixes
- Re-invoke ds-verify for fresh verification
Maximum 3 verification cycles. If issues persist after 3 rounds, escalate to user with summary of blocking issues.
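The loop and its three-cycle cap can be sketched as follows. The function and callback names are hypothetical: `verify` stands in for a full ds-verify pass that returns a list of blocking issues, and `fix` stands in for returning to ds-implement.

```python
MAX_CYCLES = 3


def run_verification_cycles(verify, fix, max_cycles=MAX_CYCLES):
    """Run verify() up to max_cycles times, calling fix(issues) between failed rounds.

    Returns ("COMPLETE", []) on success, or ("ESCALATE", issues) when the
    final allowed cycle still reports blocking issues.
    """
    issues = []
    for cycle in range(1, max_cycles + 1):
        issues = verify()
        if not issues:
            return "COMPLETE", []
        if cycle < max_cycles:
            fix(issues)  # no fix after the last cycle: escalate to the user instead
    return "ESCALATE", issues
```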
Chaining instruction (if NEEDS WORK). Discover and load ds-implement:
Read ${CLAUDE_SKILL_DIR}/../../skills/ds-implement/SKILL.md and follow its instructions.
Then fix the identified issues and re-run verification.
Completion Criteria
Only claim COMPLETE when ALL are true:
- All success criteria from SPEC.md verified
- Results reproducible (demonstrated, not assumed)
- User confirmed results address their question
- User has no major concerns
- Outputs in acceptable format
- If replicating: results match reference
Both technical and user acceptance must pass. No shortcuts.
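The criteria above combine into a single verdict. The following is a minimal sketch with hypothetical names (`Acceptance`, `verdict`); each field is assumed to be filled in from the technical checks and the user's interview answers, never from assumption.

```python
from dataclasses import dataclass


@dataclass
class Acceptance:
    spec_criteria_verified: bool
    reproducibility_demonstrated: bool
    answers_question: bool
    has_major_concerns: bool
    format_acceptable: bool
    replicating: bool = False
    matches_reference: bool = False


def verdict(a):
    """Return ("COMPLETE" | "NEEDS WORK", list of unmet criteria)."""
    unmet = []
    if not a.spec_criteria_verified:
        unmet.append("SPEC.md success criteria not all verified")
    if not a.reproducibility_demonstrated:
        unmet.append("reproducibility not demonstrated")
    if not a.answers_question:
        unmet.append("user has not confirmed results address their question")
    if a.has_major_concerns:
        unmet.append("user has major concerns")
    if not a.format_acceptable:
        unmet.append("output format not acceptable")
    if a.replicating and not a.matches_reference:
        unmet.append("results do not match reference")
    return ("COMPLETE" if not unmet else "NEEDS WORK"), unmet
```

The reference match only counts when the user said the analysis is a replication, mirroring the conditional last criterion.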
Workflow Complete
When user confirms all criteria are met:
Announce: "DS workflow complete. All 5 phases passed."
The /ds workflow is now finished. Offer to:
- Export results to final format
- Clean up .planning/ files
- Start a new analysis with /ds