Agent skill

spec-tests

Intent-based specification tests evaluated by LLM-as-judge. Use when the user asks to "create spec tests", "write intent tests", "TDD with intent", "natural language tests", or wants tests that capture WHY, not just WHAT. NOT pytest/jest/unittest - natural language specs Claude evaluates.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/spec-tests

SKILL.md

Spec Tests: Intent-Based Testing for LLM Development

Spec tests are intent-based specifications that Claude evaluates as judge. They capture WHY something matters—making them cheat-proof for LLM-driven development.

The TDD Flow

1. Plan       → Define what you're building
2. Spec (red) → Write intent tests (they fail - no implementation yet)
3. Implement  → Build the feature
4. Spec (green) → Tests pass (Claude confirms intent is satisfied)

Test File Format

markdown
# Feature Name

## Test Group

### Test Case Name

Intent statement explaining WHY this test matters. What user need does it serve?
What breaks if this doesn't work?

\`\`\`
Given [precondition]
When [action]
Then [expected outcome]
\`\`\`

Structure: H2 = test group, H3 = test case, intent = required statement, code block = expected behavior.

Critical: Intent statement must appear immediately above the code block, between the H3 header and the assertion block. Section-level intent does not count—each test case needs its own WHY directly before its code block.

Each test must include a fenced code block. Missing code blocks fail with [missing-assertion].


Test Location & Targets

Spec tests live in specs/tests/ and declare their target(s) via frontmatter.

Single target:

markdown
---
target: src/auth.py
---
# Authentication Tests

Multiple targets:

markdown
---
target:
  - src/auth.py
  - src/session.py
---
# Authentication Flow

Directory structure — name files by feature/spec, not by target path:

specs/tests/
  authentication.md      ← target: [src/auth.py, src/session.py]
  intent-requirement.md  ← target: [SKILL.md]
  api-validation.md      ← target: [src/api/validate.py]

Frontmatter is required. Missing target: causes immediate failure with [missing-target].


Running Tests

Copy the runner files to your project:

bash
cp "${CLAUDE_PLUGIN_ROOT}/scripts/run_tests_claude.py" specs/tests/
cp "${CLAUDE_PLUGIN_ROOT}/scripts/judge_prompt.md" specs/tests/

Run tests:

bash
python specs/tests/run_tests_claude.py specs/tests/authentication.md  # Single spec
python specs/tests/run_tests_claude.py specs/tests/                   # All specs
python specs/tests/run_tests_claude.py specs/tests/auth.md --test "Valid Credentials"  # Single test

Uses claude -p (your subscription, no API key needed).

Options:

Flag Purpose
--target FILE Override frontmatter target
--model MODEL Claude model (default: sonnet)
--test "Name" Run only named test

Timeout: 60-300 seconds per test.


Why Intent Matters

LLMs can "game" tests by changing them instead of fixing code.

Without intent (fails with `[missing-intent]}):

markdown
### Completes Quickly
\`\`\`
elapsed < 50ms
\`\`\`

LLM thinks: "50 seems arbitrary, change to 100." User gets laggy editor.

With intent:

markdown
### Completes Quickly

Users perceive delays over 50ms as laggy. This runs on every keystroke.
The 50ms target is a UX requirement, not negotiable.

\`\`\`
Given a keystroke event
When process_keystroke() is called
Then it completes in under 50ms
\`\`\`

Claude-as-judge evaluates: Does it satisfy the UX requirement? Relaxing threshold → [intent-violated].

Intent properties:

  • Required — Missing intent → [missing-intent] before evaluation
  • Per-test — Each test needs its own WHY above the code block
  • Business-focused — Why users/product care, not technical details
  • Evaluative — Catches "legal but wrong" solutions

Reference Files

For detailed patterns, consult:

  • references/evaluation.md — Error codes, response format, strictness rules, alternative runners, template variables
  • references/multi-target.md — Writing tests for multiple targets, multi-file Given syntax
  • references/examples.md — Complete examples, porting tests across languages
  • references/meta-content.md — Testing prompt files and directive-like content

Checklist

  • Each test has intent statement explaining WHY
  • Intent is business/user focused
  • Expected behavior is clear
  • Each test includes a fenced assertion code block
  • One behavior per test case
  • Multi-target specs: each test starts with Given the <target> file

Missing intent = immediate failure. The runner rejects tests without intent statements before evaluating behavior.

Didn't find tool you were looking for?

Be as detailed as possible for better results