Agent skill

judge

Scoring framework for the test-kitchen cookoff and omakase-off skills. Invoked at Phase 4 to evaluate implementations against 5 criteria. Do not invoke directly; it is called by cookoff/omakase-off.

Install this agent skill to your Project

npx add-skill https://github.com/2389-research/claude-plugins/tree/main/test-kitchen/skills/judge

SKILL.md

Test Kitchen Judge

Score implementations using the 5-criteria framework. Fill out ALL sections exactly as shown.

Terminology: This skill uses "impl" but works for both:

- Cookoff: impl-1, impl-2, impl-3 (same design, different implementations)
- Omakase: variant-a, variant-b (different approaches/designs)

REQUIRED OUTPUT FORMAT

You MUST produce this exact structure. Do not summarize or abbreviate.

markdown

## Gate Check
| Impl | Tests Pass | Design Adherence |
|------|------------|------------------|
| impl-1 | X/X ✓ or ✗ | Yes/No |
| impl-2 | X/X ✓ or ✗ | Yes/No |

## Feasibility Check
| Impl | Status | Notes |
|------|--------|-------|
| impl-1 | ✓ OK / ⚠️ Flag | Details |
| impl-2 | ✓ OK / ⚠️ Flag | Details |

## Scoring Worksheet

### impl-1
**Fitness for Purpose** (Does it solve the actual problem?)

*Functional requirements:*
- [ ] Primary use case works end-to-end?
- [ ] All explicitly stated requirements implemented?
- [ ] Handles realistic scenarios, not just happy path?

*User needs (beyond literal requirements):*
- [ ] Would the user actually use this, or just demo it?
- [ ] Does it solve the real problem, not just the literal request?
- [ ] Does deployment/distribution match stated needs?

*Future considerations (if relevant):*
- [ ] If growth/scaling mentioned, does architecture support it?
- [ ] If team/collaboration mentioned, is it maintainable by others?

Checklist: _/8 YES → **Score: _/5** (7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1)
*Note: Not all items apply to every project. Score based on relevant items.*

**Justified Complexity** (Every line earning its keep?)
- Unnecessary abstractions: ___
- Dead code: ___
- Bloat estimate: ___%

*Line count comparison (if multiple impls):*
- This impl: ___ lines
- Smallest impl: ___ lines
- Extra lines justified by: ___

→ **Score: _/5** (5=minimal, 4=slight bloat <10%, 3=10-25% bloat, 2=25-50%, 1=>50%)

**Readability** (Understand core flow in 5 min?)
Violations:
- [ ] Single-letter vars (not loop index): +1 each = __
- [ ] Functions >50 lines: +1 each = __
- [ ] Nesting >3 levels: +1 each = __
- [ ] Magic numbers: +1 each = __
- [ ] Bad function names: +1 each = __
Total violations: __ → **Score: _/5** (0=5, 1-2=4, 3-4=3, 5-7=2, 8+=1)

**Robustness & Scale** (Handles unexpected + growth?)
- [ ] Input validation?
- [ ] External call error handling?
- [ ] Useful error messages?
- [ ] Null/empty handling?
- [ ] Async timeouts?
- [ ] No unbounded loops?
- [ ] O(n log n) or better?
- [ ] Bounded memory?
- [ ] Queries paginated?
- [ ] No blocking I/O in hot path?
- [ ] Backoff/retry logic?
- [ ] Handles 10x load?
Checklist: _/12 YES + feasibility flags → **Score: _/5**
(11-12 + no flags=5, 9-10 or minor flag=4, 7-8=3, 5-6 or major flag=2, <5 or critical flag=1)

**Maintainability** (Pain of next change?)
- [ ] Single responsibility per function?
- [ ] Explicit dependencies (no globals)?
- [ ] Business logic separated from infra?
- [ ] New feature = ≤3 files changed?
- [ ] Config externalized?
- [ ] Tests catch regressions?
Checklist: _/6 YES → **Score: _/5** (6=5, 5=4, 4=3, 2-3=2, 0-1=1)

### impl-2
[REPEAT SAME FORMAT]

### impl-3 (if applicable)
[REPEAT SAME FORMAT]

## Judge Scorecard
| Criterion | impl-1 | impl-2 | impl-3 | Best |
|-----------|--------|--------|--------|------|
| Fitness for Purpose | | | | |
| Justified Complexity | | | | |
| Readability | | | | |
| Robustness & Scale | | | | |
| Maintainability | | | | |
| **TOTAL** | /25 | /25 | /25 | |

## Hard Gates
| Gate | Result |
|------|--------|
| Fitness Gate (Δ ≥ 2) | Triggered/Not triggered |
| Critical Flaw (any = 1) | Triggered/Not triggered |

## Winner Selection
**Winner: impl-X** (Score: __/25)

**Selection rationale:**
[2-3 sentences explaining WHY this implementation won]

**Trade-offs acknowledged:**
[What the other implementations did better]

Scoring Reference

Scores Meaning

| Score | Meaning |
|-------|---------|
| 5 | Excellent - exceeds expectations |
| 4 | Good - fully meets requirements |
| 3 | Adequate - core works, some gaps |
| 2 | Poor - significant issues |
| 1 | Critical flaw - disqualifying |
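
The checklist-to-score bands quoted in the worksheet can be written as small lookup functions. This is a minimal sketch, not part of the skill itself: the function names are hypothetical, and it assumes every checklist item applies (the worksheet notes that some items may not, in which case the judge scores on the relevant items only).

```python
def fitness_score(yes_count: int) -> int:
    """Fitness for Purpose, _/8 YES: 7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1."""
    if not 0 <= yes_count <= 8:
        raise ValueError("fitness checklist has 8 items")
    if yes_count >= 7:
        return 5
    if yes_count >= 5:
        return 4
    if yes_count == 4:
        return 3
    if yes_count >= 2:
        return 2
    return 1


def readability_score(violations: int) -> int:
    """Readability, total violations: 0=5, 1-2=4, 3-4=3, 5-7=2, 8+=1."""
    if violations < 0:
        raise ValueError("violation count cannot be negative")
    if violations == 0:
        return 5
    if violations <= 2:
        return 4
    if violations <= 4:
        return 3
    if violations <= 7:
        return 2
    return 1


def maintainability_score(yes_count: int) -> int:
    """Maintainability, _/6 YES: 6=5, 5=4, 4=3, 2-3=2, 0-1=1."""
    if not 0 <= yes_count <= 6:
        raise ValueError("maintainability checklist has 6 items")
    if yes_count == 6:
        return 5
    if yes_count == 5:
        return 4
    if yes_count == 4:
        return 3
    if yes_count >= 2:
        return 2
    return 1


# Example: 6/8 fitness items, 3 readability violations, 5/6 maintainability items.
print(fitness_score(6), readability_score(3), maintainability_score(5))  # 4 3 4
```

Every band returns an integer, which matches the rule that half points are not allowed.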

Hard Gates (Automatic)

- Fitness Gate: If Fitness Δ ≥ 2 between impls → Higher fitness WINS immediately
- Critical Flaw: If ANY criterion = 1 → That impl is ELIMINATED
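
The two gates can be sketched as a selection function. This is one possible reading, not the skill's own implementation. Several points are assumptions the source does not settle: the short criterion keys are hypothetical, eliminated impls are removed before the Fitness Gate is checked, the gate fires only when the top Fitness score leads the runner-up by at least 2, and ties on total are returned as a list rather than broken.

```python
CRITERIA = ("fitness", "complexity", "readability", "robustness", "maintainability")


def total(scores: dict) -> int:
    """Sum of the five criterion scores (out of 25)."""
    return sum(scores[c] for c in CRITERIA)


def select_winner(scorecard: dict) -> tuple:
    """Return (winner names, reason) for a {impl name: {criterion: score}} scorecard."""
    # Critical Flaw gate: any criterion scored 1 eliminates that impl.
    survivors = {
        name: scores
        for name, scores in scorecard.items()
        if min(scores[c] for c in CRITERIA) > 1
    }
    if not survivors:
        return [], "all eliminated"

    # Fitness Gate (assumed reading): leader beats the runner-up by 2 or more.
    ranked = sorted(survivors, key=lambda n: survivors[n]["fitness"], reverse=True)
    if len(ranked) >= 2:
        gap = survivors[ranked[0]]["fitness"] - survivors[ranked[1]]["fitness"]
        if gap >= 2:
            return [ranked[0]], "fitness gate"

    # Otherwise the highest total wins; ties are reported, not broken.
    best = max(total(s) for s in survivors.values())
    return [n for n, s in survivors.items() if total(s) == best], "total"


scorecard = {
    "impl-1": {"fitness": 5, "complexity": 3, "readability": 3,
               "robustness": 3, "maintainability": 3},   # total 17
    "impl-2": {"fitness": 3, "complexity": 5, "readability": 5,
               "robustness": 5, "maintainability": 5},   # total 23
}
print(select_winner(scorecard))  # (['impl-1'], 'fitness gate')
```

In the example, impl-2 has the higher total, but the Fitness gap of 2 decides the result in favor of impl-1.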

Fitness Gate Interpretation

The Fitness Gate triggers the same way in both contexts, but means different things:

| Context | What Fitness Δ ≥ 2 Means |
|---------|--------------------------|
| Cookoff | One implementation deviated from or misunderstood the design. All impls should have similar Fitness since they're implementing the same spec. A large gap is a red flag. |
| Omakase | One approach genuinely solves the problem better. Different approaches can legitimately have different Fitness. A large gap means one approach is clearly superior. |

In both cases, higher Fitness wins. The interpretation just explains why the gap exists.

Feasibility Red Flags

Check before scoring:

- O(n²) or worse on unbounded data
- Unbounded memory growth
- Self-DDoS patterns (polling, no backoff)
- Missing pagination
- Blocking I/O in hot path
- No error recovery
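
Feasibility flags feed into the Robustness & Scale score alongside the 12-item checklist. The sketch below shows one way to combine them, assuming the flag acts as a cap on the checklist band (so a minor flag limits the score to 4, a major flag to 2, a critical flag to 1). The worksheet only lists the bands, so this cap rule, the function name, and the `worst_flag` parameter are assumptions.

```python
# Highest score still reachable for each flag severity (assumed cap rule).
FLAG_CAPS = {"none": 5, "minor": 4, "major": 2, "critical": 1}


def robustness_score(yes_count: int, worst_flag: str = "none") -> int:
    """Robustness & Scale: 11-12=5, 9-10=4, 7-8=3, 5-6=2, <5=1, capped by flags."""
    if not 0 <= yes_count <= 12:
        raise ValueError("robustness checklist has 12 items")
    if worst_flag not in FLAG_CAPS:
        raise ValueError(f"unknown flag severity: {worst_flag!r}")

    if yes_count >= 11:
        base = 5
    elif yes_count >= 9:
        base = 4
    elif yes_count >= 7:
        base = 3
    elif yes_count >= 5:
        base = 2
    else:
        base = 1

    # The most severe feasibility flag limits the final score.
    return min(base, FLAG_CAPS[worst_flag])


print(robustness_score(12))           # 5
print(robustness_score(12, "minor"))  # 4
print(robustness_score(11, "major"))  # 2
```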

Process

1. Read all implementation code (should already be in context)
2. Fill out the worksheet for EACH implementation - do not skip sections
3. Check hard gates
4. Announce winner with rationale

CRITICAL: Use integer scores only (1-5). Do not use half points like 4.5.

CRITICAL: Fill out every checkbox. Do not summarize or abbreviate the worksheet.
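
The integer-only rule can be checked mechanically before the scorecard is totalled. The following is a hedged sketch: the `validate_scores` helper and the short criterion keys are hypothetical names chosen for illustration.

```python
CRITERIA = ("fitness", "complexity", "readability", "robustness", "maintainability")


def validate_scores(scores: dict) -> int:
    """Check one impl's scores and return its total out of 25.

    Raises ValueError if a criterion is missing, is not an integer,
    or falls outside the 1-5 range.
    """
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        value = scores[criterion]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{criterion}: score must be an integer, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion}: score must be between 1 and 5, got {value}")
    return sum(scores[c] for c in CRITERIA)


scores = {"fitness": 4, "complexity": 4, "readability": 4,
          "robustness": 4, "maintainability": 4}
print(validate_scores(scores))  # 20
```

A half-point score such as 4.5 raises `ValueError` instead of being silently rounded.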

Maintainer

2389-research Core maintainer

Source details

Full Name: 2389-research/claude-plugins
Branch: main
Path in repo: test-kitchen/skills/judge
