Agent skill

judge-llm

Ultrathink LLM-as-Judge validation of completed work. Uses extended thinking by DEFAULT for thorough evaluation.


Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/judge-llm

SKILL.md

/sw:judge-llm - Ultrathink LLM-as-Judge Validation

ULTRATHINK BY DEFAULT - Validate completed work using extended thinking and the LLM-as-Judge pattern.

Implementation: Opus Model + Timeout Handling

  • Model: opus for deepest analysis
  • Timeout: 60 seconds default (configurable with --timeout)
  • Progress Log: .specweave/logs/judge-llm.log

Implementation in src/core/skills/skill-judge.ts:

  • Uses Anthropic SDK with user's ANTHROPIC_API_KEY
  • AbortController-based timeout to prevent stuck states
  • Progress logging for visibility during evaluation
  • Fallback to basic pattern matching if no API key
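
The AbortController-based timeout mentioned above could look roughly like the following. This is a minimal sketch, not the code in src/core/skills/skill-judge.ts: the `runWithTimeout` helper and the `TimedResult` shape are hypothetical, and only the `timedOut` field name comes from this document. It assumes the wrapped task honors the `AbortSignal` it receives.

```typescript
// Hypothetical result shape; only `timedOut` is named in this document.
interface TimedResult<T> {
  value?: T;
  timedOut: boolean;
}

// Runs `task` with an AbortSignal and aborts it once `timeoutMs` elapses.
// The task must reject (or stop) when the signal fires; a task that
// ignores the signal cannot be interrupted this way.
async function runWithTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = 60_000, // documented default: 60 seconds
): Promise<TimedResult<T>> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const value = await task(controller.signal);
    return { value, timedOut: false };
  } catch (error) {
    // A failure after our own timer fired is reported as a timeout;
    // any other failure is rethrown unchanged.
    if (controller.signal.aborted) {
      return { timedOut: true };
    }
    throw error;
  } finally {
    clearTimeout(timer);
  }
}
```

In practice the task would be whatever request the judge sends to the model, with the signal forwarded to a client that supports cancellation.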

CRITICAL: Extended Thinking is DEFAULT

Unless --quick is passed, this command uses ultrathink (extended thinking) for thorough LLM-as-Judge evaluation:

DEFAULT BEHAVIOR = ULTRATHINK MODE
- Extended thinking enabled
- Deep chain-of-thought reasoning
- Thorough multi-dimensional analysis
- ~60-90 seconds for comprehensive evaluation
- Uses Opus model for maximum quality

Use --quick only if you explicitly need faster (but less thorough) validation.

Purpose

Use when you've completed work and want maximum-quality AI validation:

  • Works on any files (not just SpecWeave increments)
  • Uses ultrathink extended thinking for deepest analysis
  • Returns clear verdict with detailed reasoning

Usage

# DEFAULT: Ultrathink validation (recommended)
/sw:judge-llm src/file.ts
/sw:judge-llm "src/**/*.ts"

# Validate git changes (ultrathink by default)
/sw:judge-llm --staged           # Staged changes
/sw:judge-llm --last-commit      # Last commit
/sw:judge-llm --diff main        # Diff vs branch

# Quick mode (ONLY if you need speed over thoroughness)
/sw:judge-llm src/file.ts --quick

# Timeout control (value in milliseconds; default: 60000 = 60s)
/sw:judge-llm src/file.ts --timeout 120000   # 120 seconds
/sw:judge-llm src/file.ts --timeout 30000    # 30 seconds (faster cutoff)

# Additional options
/sw:judge-llm src/file.ts --strict   # Fail on any concern
/sw:judge-llm src/file.ts --fix      # Include fix suggestions
/sw:judge-llm src/file.ts --export   # Export report to markdown
/sw:judge-llm src/file.ts --verbose  # Show progress to console
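
To make the flag semantics above concrete, here is a small sketch of how the arguments could be parsed into an options object. The `JudgeOptions` shape and the `parseJudgeArgs` name are hypothetical and are not taken from the skill's source; only the flag names and the 60000 ms default come from this document.

```typescript
// Hypothetical option shape; field names are illustrative only.
interface JudgeOptions {
  targets: string[]; // file paths or glob patterns
  staged: boolean;
  lastCommit: boolean;
  diffBranch?: string;
  quick: boolean;
  strict: boolean;
  fix: boolean;
  exportReport: boolean;
  verbose: boolean;
  timeoutMs: number;
}

function parseJudgeArgs(argv: string[]): JudgeOptions {
  const options: JudgeOptions = {
    targets: [], staged: false, lastCommit: false, quick: false,
    strict: false, fix: false, exportReport: false, verbose: false,
    timeoutMs: 60_000, // documented default
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--staged": options.staged = true; break;
      case "--last-commit": options.lastCommit = true; break;
      case "--quick": options.quick = true; break;
      case "--strict": options.strict = true; break;
      case "--fix": options.fix = true; break;
      case "--export": options.exportReport = true; break;
      case "--verbose": options.verbose = true; break;
      case "--diff": {
        const branch = argv[++i];
        if (branch === undefined) throw new Error("--diff requires a branch name");
        options.diffBranch = branch;
        break;
      }
      case "--timeout": {
        // The value is given in milliseconds, as in the usage examples.
        const ms = Number(argv[++i]);
        if (!Number.isFinite(ms) || ms <= 0) throw new Error("--timeout requires a positive number of milliseconds");
        options.timeoutMs = ms;
        break;
      }
      default:
        if (arg.startsWith("--")) throw new Error(`Unknown flag: ${arg}`);
        options.targets.push(arg); // anything else is treated as a path or glob
    }
  }
  return options;
}
```

For example, `parseJudgeArgs(["src/file.ts", "--timeout", "120000", "--strict"])` yields one target, a 120-second timeout, and strict mode enabled.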

Visibility & Stuck Detection

Progress is always logged to .specweave/logs/judge-llm.log:

[2026-01-19T10:30:00.000Z] [0.0s] [INFO] Starting LLM Judge evaluation for domain: backend
[2026-01-19T10:30:00.001Z] [0.0s] [INFO] Task: Validate authentication implementation...
[2026-01-19T10:30:00.002Z] [0.0s] [INFO] Using model: opus
[2026-01-19T10:30:00.003Z] [0.0s] [INFO] Timeout: 60000ms
[2026-01-19T10:30:00.004Z] [0.0s] [PROGRESS] Sending request to Opus...
[2026-01-19T10:30:45.000Z] [45.0s] [PROGRESS] Response received, parsing...

If evaluation gets stuck:

  1. Check .specweave/logs/judge-llm.log for last progress
  2. Default timeout (60s) will abort if stuck
  3. Increase timeout with --timeout if legitimately slow
  4. Result will show timedOut: true if aborted
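
Checking the log for the last progress entry can also be scripted. The sketch below assumes every log line follows the `[timestamp] [elapsed] [LEVEL] message` layout shown in the sample above; the `lastLogEntry` and `secondsSinceLastEntry` helpers are hypothetical and not part of the skill.

```typescript
interface LogEntry {
  timestamp: string; // ISO 8601, as written in the log
  elapsedSeconds: number;
  level: string;
  message: string;
}

// Matches lines such as:
// [2026-01-19T10:30:00.004Z] [0.0s] [PROGRESS] Sending request to Opus...
const LOG_LINE = /^\[([^\]]+)\] \[(\d+(?:\.\d+)?)s\] \[([A-Z]+)\] (.*)$/;

// Returns the last well-formed entry in the log text, if any.
function lastLogEntry(logText: string): LogEntry | undefined {
  const lines = logText.split(/\r?\n/);
  for (let i = lines.length - 1; i >= 0; i--) {
    const match = LOG_LINE.exec(lines[i]);
    if (match) {
      return {
        timestamp: match[1],
        elapsedSeconds: Number(match[2]),
        level: match[3],
        message: match[4],
      };
    }
  }
  return undefined;
}

// Seconds between the entry's timestamp and `now`; a large value while the
// command is still running suggests the evaluation is waiting on the model.
function secondsSinceLastEntry(entry: LogEntry, now: Date): number {
  return (now.getTime() - Date.parse(entry.timestamp)) / 1000;
}
```

Reading the file itself (for example with Node's `fs.readFileSync`) is left out so the parsing logic stays independent of the file system.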

How It Works

When you invoke /sw:judge-llm, Claude will:

Step 1: Gather Input

Determine what to validate:

  • If file paths provided → read those files
  • If --staged → get staged git changes
  • If --last-commit → get files from last commit
  • If --diff <branch> → get diff against branch
  • If no args → validate recent work in conversation context
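
The decision above can be read as a simple priority chain. The following sketch is an illustration under assumptions: the `resolveInputSource` function, its types, the precedence order (explicit paths first), and the specific git invocations are all hypothetical choices, since this document does not say which git commands the skill runs.

```typescript
// Hypothetical description of where the material to validate comes from.
type InputSource =
  | { kind: "files"; paths: string[] }
  | { kind: "git"; args: string[] } // arguments to pass to the `git` binary
  | { kind: "conversation" };

interface GatherOptions {
  targets: string[];
  staged?: boolean;
  lastCommit?: boolean;
  diffBranch?: string;
}

function resolveInputSource(options: GatherOptions): InputSource {
  if (options.targets.length > 0) {
    return { kind: "files", paths: options.targets };
  }
  if (options.staged) {
    // Assumed command: show staged changes.
    return { kind: "git", args: ["diff", "--staged"] };
  }
  if (options.lastCommit) {
    // Assumed command: list the files touched by the last commit.
    return { kind: "git", args: ["diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"] };
  }
  if (options.diffBranch) {
    // Assumed command: diff the working tree against the given branch.
    return { kind: "git", args: ["diff", options.diffBranch] };
  }
  // No arguments: fall back to the recent work in the conversation.
  return { kind: "conversation" };
}
```

Returning the git arguments rather than running git keeps the function pure, which makes the selection logic easy to test in isolation.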

Step 2: ULTRATHINK Analysis (Default)

MANDATORY: Use extended thinking for deep LLM-as-Judge evaluation:

Claude MUST use ultrathink/extended thinking to:

1. **DEEP READ**: Thoroughly understand all code, context, and intent
2. **MULTI-DIMENSIONAL ANALYSIS**: Evaluate across ALL dimensions:
   - Correctness: Does it work exactly as intended?
   - Completeness: ALL edge cases handled? ALL requirements met?
   - Security: ANY vulnerabilities? OWASP Top 10 checked?
   - Performance: Algorithmic complexity? Memory usage? Bottlenecks?
   - Maintainability: Clean? Clear? Follows conventions?
   - Testability: Can it be tested? Are tests adequate?
   - Error handling: All failure modes covered?
3. **CRITICAL EVALUATION**: Weigh ALL findings by severity
4. **REASONED VERDICT**: Form verdict based on thorough analysis
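
One way to turn the dimensions listed above into an evaluation request is to assemble them into a prompt. This is only a sketch: the `buildJudgePrompt` function and the surrounding prompt wording are hypothetical, and the actual prompt used by the skill is not shown in this document. The dimension names and questions are copied from the list above.

```typescript
interface SourceFile {
  path: string;
  content: string;
}

// The seven dimensions named in Step 2.
const DIMENSIONS = [
  { name: "Correctness", question: "Does it work exactly as intended?" },
  { name: "Completeness", question: "ALL edge cases handled? ALL requirements met?" },
  { name: "Security", question: "ANY vulnerabilities? OWASP Top 10 checked?" },
  { name: "Performance", question: "Algorithmic complexity? Memory usage? Bottlenecks?" },
  { name: "Maintainability", question: "Clean? Clear? Follows conventions?" },
  { name: "Testability", question: "Can it be tested? Are tests adequate?" },
  { name: "Error handling", question: "All failure modes covered?" },
];

// Builds a plain-text prompt: instructions, numbered checklist, then sources.
function buildJudgePrompt(files: SourceFile[]): string {
  const checklist = DIMENSIONS
    .map((d, i) => `${i + 1}. ${d.name}: ${d.question}`)
    .join("\n");
  const sources = files
    .map((f) => `--- ${f.path} ---\n${f.content}`)
    .join("\n\n");
  return [
    "You are acting as a judge of completed work.",
    "Evaluate the files below on every dimension, weigh findings by severity,",
    "and finish with one verdict: APPROVED, CONCERNS, or REJECTED.",
    "",
    checklist,
    "",
    sources,
  ].join("\n");
}
```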

Step 3: Return Verdict

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED | CONCERNS | REJECTED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: ULTRATHINK (extended thinking)
Confidence: 0.XX
Files Analyzed: N

REASONING:
[Detailed chain-of-thought from extended thinking]

ISSUES (if any):
🔴 CRITICAL: [title]
   [description]
   📍 [file:line]
   💡 [suggestion]

🟡 HIGH: [title]
   ...

🟢 LOW: [title]
   ...

VERDICT: [summary sentence]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
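
The report layout above maps naturally onto a structured result. The sketch below renders such a result as text; the `JudgeResult` and `Issue` types and the `renderVerdict` function are hypothetical, and the rule width is chosen arbitrarily rather than measured from the template.

```typescript
type Severity = "CRITICAL" | "HIGH" | "LOW";
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

interface Issue {
  severity: Severity;
  title: string;
  description: string;
  location: string; // file:line
  suggestion: string;
}

interface JudgeResult {
  verdict: Verdict;
  mode: string;
  confidence: number; // between 0 and 1
  filesAnalyzed: number;
  reasoning: string;
  issues: Issue[];
  summary: string;
}

const RULE = "━".repeat(57); // arbitrary width
const ICONS: Record<Severity, string> = { CRITICAL: "🔴", HIGH: "🟡", LOW: "🟢" };

function renderVerdict(result: JudgeResult): string {
  const lines: string[] = [
    RULE,
    `JUDGE-LLM VERDICT: ${result.verdict}`,
    RULE,
    "",
    `Mode: ${result.mode}`,
    `Confidence: ${result.confidence.toFixed(2)}`,
    `Files Analyzed: ${result.filesAnalyzed}`,
    "",
    "REASONING:",
    result.reasoning,
    "",
  ];
  if (result.issues.length > 0) {
    lines.push("ISSUES:");
    for (const issue of result.issues) {
      lines.push(
        `${ICONS[issue.severity]} ${issue.severity}: ${issue.title}`,
        `   ${issue.description}`,
        `   📍 ${issue.location}`,
        `   💡 ${issue.suggestion}`,
        "",
      );
    }
  }
  lines.push(`VERDICT: ${result.summary}`, RULE);
  return lines.join("\n");
}
```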

Verdict Meanings

| Verdict | Meaning | Action |
|---------|---------|--------|
| APPROVED | Work is solid, no significant issues | Safe to proceed |
| CONCERNS | Issues found worth addressing | Review and fix recommended |
| REJECTED | Critical issues found | MUST fix before proceeding |
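
As a rough illustration of how issue severities could map to these verdicts, consider the sketch below. The thresholds are an assumption: this document does not state the exact rule, and in the skill the verdict comes from the model's reasoning rather than a fixed formula. Here any CRITICAL issue yields REJECTED, any HIGH issue yields CONCERNS, and LOW-only findings are treated as "no significant issues".

```typescript
type Severity = "CRITICAL" | "HIGH" | "LOW";
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

// Hypothetical mapping from the severities found to a verdict.
function deriveVerdict(severities: Severity[]): Verdict {
  if (severities.includes("CRITICAL")) return "REJECTED";
  if (severities.includes("HIGH")) return "CONCERNS";
  return "APPROVED"; // none, or LOW-only (assumed not significant)
}
```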

Validation Modes

Default Mode (ULTRATHINK) - RECOMMENDED

  • Extended thinking ENABLED by default
  • Most thorough validation (~60-90 seconds)
  • Deep multi-dimensional analysis
  • Best for any completed work
  • Cost: ~$0.10-0.25

Quick Mode (--quick)

  • Fast validation (~10-15 seconds)
  • Standard reasoning (no extended thinking)
  • Good for quick sanity checks during development
  • Cost: ~$0.02-0.05

Strict Mode (--strict)

  • Any concern results in REJECTED
  • Use for critical paths, security code, or CI gates
  • Combines with ultrathink by default
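
Strict mode can be pictured as a post-processing step on the verdict. In this sketch, `applyStrictMode` escalates CONCERNS to REJECTED as described above; `exitCodeFor` is an additional assumption about how a CI gate might consume the verdict and is not documented behavior of the skill.

```typescript
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

// With --strict, any concern is treated as a rejection.
function applyStrictMode(verdict: Verdict, strict: boolean): Verdict {
  return strict && verdict === "CONCERNS" ? "REJECTED" : verdict;
}

// Hypothetical CI convention: a non-zero exit code fails the pipeline.
function exitCodeFor(verdict: Verdict): number {
  return verdict === "REJECTED" ? 1 : 0;
}
```

Under these assumptions, a CONCERNS verdict passes a normal run but fails a strict one.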

Examples

Example 1: Default ultrathink validation

User: /sw:judge-llm src/core/auth/login.ts

Claude: [Uses extended thinking for deep analysis]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: ULTRATHINK (extended thinking)
Confidence: 0.91
Files Analyzed: 1

REASONING:
After thorough analysis with extended thinking:

The login implementation demonstrates excellent security practices:
- Password hashing uses bcrypt with cost factor 12 (appropriate)
- Rate limiting implemented correctly (5 attempts, 15 min exponential backoff)
- Input validation prevents SQL injection and XSS
- Error messages are generic (don't leak user existence)
- Session tokens use cryptographically secure random generation
- CSRF protection properly implemented

Edge cases handled:
- Empty input validation ✓
- Unicode normalization for usernames ✓
- Timing attack mitigation via constant-time comparison ✓

No security, performance, or maintainability issues found.

VERDICT: Production-ready implementation with excellent security posture.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Example 2: Validate staged changes

User: /sw:judge-llm --staged

Claude: [Uses extended thinking]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: CONCERNS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: ULTRATHINK (extended thinking)
Confidence: 0.84
Files Analyzed: 3

REASONING:
Extended thinking analysis of staged changes reveals:

Positive aspects:
- New API endpoint follows existing patterns
- TypeScript types are correct
- Error handling present

However, thorough analysis found issues:

🟡 HIGH: Missing Input Validation
   User input passed to database without sanitization
   📍 src/api/users.ts:45
   💡 Add Zod schema validation:
   ```typescript
   import { z } from "zod";

   const schema = z.object({ userId: z.string().uuid() });
   const { userId } = schema.parse(req.body);
   ```

🟡 HIGH: Information Disclosure Risk
   Stack traces exposed in error responses
   📍 src/api/users.ts:62
   💡 Use production error handler that sanitizes output

🟢 LOW: Missing rate limiting
   New endpoint has no rate limiting
   📍 src/api/users.ts:30
   💡 Add rate limiter middleware

VERDICT: Address HIGH issues before merging. LOW can be follow-up.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


Example 3: Quick validation (when needed)

User: /sw:judge-llm src/utils/format.ts --quick

Claude: [Standard reasoning, no extended thinking]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: QUICK (standard reasoning)
Confidence: 0.75
Files Analyzed: 1

REASONING:
Utility formatting functions look correct. No obvious issues.

VERDICT: Looks good for a utility file.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


## Simplest Usage

Just say in your prompt:

"judge-llm my work"
"use judge-llm"
"judge-llm this"


Claude will:
1. Automatically gather context from the conversation
2. Use ULTRATHINK extended thinking by default
3. Apply thorough LLM-as-Judge evaluation

## Difference from /sw:qa

| Aspect | `/sw:qa` | `/sw:judge-llm` |
|--------|-----------------|------------------------|
| **Scope** | Increments only | Any files |
| **Input** | Increment ID | Files, git diff, context |
| **Default Mode** | Standard | **ULTRATHINK** |
| **Pattern** | 7-dimension scoring | Judge LLM reasoning |
| **Focus** | Spec quality, risks | Code correctness |
| **When** | Before increment close | After any work |

## Best Practices

1. **Use by default**: Ultrathink is worth the extra time for quality
2. **Use `--staged`**: Validate before committing
3. **Use `--strict` for critical code**: Payment, auth, security
4. **Fix CRITICAL issues immediately**: Never ignore these
5. **Trust the ultrathink analysis**: Extended thinking catches subtle issues

## Limitations

- ❌ Doesn't execute tests (use test runners)
- ❌ Doesn't auto-apply fixes (only suggests)
- ❌ May miss domain-specific issues
- ❌ Not a replacement for human review

## Related

- `/sw:qa` - Increment-bound quality assessment
- `/sw:validate` - Rule-based increment validation
- `ado-sync-judge` agent - Uses judge pattern for sync validation
