Agent skill
llm-judge
Compare code implementations across 2+ repos using LLM-as-judge methodology with weighted scoring
Install this agent skill to your Project
npx add-skill https://github.com/existential-birds/beagle/tree/main/plugins/beagle-analysis/skills/llm-judge
SKILL.md
LLM Judge
Compare code implementations across multiple repositories using structured evaluation.
Usage
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
Arguments
| Argument | Required | Description |
|---|---|---|
| spec | Yes | Path to spec/requirements document |
| repos | Yes | 2+ paths to repositories to compare |
| --labels | No | Comma-separated labels (default: directory names) |
| --weights | No | Override weights, e.g. functionality:40,security:30 |
| --branch | No | Branch to compare against main (default: main) |
Workflow
- Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
- Validate the spec file, each repo path, and the minimum repo count.
- Read the spec document into memory.
- Load this skill and the supporting reference files.
- Spawn one Phase 1 repo agent per repository to gather facts only.
- Validate the repo-agent JSON results before proceeding.
- Spawn one Phase 2 judge agent per dimension.
- Aggregate scores, compute weighted totals, rank repos, and write the report.
- Display the markdown summary and verify the JSON report.
Command Workflow
Step 1: Parse Arguments
Parse $ARGUMENTS to extract:
- spec_path: first positional argument
- repo_paths: remaining positional arguments (must be 2+)
- labels: from --labels or derived from directory names
- weights: from --weights or defaults
- branch: from --branch or main
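The label fallback can be sketched in Python as follows. This is a minimal illustration, not the skill's own code: the helper name `derive_labels` is hypothetical, and the rule that an explicit `--labels` value must name every repo is an assumption the skill text does not state.

```python
from pathlib import Path


def derive_labels(repo_paths, labels_arg=None):
    """Return one label per repo (hypothetical helper, not part of the skill)."""
    if labels_arg:
        labels = [part.strip() for part in labels_arg.split(",")]
        # Assumption: an explicit --labels value must name every repo.
        if len(labels) != len(repo_paths):
            raise ValueError("expected one label per repository")
        return labels
    # Fallback: use the last path component of each repo path.
    return [Path(p).name for p in repo_paths]
```

Calling `derive_labels(["/work/impl-a", "/work/impl-b/"])` falls back to the directory names, while passing `"claude, gpt"` as the second argument uses those labels instead.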
Default Weights:
{
"functionality": 30,
"security": 25,
"tests": 20,
"overengineering": 15,
"dead_code": 10
}
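One possible way to apply a `--weights` override on top of these defaults is sketched below. The helper `parse_weights` is hypothetical, and two behaviours are assumptions: unknown dimension names are rejected, and dimensions not mentioned in the override keep their default weight.

```python
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def parse_weights(weights_arg=None):
    """Merge a --weights override string into the defaults (hypothetical helper)."""
    weights = dict(DEFAULT_WEIGHTS)
    if not weights_arg:
        return weights
    for pair in weights_arg.split(","):
        name, _, value = pair.partition(":")
        name = name.strip()
        # Assumption: only the five known dimensions may be overridden.
        if name not in DEFAULT_WEIGHTS:
            raise ValueError(f"unknown dimension: {name}")
        weights[name] = float(value)
    return weights
```

With the override from the arguments table, `parse_weights("functionality:40,security:30")` yields weights that sum to 115 rather than 100. The skill text does not say whether such input should be renormalized or rejected, so this sketch leaves it untouched.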
Step 2: Validate Inputs
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }
for repo in "${REPO_PATHS[@]}"; do
[ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done
[ ${#REPO_PATHS[@]} -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }
Step 3: Read Spec Document
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
[ -z "$SPEC_CONTENT" ] && { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }
Step 4: Load the Skill
Load the llm-judge skill: Skill(skill: "beagle-analysis:llm-judge")
Step 5: Phase 1 - Spawn Repo Agents
Spawn one Task per repo:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: "beagle-core:llm-artifacts-detection") for analysis
Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.
Do NOT score or judge. Only gather facts.
Collect all repo outputs into ALL_FACTS.
Step 6: Validate Phase 1 Results
echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }
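The same check can be applied to every repo at once while building ALL_FACTS. The sketch below assumes the raw agent outputs are available as strings keyed by label, and that the fact schema's top level is a JSON object; `validate_facts` is a hypothetical name.

```python
import json


def validate_facts(raw_outputs):
    """Parse each repo agent's raw output; raise on the first invalid one.

    raw_outputs maps label -> raw string returned by that repo agent.
    """
    all_facts = {}
    for label, raw in raw_outputs.items():
        try:
            facts = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON from {label}: {exc}") from exc
        # Assumption: the fact schema's top level is a JSON object.
        if not isinstance(facts, dict):
            raise ValueError(f"Expected a JSON object from {label}")
        all_facts[label] = facts
    return all_facts
```

Checking field-level conformance against references/fact-schema.md is out of scope here, since the schema itself is not reproduced in this document.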
Step 7: Phase 2 - Spawn Judge Agents
Spawn five judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric
Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
Step 8: Aggregate Scores
scores = {}
for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        # Each judge output holds a per-repo entry that includes a 'score'
        scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]
    # Weights are percentages, so divide by 100
    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)

# Highest weighted total first
ranking = sorted(labels, key=lambda l: scores[l]['weighted_total'], reverse=True)
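To make the arithmetic concrete, here is a small worked example using the default weights. The repo labels and scores are made up for illustration, and the scores are plain numbers rather than the nested judge output used above.

```python
weights = {"functionality": 30, "security": 25, "tests": 20,
           "overengineering": 15, "dead_code": 10}

# Hypothetical judge scores for two repos (illustrative values only).
example_scores = {
    "repo-a": {"functionality": 4, "security": 3, "tests": 5,
               "overengineering": 4, "dead_code": 2},
    "repo-b": {"functionality": 5, "security": 4, "tests": 3,
               "overengineering": 3, "dead_code": 4},
}


def weighted_total(dimension_scores, weights):
    """Sum score * weight / 100 over all dimensions, rounded to 2 decimals."""
    return round(sum(dimension_scores[d] * weights[d] / 100 for d in weights), 2)


totals = {label: weighted_total(s, weights) for label, s in example_scores.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)
print(ranking)
```

For repo-a the terms are 1.2 + 0.75 + 1.0 + 0.6 + 0.2 = 3.75, and for repo-b they are 1.5 + 1.0 + 0.6 + 0.45 + 0.4 = 3.95, so repo-b ranks first. Because the default weights sum to 100, each total stays on the same 1 to 5 scale as the individual scores.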
Step 9: Generate Verdict
Name the winner, explain why they won, and note any close calls or trade-offs.
Step 10: Write JSON Report
mkdir -p .beagle
Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
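A minimal sketch of this step in Python follows. The skill lists which pieces of information the report contains but not the exact key names, so every key below, the `"1.0"` version string, and the `write_report` helper itself are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_report(repos, weights, scores, ranking, verdict, out_dir=".beagle"):
    """Write the report JSON and return its path (field names are assumptions)."""
    report = {
        "version": "1.0",  # placeholder version string
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos": repos,  # e.g. [{"label": ..., "path": ...}]
        "weights": weights,
        "scores": scores,
        "ranking": ranking,
        "verdict": verdict,
    }
    out_path = Path(out_dir) / "llm-judge-report.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    out_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return out_path
```

Returning the path lets the later verification step reload the same file without hard-coding its location.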
Step 11: Display Summary
Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
Step 12: Verification
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"
Output Shape
The generated report should include:
- repo labels and paths
- per-dimension scores and justifications
- weighted totals and ranking
- a verdict explaining the winner
Reference Files
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Model
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
Collect all repo-agent outputs into ALL_FACTS.
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
Aggregation
- Collect the five judge outputs.
- Compute each repo's weighted total with the configured weights.
- Rank repos by weighted total in descending order.
- Generate a verdict that explains the result and any close calls.
- Write .beagle/llm-judge-report.json.
Output
Display a markdown summary with scores, ranking, verdict, and detailed justifications.
Verification
Before completing:
- Verify .beagle/llm-judge-report.json exists and is valid JSON.
- Verify all repos have scores for all dimensions.
- Verify each weighted total equals the sum of score × weight / 100 across all dimensions.
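These checks could be automated along the following lines. The sketch assumes the report stores `weights` and `scores` at the top level, with `scores` laid out as in the Step 8 aggregation code (a `score` entry per dimension plus a `weighted_total` per repo); `verify_report` and the tolerance value are hypothetical.

```python
import json
import math


def verify_report(path, dimensions):
    """Return a list of problems found in the report (empty list means OK)."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)  # raises if the file is missing or not valid JSON
    problems = []
    weights = report["weights"]
    for label, repo_scores in report["scores"].items():
        missing = [d for d in dimensions if d not in repo_scores]
        if missing:
            problems.append(f"{label}: missing scores for {missing}")
            continue
        total = repo_scores.get("weighted_total")
        if total is None:
            problems.append(f"{label}: missing weighted_total")
            continue
        expected = sum(repo_scores[d]["score"] * weights[d] / 100 for d in dimensions)
        # Tolerance covers the rounding to two decimals applied during aggregation.
        if not math.isclose(total, expected, abs_tol=0.01):
            problems.append(f"{label}: weighted_total does not match its scores")
    return problems
```

An empty return value means all three checks passed; any other result lists what to fix before completing.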
Rules
- Always validate inputs before proceeding
- Spawn Phase 1 agents in parallel, then wait before Phase 2
- Spawn Phase 2 agents in parallel, one per dimension
- Every score must have a justification
- Write the JSON report before displaying the summary
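The ordering constraint in the two parallelism rules can be illustrated with a thread pool. The skill spawns agents through Task calls rather than Python threads, so this is only an analogy: `run_repo_agent` and `run_judge` are hypothetical stand-ins for those agents.

```python
from concurrent.futures import ThreadPoolExecutor


def run_two_phases(repos, dimensions, run_repo_agent, run_judge):
    """Run Phase 1 fully, then Phase 2 (both callables are hypothetical stand-ins)."""
    with ThreadPoolExecutor() as pool:
        # Phase 1: one fact-gathering task per repo, all in parallel.
        fact_futures = {label: pool.submit(run_repo_agent, label, path)
                        for label, path in repos.items()}
        # Barrier: .result() blocks until every repo agent has finished.
        all_facts = {label: f.result() for label, f in fact_futures.items()}

        # Phase 2: one judge per dimension, each seeing the facts for all repos.
        judge_futures = {dim: pool.submit(run_judge, dim, all_facts)
                         for dim in dimensions}
        return {dim: f.result() for dim, f in judge_futures.items()}
```

The point of the barrier is that no judge starts with partial facts: every judge receives the complete set gathered in Phase 1.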