Agent skill

llm-judge

Compare code implementations across 2+ repos using LLM-as-judge methodology with weighted scoring


Install this agent skill to your Project

npx add-skill https://github.com/existential-birds/beagle/tree/main/plugins/beagle-analysis/skills/llm-judge

SKILL.md

LLM Judge

Compare code implementations across multiple repositories using structured evaluation.

Usage

bash

/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| `spec` | Yes | Path to spec/requirements document |
| `repos` | Yes | 2+ paths to repositories to compare |
| `--labels` | No | Comma-separated labels (default: directory names) |
| `--weights` | No | Override weights, e.g. `functionality:40,security:30` |
| `--branch` | No | Branch to compare against main (default: `main`) |

Workflow

1. Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
2. Validate the spec file, each repo path, and the minimum repo count.
3. Read the spec document into memory.
4. Load this skill and the supporting reference files.
5. Spawn one Phase 1 repo agent per repository to gather facts only.
6. Validate the repo-agent JSON results before proceeding.
7. Spawn one Phase 2 judge agent per dimension.
8. Aggregate scores, compute weighted totals, rank repos, and write the report.
9. Display the markdown summary and verify the JSON report.

Command Workflow

Step 1: Parse Arguments

Parse $ARGUMENTS to extract:

spec_path: first positional argument
repo_paths: remaining positional arguments (must be 2+)
labels: from --labels or derived from directory names
weights: from --weights or defaults
branch: from --branch or main

Default Weights:

json

{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}
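The parsing rules in Step 1 can be sketched in Python as follows. This is a minimal sketch, not the skill's own parser: the function names `parse_weights` and `parse_arguments` are hypothetical, and it assumes that `--weights` overrides are merged over the defaults rather than replacing them. The source does not say which, nor whether overridden weights are rescaled to sum to 100.

```python
import os

DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def parse_weights(raw):
    """Parse 'functionality:40,security:30' and merge it over the defaults.

    Assumption: unspecified dimensions keep their default weight.
    """
    weights = dict(DEFAULT_WEIGHTS)
    if not raw:
        return weights
    for pair in raw.split(","):
        name, _, value = pair.partition(":")
        name = name.strip()
        if name not in DEFAULT_WEIGHTS:
            raise ValueError(f"Unknown dimension: {name!r}")
        weights[name] = float(value)
    return weights


def parse_arguments(argv):
    """Split argv into spec_path, repo_paths, labels, weights, and branch."""
    options = {"labels": None, "weights": None, "branch": "main"}
    positional = []
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, _, value = arg[2:].partition("=")
            if key not in options:
                raise ValueError(f"Unknown option: --{key}")
            options[key] = value
        else:
            positional.append(arg)
    if len(positional) < 3:
        raise ValueError("Need a spec path and at least 2 repositories")
    spec_path, repo_paths = positional[0], positional[1:]
    if options["labels"]:
        labels = [label.strip() for label in options["labels"].split(",")]
    else:
        # Default labels are the directory names of the repo paths.
        labels = [os.path.basename(os.path.normpath(p)) for p in repo_paths]
    if len(labels) != len(repo_paths):
        raise ValueError("Number of labels must match number of repositories")
    return {
        "spec_path": spec_path,
        "repo_paths": repo_paths,
        "labels": labels,
        "weights": parse_weights(options["weights"]),
        "branch": options["branch"],
    }
```

Under the merge assumption, `--weights=functionality:40,security:30` yields weights that sum to 115, so a caller may want to either reject or rescale such input before aggregation.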

Step 2: Validate Inputs

bash

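# Error messages go to stderr so they are not mistaken for command output.
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH" >&2; exit 1; }

for repo in "${REPO_PATHS[@]}"; do
  # -e rather than -d: in worktrees and submodules, .git is a file, not a directory.
  [ -e "$repo/.git" ] || { echo "Error: Not a git repository: $repo" >&2; exit 1; }
done

[ "${#REPO_PATHS[@]}" -ge 2 ] || { echo "Error: Need at least 2 repositories to compare" >&2; exit 1; }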

Step 3: Read Spec Document

bash

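SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH" >&2; exit 1; }
# `if` avoids the non-zero exit status that `[ -z ... ] && { ... }` leaves behind when the spec is non-empty.
if [ -z "$SPEC_CONTENT" ]; then echo "Error: Spec file is empty: $SPEC_PATH" >&2; exit 1; fi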

Step 4: Load the Skill

Load the llm-judge skill: Skill(skill: "beagle-analysis:llm-judge")

Step 5: Phase 1 - Spawn Repo Agents

Spawn one Task per repo:

text

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $LABEL at $REPO_PATH

**Spec Document:**
$SPEC_CONTENT

**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: "beagle-core:llm-artifacts-detection") for analysis

Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.

Do NOT score or judge. Only gather facts.

Collect all repo outputs into ALL_FACTS.

Step 6: Validate Phase 1 Results

bash

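# printf passes the payload through unchanged; echo may interpret backslashes or leading dashes.
printf '%s' "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL" >&2; exit 1; }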
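The same check can be run over every repo agent's output at once. The sketch below assumes the raw outputs are held in a dict keyed by repo label. The helper name `validate_facts` is hypothetical, and the sketch only confirms that each output parses to a JSON object. It does not check the fields defined in references/fact-schema.md, which are not reproduced here.

```python
import json


def validate_facts(raw_outputs):
    """Parse each repo agent's raw output; raise if any is not a JSON object."""
    all_facts = {}
    for label, raw in raw_outputs.items():
        try:
            facts = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON from {label}: {exc}") from exc
        if not isinstance(facts, dict):
            raise ValueError(f"Expected a JSON object from {label}")
        all_facts[label] = facts
    return all_facts
```

The returned dict can serve as ALL_FACTS for Phase 2, since it only contains outputs that parsed cleanly.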

Step 7: Phase 2 - Spawn Judge Agents

Spawn five judge agents, one per dimension:

text

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric

Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.

Step 8: Aggregate Scores

python

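# labels, weights, and judge_outputs come from the earlier steps.
dimensions = list(weights)  # assumes every scored dimension has a weight
scores = {}

for repo_label in labels:
    # Each entry is the judge's object for this repo, including its 'score'.
    scores[repo_label] = {
        dimension: judge_outputs[dimension]['scores'][repo_label]
        for dimension in dimensions
    }

    # Weights are percentages; dividing by 100 keeps the total on the
    # 1-5 scale when the weights sum to 100.
    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)

# Highest weighted total first.
ranking = sorted(labels, key=lambda label: scores[label]['weighted_total'], reverse=True)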
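A worked example may make the arithmetic concrete. The scores below are invented for illustration, and the `normalize` option is an assumption of this sketch rather than something the skill defines: it shows one way to keep totals on the 1-5 scale when overridden weights no longer sum to 100.

```python
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def weighted_total(dimension_scores, weights, normalize=False):
    """Weighted sum of 1-5 scores, with weights given as percentages.

    With normalize=True (an assumption, not part of the skill), the sum is
    divided by the actual weight total instead of 100.
    """
    divisor = sum(weights.values()) if normalize else 100
    raw = sum(dimension_scores[dim] * weights[dim] for dim in weights)
    return round(raw / divisor, 2)


# Hypothetical scores for a single repo.
example_scores = {
    "functionality": 4,
    "security": 3,
    "tests": 5,
    "overengineering": 4,
    "dead_code": 2,
}

# 4*0.30 + 3*0.25 + 5*0.20 + 4*0.15 + 2*0.10 = 3.75
print(weighted_total(example_scores, DEFAULT_WEIGHTS))

# Overrides merged over the defaults sum to 115, which inflates the total.
overridden = {**DEFAULT_WEIGHTS, "functionality": 40, "security": 30}
print(weighted_total(example_scores, overridden))                  # 4.3
print(weighted_total(example_scores, overridden, normalize=True))  # 3.74
```

Without normalization, a repo scoring 5 on every dimension under the 115-point weights would total 5.75, which is off the scoring scale.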

Step 9: Generate Verdict

Name the winner, explain why they won, and note any close calls or trade-offs.

Step 10: Write JSON Report

bash

mkdir -p .beagle

Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
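One possible shape for the report writer is sketched below. The top-level keys follow the list in the sentence above, but their exact names, the `"1.0"` version string, and the ISO 8601 timestamp format are assumptions, as is the helper name `write_report`.

```python
import json
import os
from datetime import datetime, timezone


def write_report(repos, weights, scores, ranking, verdict,
                 path=".beagle/llm-judge-report.json"):
    """Write the evaluation report as JSON and return the report dict."""
    report = {
        "version": "1.0",  # assumed version string
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos": repos,  # assumed shape: list of {"label": ..., "path": ...}
        "weights": weights,
        "scores": scores,
        "ranking": ranking,
        "verdict": verdict,
    }
    directory = os.path.dirname(path)
    if directory:
        # Equivalent to `mkdir -p`.
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
    return report
```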

Step 11: Display Summary

Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
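A minimal rendering sketch follows, assuming the `scores` structure built in Step 8. The function name `render_summary` and the exact layout are hypothetical, and the detailed justifications are left out for brevity.

```python
def render_summary(scores, ranking, dimensions, verdict):
    """Build a markdown summary: scores table, ranking, and verdict."""
    header = "| Repo | " + " | ".join(dimensions) + " | Weighted Total |"
    divider = "|" + " --- |" * (len(dimensions) + 2)
    rows = []
    for label in ranking:
        cells = [str(scores[label][dim]["score"]) for dim in dimensions]
        total = f"{scores[label]['weighted_total']:.2f}"
        rows.append("| " + " | ".join([label, *cells, total]) + " |")
    ranking_lines = [f"{i}. {label}" for i, label in enumerate(ranking, start=1)]
    return "\n".join(
        [header, divider, *rows, "", "Ranking:", *ranking_lines, "", f"Verdict: {verdict}"]
    )
```

Rows are emitted in ranking order, so the table and the ranking list always agree.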

Step 12: Verification

bash

python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"

Output Shape

The generated report should include:

repo labels and paths
per-dimension scores and justifications
weighted totals and ranking
a verdict explaining the winner

Reference Files

| File | Purpose |
| --- | --- |
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |

Scoring Model

| Dimension | Default Weight | Evaluates |
| --- | --- | --- |
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |

Scoring Scale

| Score | Meaning |
| --- | --- |
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
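A judge's output can be checked against this scale before aggregation. The sketch below assumes the `{"scores": {label: {"score": ...}}}` shape used in the aggregation step. The `justification` key name, the integer-only restriction, and the helper name `validate_judge_output` are assumptions, since the judge output schema itself is defined in references/judge-agents.md and not shown here.

```python
def validate_judge_output(output, labels):
    """Return a list of problems found in one judge's output (empty if none)."""
    problems = []
    for label in labels:
        entry = output.get("scores", {}).get(label)
        if entry is None:
            problems.append(f"{label}: no score")
            continue
        score = entry.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            problems.append(f"{label}: score {score!r} is not an integer from 1 to 5")
        if not str(entry.get("justification", "")).strip():
            problems.append(f"{label}: missing justification")
    return problems
```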

Phase 1: Spawning Repo Agents

For each repository, spawn a Task agent with:

text

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.

Collect all repo-agent outputs into ALL_FACTS.

Phase 2: Spawning Judge Agents

After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:

text

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.

Aggregation

Collect the five judge outputs.
Compute each repo's weighted total with the configured weights.
Rank repos by weighted total in descending order.
Generate a verdict that explains the result and any close calls.
Write .beagle/llm-judge-report.json.

Output

Display a markdown summary with scores, ranking, verdict, and detailed justifications.

Verification

Before completing:

Verify .beagle/llm-judge-report.json exists and is valid JSON.
Verify all repos have scores for all dimensions.
Verify each weighted total equals the weighted sum of that repo's dimension scores.
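These checks could be automated along the following lines. The sketch assumes the report stores `weights`, `ranking`, and `scores` under those key names, that `ranking` lists every repo, and that totals were computed by dividing by 100 as in the aggregation step. The helper name `verify_report` is hypothetical.

```python
import json
import math


def verify_report(path, dimensions):
    """Load the report and return a list of problems (empty if it passes)."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)  # raises if the file is missing or not valid JSON

    problems = []
    weights = report["weights"]
    for label in report["ranking"]:
        repo_scores = report["scores"].get(label, {})
        missing = [dim for dim in dimensions if dim not in repo_scores]
        if missing:
            problems.append(f"{label}: missing dimensions {missing}")
            continue
        # Recompute the total the same way the aggregation step does.
        expected = round(
            sum(repo_scores[dim]["score"] * weights[dim] / 100 for dim in dimensions), 2
        )
        actual = repo_scores.get("weighted_total", float("nan"))
        if not math.isclose(actual, expected, abs_tol=0.005):
            problems.append(f"{label}: weighted_total {actual} != expected {expected}")
    return problems
```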

Rules

Always validate inputs before proceeding
Spawn Phase 1 agents in parallel, then wait before Phase 2
Spawn Phase 2 agents in parallel, one per dimension
Every score must have a justification
Write the JSON report before displaying the summary
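The ordering rules above (parallel within a phase, a barrier between phases) can be illustrated with a thread pool. This is only an analogy for the control flow: the skill spawns Task agents rather than threads, and `run_two_phase`, `gather_facts`, and `judge_dimension` are hypothetical stand-ins for that machinery.

```python
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ["functionality", "security", "tests", "overengineering", "dead_code"]


def run_two_phase(labels, gather_facts, judge_dimension):
    """Run Phase 1 per repo in parallel, wait, then Phase 2 per dimension in parallel."""
    with ThreadPoolExecutor() as pool:
        # Phase 1: one fact-gathering task per repo, all submitted at once.
        fact_futures = {label: pool.submit(gather_facts, label) for label in labels}
        # Calling result() on every future blocks until all of Phase 1 is done.
        all_facts = {label: future.result() for label, future in fact_futures.items()}

        # Phase 2: one judge per dimension, each seeing the facts from all repos.
        judge_futures = {
            dim: pool.submit(judge_dimension, dim, all_facts) for dim in DIMENSIONS
        }
        judge_outputs = {dim: future.result() for dim, future in judge_futures.items()}
    return all_facts, judge_outputs
```

Because every judge receives the complete `all_facts`, no Phase 2 task can start from partial Phase 1 results.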

Maintainer

existential-birds Core maintainer

Source details

Full Name: existential-birds/beagle
Branch: main
Path in repo: plugins/beagle-analysis/skills/llm-judge
License: Apache License 2.0
Topics: claude-code agent-skills typescript ai-agents developer-tools python react golang code-review rust langgraph swift elixir documentation swiftui fastapi axum phoenix pydantic-ai tokio


