Agent skill

eval

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

Stars 8,805
Forks 1,070

Install this agent skill to your Project

npx add-skill https://github.com/alirezarezvani/claude-skills/tree/main/engineering/agenthub/skills/eval

SKILL.md

/hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

Usage

/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)

What It Does

Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}

Output:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)

LLM Judge Mode (no eval command, or --judge flag)

For each agent:

  1. Get the diff: git diff {base_branch}...{agent_branch}
  2. Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
  3. Compare all diffs and rank by:
    • Correctness — Does it solve the task?
    • Simplicity — Fewer lines changed is better (when equal correctness)
    • Quality — Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)

Hybrid Mode

  1. Run metric evaluation first
  2. If top agents are within 10% of each other, use LLM judge to break ties
  3. Present both metric and qualitative rankings

After Eval

  1. Update session state:
bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
  1. Tell the user:
    • Ranked results with winner highlighted
    • Next step: /hub:merge to merge the winner
    • Or /hub:merge {session-id} --agent {winner} to be explicit

Expand your agent's capabilities with these related and highly-rated skills.

alirezarezvani/claude-skills

business-growth-skills

4 business growth agent skills and plugins for Claude Code, Codex, Gemini CLI, Cursor, OpenClaw. Customer success (health scoring, churn), sales engineer (RFP), revenue operations (pipeline, GTM), contract & proposal writer. Python tools (stdlib-only).

8,805 1,070
Explore
alirezarezvani/claude-skills

contract-and-proposal-writer

Contract & Proposal Writer

8,805 1,070
Explore
alirezarezvani/claude-skills

sales-engineer

Analyzes RFP/RFI responses for coverage gaps, builds competitive feature comparison matrices, and plans proof-of-concept (POC) engagements for pre-sales engineering. Use when responding to RFPs, bids, or proposal requests; comparing product features against competitors; planning or scoring a customer POC or sales demo; preparing a technical proposal; or performing win/loss competitor analysis. Handles tasks described as 'RFP response', 'bid response', 'proposal response', 'competitor comparison', 'feature matrix', 'POC planning', 'sales demo prep', or 'pre-sales engineering'.

8,805 1,070
Explore
alirezarezvani/claude-skills

customer-success-manager

Monitors customer health, predicts churn risk, and identifies expansion opportunities using weighted scoring models for SaaS customer success. Use when analyzing customer accounts, reviewing retention metrics, scoring at-risk customers, or when the user mentions churn, customer health scores, upsell opportunities, expansion revenue, retention analysis, or customer analytics. Runs three Python CLI tools to produce deterministic health scores, churn risk tiers, and prioritized expansion recommendations across Enterprise, Mid-Market, and SMB segments.

8,805 1,070
Explore
alirezarezvani/claude-skills

revenue-operations

Analyzes sales pipeline health, revenue forecasting accuracy, and go-to-market efficiency metrics for SaaS revenue optimization. Use when analyzing sales pipeline coverage, forecasting revenue, evaluating go-to-market performance, reviewing sales metrics, assessing pipeline analysis, tracking forecast accuracy with MAPE, calculating GTM efficiency, or measuring sales efficiency and unit economics for SaaS teams.

8,805 1,070
Explore
alirezarezvani/claude-skills

marketing-skills

42 marketing agent skills and plugins for Claude Code, Codex, Gemini CLI, Cursor, OpenClaw, and 6 more coding agents. 7 pods: content, SEO, CRO, channels, growth, intelligence, sales. Foundation context + orchestration router. 27 Python tools (stdlib-only).

8,805 1,070
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results