Agent skill
agent-evaluation
Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, 8-step roadmap, and production integration.
Install this agent skill to your Project
npx add-skill https://github.com/autohandai/community-skills/tree/main/agent-evaluation
Metadata
Additional technical details for this skill
- tags: agent-evaluation, evals, AI-agents, benchmarks, graders, testing, quality-assurance
- platforms: Claude, ChatGPT, Gemini
SKILL.md
Agent Evaluation (AI Agent Evals)
Based on Anthropic's "Demystifying evals for AI agents"
When to use this skill
- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time
Core Concepts
Eval Evolution: Single-turn → Multi-turn → Agentic
| Type | Turns | State | Grading | Complexity |
|---|---|---|---|---|
| Single-turn | 1 | None | Simple | Low |
| Multi-turn | N | Conversation | Per-turn | Medium |
| Agentic | N | World + History | Outcome | High |
7 Key Terms
| Term | Definition |
|---|---|
| Task | Single test case (prompt + expected outcome) |
| Trial | One agent run on a task |
| Grader | Scoring function (code/model/human) |
| Transcript | Full record of agent actions |
| Outcome | Final state for grading |
| Harness | Infrastructure running evals |
| Suite | Collection of related tasks |
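The terms above can be sketched as a small data model. This is a minimal illustration, not the structure used by any particular harness: the class fields, `exact_match_grader`, `run_harness`, and the agent callable's `(outcome, transcript)` return shape are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Task:
    """Single test case: a prompt plus the outcome it should produce."""
    id: str
    prompt: str
    expected_outcome: Any


@dataclass
class Trial:
    """One agent run on a task, with its transcript and final outcome."""
    task: Task
    transcript: list = field(default_factory=list)
    outcome: Any = None


@dataclass
class Suite:
    """Collection of related tasks."""
    name: str
    tasks: list = field(default_factory=list)


# A grader maps a finished trial to a score between 0.0 and 1.0.
Grader = Callable[[Trial], float]


def exact_match_grader(trial: Trial) -> float:
    """Hypothetical code-based grader: full credit only on an exact match."""
    return 1.0 if trial.outcome == trial.task.expected_outcome else 0.0


def run_harness(suite: Suite, agent: Callable, grader: Grader) -> dict:
    """Hypothetical harness: run each task once and grade the resulting trial.

    `agent` is assumed to take a prompt and return (outcome, transcript).
    """
    scores = {}
    for task in suite.tasks:
        outcome, transcript = agent(task.prompt)
        trial = Trial(task=task, transcript=transcript, outcome=outcome)
        scores[task.id] = grader(trial)
    return scores
```

Keeping the grader as a plain callable means code-based, model-based, and human graders can share one harness loop.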
Instructions
Step 1: Understand Grader Types
Code-based Graders (Recommended for Coding Agents)
- Pros: Fast, objective, reproducible
- Cons: Requires clear success criteria
- Best for: Coding agents, structured outputs
# Example: Code-based grader
import subprocess

def grade_task(outcome: dict) -> float:
    """Grade a coding task by the fraction of tests passed."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 0)
    # No recorded tests counts as a failure instead of dividing by zero.
    if total_tests <= 0:
        return 0.0
    return tests_passed / total_tests

# SWE-bench style grader
def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run the task's tests and report whether the patch resolves the issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True,
    )
    # pytest exits with 0 only when all collected tests pass.
    return result.returncode == 0
Model-based Graders (LLM-as-Judge)
- Pros: Flexible, handles nuance
- Cons: Requires calibration, can be inconsistent
- Best for: Conversational agents, open-ended tasks
# Example: LLM Rubric for Customer Support Agent
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive
    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution
    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
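The rubric's weights can be combined into a single score. The sketch below is one possible aggregation, assuming the judge returns an integer from 1 to 5 per dimension; `weighted_rubric_score` and the normalization to the 0-1 range are illustrative choices, not something the rubric itself prescribes.

```python
# Weights copied from the rubric above.
RUBRIC_WEIGHTS = {"empathy": 0.3, "resolution": 0.5, "efficiency": 0.2}


def weighted_rubric_score(scores, weights=RUBRIC_WEIGHTS, scale_min=1, scale_max=5):
    """Combine per-dimension judge scores into one value in [0.0, 1.0]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = 0.0
    for name, weight in weights.items():
        score = scores[name]
        if not scale_min <= score <= scale_max:
            raise ValueError(f"{name} score {score} is outside the scale")
        # Rescale 1-5 to 0-1 so the weighted sum is easy to threshold.
        total += weight * (score - scale_min) / (scale_max - scale_min)
    # Divide by the weight sum so the result stays in [0, 1] even if weights change.
    return total / sum(weights.values())


print(weighted_rubric_score({"empathy": 3, "resolution": 5, "efficiency": 3}))
```

Rejecting missing or out-of-range scores early keeps a malformed judge response from silently skewing the aggregate.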
Human Graders
- Pros: Highest accuracy, catches edge cases
- Cons: Expensive, slow, not scalable
- Best for: Final validation, ambiguous cases
Step 2: Choose Strategy by Agent Type
2.1 Coding Agents
Benchmarks:
- SWE-bench Verified: Real GitHub issues (top reported scores have risen from about 40% to above 80%)
- Terminal-Bench: Complex terminal tasks
- Custom test suites with your codebase
Grading Strategy:
# run_test_suite, run_linter, check_build, and compare_to_reference are
# placeholders for your own tooling.
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"]),
    }
Key Metrics:
- Test passage rate
- Build success
- Lint/style compliance
- Diff size (smaller is better)
2.2 Conversational Agents
Benchmarks:
- τ2-Bench: Multi-domain conversation
- Custom domain-specific suites
Grading Strategy (Multi-dimensional):
success_criteria:
  - empathy_score: ">= 4.0"
  - resolution_rate: ">= 0.9"
  - avg_turns: "<= 5"
  - escalation_rate: "<= 0.1"
Key Metrics:
- Task resolution rate
- Customer satisfaction proxy
- Turn efficiency
- Escalation rate
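The success criteria above can be checked mechanically once the metrics are aggregated. This is a minimal sketch: the `CRITERIA` table mirrors the thresholds listed earlier, while `check_criteria` and the shape of the `metrics` dict are assumptions for illustration.

```python
import operator

# Thresholds taken from the success criteria above: (comparison, limit).
CRITERIA = {
    "empathy_score": (">=", 4.0),
    "resolution_rate": (">=", 0.9),
    "avg_turns": ("<=", 5),
    "escalation_rate": ("<=", 0.1),
}

_OPS = {">=": operator.ge, "<=": operator.le}


def check_criteria(metrics, criteria=CRITERIA):
    """Return a pass/fail flag per criterion for one set of aggregated metrics."""
    return {
        name: _OPS[op](metrics[name], limit)
        for name, (op, limit) in criteria.items()
    }


report = check_criteria(
    {"empathy_score": 4.2, "resolution_rate": 0.85, "avg_turns": 4, "escalation_rate": 0.05}
)
print(report)
```

Reporting each criterion separately shows which dimension regressed instead of collapsing everything into one pass/fail bit.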
2.3 Research Agents
Grading Dimensions:
- Grounding: Claims backed by sources
- Coverage: All aspects addressed
- Source Quality: Authoritative sources used
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"]),
    }
2.4 Computer Use Agents
Benchmarks:
- WebArena: Web navigation tasks
- OSWorld: Desktop environment tasks
Grading Strategy:
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome),
    }
Step 3: Follow the 8-Step Roadmap (a Step 0 starting point plus Steps 1-8)
Step 0: Start Early (20-50 Tasks)
# Create initial eval suite structure
mkdir -p evals/{tasks,results,graders}
# Start with representative tasks
# - Common use cases (60%)
# - Edge cases (20%)
# - Failure modes (20%)
Step 1: Convert Manual Tests
# Transform existing QA tests into eval tasks
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", []),
    }
Step 2: Ensure Clarity + Reference Solutions
# Good task definition
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
Step 3: Balance Positive/Negative Cases
# Ensure eval suite balance
suite_composition = {
    "positive_cases": 0.5,  # Should succeed
    "negative_cases": 0.3,  # Should fail gracefully
    "edge_cases": 0.2,      # Boundary conditions
}
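A suite can be checked against this target mix automatically. The sketch below assumes each task dict carries a `category` field using the same three names; that field, the `composition_gaps` helper, and the 5-point tolerance are illustrative assumptions.

```python
from collections import Counter

# Target shares from the composition above.
TARGET = {"positive_cases": 0.5, "negative_cases": 0.3, "edge_cases": 0.2}


def composition_gaps(tasks, target=TARGET, tolerance=0.05):
    """Return categories whose share of the suite is off target by more than tolerance.

    Values are (actual share - target share), so negative means under-represented.
    """
    counts = Counter(task["category"] for task in tasks)
    total = len(tasks)
    gaps = {}
    for category, share in target.items():
        actual = counts.get(category, 0) / total if total else 0.0
        if abs(actual - share) > tolerance:
            gaps[category] = round(actual - share, 3)
    return gaps


skewed = (
    [{"category": "positive_cases"}] * 8
    + [{"category": "negative_cases"}]
    + [{"category": "edge_cases"}]
)
print(composition_gaps(skewed))
```

An empty result means the suite is within tolerance; running the check whenever tasks are added keeps the mix from drifting toward easy positive cases.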
Step 4: Isolate Environments
# Docker-based isolation for coding evals
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
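One way to apply a config like this is to translate it into a `docker run` command line. The sketch below is an assumption about how the fields could map onto Docker flags (`network: isolated` to `--network none`, `cleanup: always` to `--rm`); the config schema is the skill's own, and `build_docker_command` and `parse_timeout` are hypothetical helpers. It only builds the argument list and does not start a container.

```python
def build_docker_command(config, command):
    """Translate the eval_environment config into a `docker run` argument list."""
    argv = ["docker", "run"]
    if config.get("cleanup") == "always":
        argv.append("--rm")  # remove the container when it exits
    if config.get("network") == "isolated":
        argv += ["--network", "none"]  # assumed mapping: no network access
    resources = config.get("resources", {})
    if "memory" in resources:
        argv += ["--memory", resources["memory"]]
    if "cpu" in resources:
        argv += ["--cpus", resources["cpu"]]
    argv.append(config["image"])
    return argv + list(command)


def parse_timeout(value):
    """Convert a value such as '300s' into seconds for subprocess timeouts."""
    return float(value.rstrip("s"))


config = {
    "type": "docker",
    "image": "eval-sandbox:latest",
    "timeout": "300s",
    "resources": {"memory": "4g", "cpu": "2"},
    "network": "isolated",
    "cleanup": "always",
}
argv = build_docker_command(config, ["pytest", "-q"])
print(" ".join(argv))
# To execute: subprocess.run(argv, timeout=parse_timeout(config["timeout"]))
```

Passing the timeout to `subprocess.run` enforces it from the host side, so a hung trial cannot stall the whole suite.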
Step 5: Focus on Outcomes, Not Paths
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
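The outcome-focused grader leaves `compare_final_states` undefined. One possible sketch, assuming final states are flat dicts of state keys to values, scores the fraction of expected keys that match and ignores anything extra the agent produced along the way.

```python
_MISSING = object()  # sentinel so an expected value of None can still be matched


def compare_final_states(expected, actual):
    """Fraction of expected state entries that the actual final state reproduces.

    Extra keys in `actual` are ignored: the grader checks where the agent ended
    up, not how it got there or what else it touched.
    """
    if not expected:
        return 1.0
    matched = sum(
        1 for key, value in expected.items() if actual.get(key, _MISSING) == value
    )
    return matched / len(expected)


expected = {"order_status": "refunded", "ticket_open": False}
actual = {"order_status": "refunded", "ticket_open": True, "notes": "called customer"}
print(compare_final_states(expected, actual))
```

Two agents that reach the same final state through different tool sequences receive the same score, which is the property a path-matching grader lacks.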
Step 6: Always Read Transcripts
# Transcript analysis for debugging
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript),
    }
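Two of the helpers used above can be sketched concretely. The transcript schema here is an assumption: each step is a dict with a `type` field, tool calls carry a `tool` name, and failed steps carry an `error` string. Real harnesses log different shapes, so treat the field names as placeholders.

```python
from collections import Counter


def count_tool_calls(transcript):
    """Count how often each tool was invoked (assumes steps of type 'tool_call')."""
    return dict(
        Counter(step["tool"] for step in transcript if step.get("type") == "tool_call")
    )


def extract_errors(transcript):
    """Collect every step that recorded an error, along with its position."""
    return [
        {"step": index, "error": step["error"]}
        for index, step in enumerate(transcript)
        if step.get("error")
    ]


transcript = [
    {"type": "tool_call", "tool": "search"},
    {"type": "tool_call", "tool": "read_file", "error": "file not found"},
    {"type": "message", "content": "Retrying with a different path"},
    {"type": "tool_call", "tool": "read_file"},
]
print(count_tool_calls(transcript))
print(extract_errors(transcript))
```

Summaries like these help pick which transcripts to read in full; they do not replace reading them.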
Step 7: Monitor Eval Saturation
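# Detect when evals are no longer useful
from statistics import pvariance

def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    if not recent:
        raise ValueError("results must contain at least one entry")
    passes = [1.0 if r["passed"] else 0.0 for r in recent]
    # Saturated means every trial in the recent window passed.
    saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(passes) / len(passes),
        "variance": pvariance(passes),  # population variance of pass/fail
        "is_saturated": saturated,
        "recommendation": "Add harder tasks" if saturated else "Continue",
    }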
Step 8: Long-term Maintenance
# Eval suite maintenance checklist
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
Step 4: Integrate with Production
CI/CD Integration
# GitHub Actions example
name: Agent Evals
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
Production Monitoring
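# Real-time eval sampling
import random

class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1, threshold: float = 0.7):
        self.sample_rate = sample_rate
        # Alert threshold; 0.7 is an illustrative default, tune it for your grader.
        self.threshold = threshold

    # run_eval, log_result, and alert are left to your integration.
    async def monitor(self, request, response):
        # Grade only a random sample of traffic to bound cost.
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")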
A/B Testing
# Compare agent versions
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results),
    }
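The `calculate_confidence` helper above is left undefined. One common option, shown here as a sketch and not as the skill's prescribed method, is a two-proportion z-test on pass rates. The function name and its inputs (pass counts and trial counts per version) are assumptions, and the normal approximation is only reasonable with a few dozen or more trials per version.

```python
import math


def pass_rate_p_value(passes_a, n_a, passes_b, n_b):
    """Two-sided p-value for the difference between two pass rates.

    Uses a pooled two-proportion z-test; smaller values mean the observed
    gap is less likely to be noise.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("each version needs at least one trial")
    rate_a = passes_a / n_a
    rate_b = passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both versions passed everything or failed everything: no evidence of a gap.
        return 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


print(pass_rate_p_value(90, 100, 50, 100))
print(pass_rate_p_value(45, 50, 44, 50))
```

Agent runs are stochastic, so declaring a winner only when the p-value falls below a threshold chosen in advance guards against shipping on noise.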
Best Practices
Do's ✅
- Start with 20-50 representative tasks
- Use code-based graders when possible
- Focus on outcomes, not paths
- Read transcripts for debugging
- Monitor for eval saturation
- Balance positive/negative cases
- Isolate eval environments
- Version your eval suites
Don'ts ❌
- Don't over-rely on model-based graders without calibration
- Don't ignore failed evals (false negatives exist)
- Don't grade on intermediate steps
- Don't skip transcript analysis
- Don't use production data without sanitization
- Don't let eval suites become stale
Success Patterns
Pattern 1: Graduated Eval Complexity
Level 1: Unit evals (single capability)
Level 2: Integration evals (combined capabilities)
Level 3: End-to-end evals (full workflows)
Level 4: Adversarial evals (edge cases)
Pattern 2: Eval-Driven Development
1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite
Pattern 3: Continuous Calibration
Weekly: Review grader accuracy
Monthly: Update rubrics based on feedback
Quarterly: Full grader audit with human baseline
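The calibration reviews above need a number to track. A minimal sketch, assuming model and human graders have scored the same transcripts on the same 1-5 scale, is the agreement rate; `agreement_rate` and its `tolerance` parameter are hypothetical names for this example.

```python
def agreement_rate(model_scores, human_scores, tolerance=0):
    """Share of items where the model grader is within `tolerance` of the human score."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    if not model_scores:
        raise ValueError("need at least one scored item")
    agreeing = sum(
        1 for model, human in zip(model_scores, human_scores)
        if abs(model - human) <= tolerance
    )
    return agreeing / len(model_scores)


model = [5, 3, 1, 4]
human = [5, 4, 1, 2]
print(agreement_rate(model, human))               # exact agreement
print(agreement_rate(model, human, tolerance=1))  # within one point
```

A drop in agreement between reviews signals that the rubric or the judge prompt needs attention before its scores are trusted again.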
Troubleshooting
Problem: Eval scores at 100%
Solution: Add harder tasks, check for eval saturation (Step 7)
Problem: Inconsistent model-based grader scores
Solution: Add more examples to rubric, use structured output, ensemble graders
Problem: Evals too slow for CI
Solution: Use compact mode (--mode=compact, as in the CI example), parallelize, sample subset for PR checks
Problem: Agent passes evals but fails in production
Solution: Add production failure cases to eval suite, increase diversity
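The "ensemble graders" fix for inconsistent model-based scores can be sketched as follows. The graders here are stand-in callables; in practice each would wrap a separate judge call. The `ensemble_score` helper and the use of the median are illustrative assumptions.

```python
import statistics


def ensemble_score(graders, transcript):
    """Score one transcript with several graders and summarize the results.

    The median resists a single outlier judge; the spread shows how much the
    graders disagreed and can be used to route items to human review.
    """
    scores = [grader(transcript) for grader in graders]
    if not scores:
        raise ValueError("need at least one grader")
    return {
        "score": statistics.median(scores),
        "spread": max(scores) - min(scores),
        "scores": scores,
    }


# Stand-in graders that return fixed 1-5 scores for demonstration.
graders = [lambda t: 4, lambda t: 5, lambda t: 2]
print(ensemble_score(graders, transcript=[]))
```

Items with a large spread are good candidates for the human-graded sample, since they are where the rubric is most ambiguous.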
Examples
Example 1: Simple Coding Agent Eval
# Task definition
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"},
    ],
}
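# Grader
def grade(task, outcome):
    code = outcome["code"]
    namespace = {}
    # Run only inside a sandbox: exec executes arbitrary agent-written code.
    exec(code, namespace)
    # Names defined by exec live in the namespace dict, not in grade's locals.
    fizzbuzz = namespace.get("fizzbuzz")
    if fizzbuzz is None:
        return 0.0
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0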
Example 2: Conversational Agent Eval with LLM Rubric
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5
grader:
  type: model
  model: claude-3-5-sonnet-20241022
  rubric: |
    Score 1-5 on each dimension:
    - Empathy: Did agent acknowledge customer frustration?
    - Resolution: Was a clear solution offered?
    - Efficiency: Was issue resolved in reasonable turns?