eval
Plan and run conversational AI agent evaluations with test generation and analysis.
Install this agent skill to your Project
npx add-skill https://github.com/mikeyobrien/rho/tree/main/skills/eval
EvalKit
Overview
EvalKit is a conversational evaluation framework for AI agents that guides you through creating robust evaluations using the Strands Evals SDK. Through natural conversation, you can plan evaluations, generate test data, execute evaluations, and analyze results.
Parameters
- agent_path (required): Path to the agent folder to evaluate (e.g., ./chatbot-agent, /path/to/my-agent)
- evaluation_focus (optional): Specific aspects to evaluate (e.g., "response quality", "tool calling accuracy")
- test_case_count (optional, default: 3): Number of test cases to generate
- output_format (optional, default: "jsonl"): Format for test data output
Constraints for parameter acquisition:
- You MUST ask for the agent_path if not provided because evaluation cannot proceed without knowing which agent to evaluate
- You MUST support multiple input methods including:
- Direct input: Path provided directly in the conversation
- Relative paths: Paths relative to current working directory
- Absolute paths: Full system paths to agent location
- You SHOULD infer evaluation_focus from user's natural language description when not explicitly provided
- You MAY use intelligent defaults for optional parameters based on agent analysis
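The parameters and defaults above could be captured in a small container. This is a minimal sketch only: `EvalParams` and `resolve_agent_path` are hypothetical names introduced for illustration and are not part of EvalKit or the Strands Evals SDK.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class EvalParams:
    """Hypothetical container for the parameters listed above."""
    agent_path: Path                         # required
    evaluation_focus: Optional[str] = None   # optional, inferred when omitted
    test_case_count: int = 3                 # documented default
    output_format: str = "jsonl"             # documented default


def resolve_agent_path(raw: str, cwd: Path) -> Path:
    """Accept a relative or absolute path and return an absolute one."""
    path = Path(raw).expanduser()
    if not path.is_absolute():
        path = cwd / path
    return path.resolve()
```

For example, `EvalParams(agent_path=resolve_agent_path("./chatbot-agent", Path.cwd()))` keeps the documented defaults for the optional parameters.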
Steps
1. Setup and Initialization
When a user requests evaluation (any phase), first validate the environment:
Folder Structure:
All evaluation artifacts MUST be created in the eval/ folder at the same level as the target agent folder:
<agent-evaluation-project>/     # Example name - can be any name for user's evaluation project
├── <target-agent-folder>/      # Example name - this is the agent you are evaluating
│   └── [agent source code]     # Existing agent code
└── eval/                       # All evaluation files go here (sibling to target-agent-folder)
    ├── eval-plan.md
    ├── test-cases.jsonl
    ├── results/
    ├── run_evaluation.py
    ├── eval-report.md
    └── README.md
Note:
- The eval/ folder is a sibling directory to the user's agent folder, not nested inside it
- agent-evaluation-project and target-agent-folder are placeholder names - the user may use any names that fit their project
Constraints:
- You MUST check if the agent folder exists
- You MUST verify Python 3.11+ is installed
- You MUST navigate to the evaluation project directory (containing both agent folder and eval/) before any operation
- You MUST create the eval/ folder as a sibling to the agent folder
- You MUST NOT create evaluation folders inside the agent folder
- You MUST create the eval/ directory at the same level as the agent folder if it doesn't exist
- You MUST ensure all evaluation artifacts are within the eval/ folder
- You MUST check for any existing evaluation artifacts in the eval/ folder
- You SHOULD validate that required dependencies are available
- You MUST use relative paths from the evaluation project directory (e.g., "./eval/eval-plan.md") for all file operations
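The setup checks above can be sketched as a small helper. The function names and the `min_python` parameter are assumptions made for this example; the constraint itself (Python 3.11+, eval/ as a sibling of the agent folder) comes from the list above.

```python
import sys
from pathlib import Path
from typing import List, Tuple


def prepare_eval_dir(agent_path: Path, min_python: Tuple[int, int] = (3, 11)) -> Path:
    """Validate the environment and return the sibling eval/ directory."""
    if sys.version_info < min_python:
        raise RuntimeError(f"Python {min_python[0]}.{min_python[1]}+ is required")
    if not agent_path.is_dir():
        raise FileNotFoundError(f"Agent folder not found: {agent_path}")
    # eval/ sits next to the agent folder, never inside it.
    eval_dir = agent_path.resolve().parent / "eval"
    eval_dir.mkdir(exist_ok=True)
    return eval_dir


def existing_artifacts(eval_dir: Path) -> List[str]:
    """List evaluation artifacts already present in eval/."""
    return sorted(p.name for p in eval_dir.iterdir())
```

Calling `existing_artifacts` right after `prepare_eval_dir` shows whether an earlier plan, test cases, or results are already in place before anything is overwritten.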
2. Planning Phase
When to Trigger: User requests evaluation planning or mentions creating/designing an evaluation
User Intent Recognition:
- Keywords: "plan", "design", "create evaluation", "evaluate my agent"
- Context: User provides agent path or describes agent to evaluate
- Goal: Understand what the user wants to evaluate and why
Execution Flow:
- Parse user request: Extract agent path, evaluation focus, and specific requirements from natural language
- Navigate to evaluation project directory:
      cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
- Create evaluation directory structure:
      mkdir -p eval
- Follow this execution flow:
  - Parse user evaluation requirements from user input
  - Analyze agent and user requirements:
    - Parse specific evaluation requirements, scenarios, and constraints from user input
    - Scan codebase for agent architecture and capabilities
    - Check for existing test cases and evaluation files
  - Design evaluation strategy:
    - Define evaluation areas and metrics (user-request-driven with agent-aware defaults)
    - Identify test data requirements
    - Define file structure
    - Select technology stack
- Write the complete evaluation plan to eval/eval-plan.md using the template structure (see Appendix A: Evaluation Plan Template), replacing placeholders with concrete details derived from the analysis while preserving section order and headings.
- Report completion with evaluation plan file path, and suggest next step: "Would you like me to generate test cases based on this plan?"
Evaluation Planning Guidelines
Design Principles
High-Level Design (What & Why):
- Focus on WHAT to evaluate and WHY it matters for the agent
- Define evaluation areas and metrics that are measurable and verifiable
- Ensure requirements can be tested through actual agent execution
Low-Level Implementation (How):
- Select appropriate technology stack and architecture
- Design practical file structure and execution approach
- Choose integration patterns and configuration methods
Metrics Guidelines
Evaluation metrics must be:
- Measurable: Define what will be measured
- Verifiable: Can be measured through actual agent execution
- Implementation-ready: Clear enough to guide technical implementation
Architecture Principles
Key Principles:
- Simple Structure: Use the flat eval/ directory structure
- Real Agent Focus: Always use actual agent execution, never simulation or mock
- Focused Implementation: Avoid over-engineering, focus on core evaluation logic
- Minimal Viable Implementation: Start with essential components, add complexity incrementally
- Framework-First: Leverage existing evaluation frameworks before building custom solutions
- Modular Design: Create reusable components that can be easily tested and maintained
Technology Selection Defaults
Examples of reasonable defaults:
- Evaluation Framework: Strands Evals SDK
- LLM calling service: Built into Strands framework
- LLM provider: Amazon Bedrock
- Data processing: JSON or JSONL
- Agent integration: Direct imports for Python agents
Constraints:
- You MUST prioritize user evaluation requests over detected agent state
- You MUST create eval-plan.md using the template structure
- You MUST analyze agent architecture and capabilities in target-agent-folder
- You MUST define evaluation areas and metrics (user-request-driven with agent-aware defaults)
- You MUST make informed decisions without requiring excessive user clarification
- You MUST save the evaluation plan to eval/eval-plan.md (sibling to agent folder)
- You MUST ensure the eval/ folder is at the same level as the agent folder
3. Test Data Generation Phase
When to Trigger: User requests test case generation or mentions creating test data
User Intent Recognition:
- Keywords: "generate test cases", "create tests", "test data", "test scenarios"
- Context: Evaluation plan exists
- Goal: Create comprehensive test cases
Execution Flow:
- Parse user request: Extract any specific requirements (e.g., "focus on edge cases", "10 test cases")
- Navigate to evaluation project directory:
      cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
- Load the current evaluation plan (eval/eval-plan.md) to understand evaluation areas and test data requirements.
- Follow this execution flow:
  - Parse user context from user input (if provided)
  - Validate that the evaluation plan contains test data requirements; update the evaluation plan if it does not align with the user's input (if provided); add entry to User Requirements Log in eval-plan.md
  - Generate proper test cases covering all scenarios and meeting all requirements
  - Structure test cases in JSONL format
  - Save test cases to eval/test-cases.jsonl
  - Update Evaluation Progress section in eval-plan.md with completion status
- Report completion with test case count, coverage summary, and suggest next step: "Would you like me to run the evaluation with these test cases?"
Data Generation Guidelines
- Prioritize user-specific data requests: User input takes precedence over the established evaluation plan - always honor specific user requirements and constraints. Update the evaluation plan if needed.
Constraints:
- You MUST load and validate the evaluation plan from eval/eval-plan.md
- You MUST prioritize user-specific data requests over established evaluation plan
- You MUST generate data in JSONL format
- You MUST save test cases to eval/test-cases.jsonl
- You MUST ensure all files remain within the eval/ folder
- You MUST update Evaluation Progress section in eval-plan.md
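Saving test cases in JSONL format means writing one JSON object per line. The sketch below illustrates that step; `write_test_cases` is a hypothetical helper, and the field names in the usage note (`name`, `input`, `expected_output`) are illustrative assumptions rather than a schema required by the Strands Evals SDK.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_test_cases(cases: List[Dict], path: Path) -> int:
    """Write one JSON object per line and return the number of cases written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for case in cases:
            # ensure_ascii=False keeps non-English test inputs readable.
            fh.write(json.dumps(case, ensure_ascii=False) + "\n")
    return len(cases)
```

A call such as `write_test_cases([{"name": "basic-greeting", "input": "Hello", "expected_output": "A greeting"}], Path("eval/test-cases.jsonl"))` would produce a one-line file; the returned count can feed the completion report.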
4. Evaluation Implementation and Execution Phase
When to Trigger: User requests evaluation execution or mentions running tests
User Intent Recognition:
- Keywords: "run evaluation", "execute", "run tests", "evaluate"
- Context: Test cases exist
- Goal: Execute evaluation and generate results
Execution Flow:
- Parse user request: Extract any specific requirements (e.g., "run on subset", "verbose output")
- Navigate to evaluation project directory:
      cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
- Load the current evaluation plan (eval/eval-plan.md) to understand evaluation requirements and agent architecture.
- Follow this execution flow:
  - Parse user context from user input (if provided)
  - Review evaluation plan to understand requirements; update the evaluation plan if it does not align with the user's input (if provided); add entry to User Requirements Log in eval-plan.md
  - Implement Strands Evals SDK evaluation pipeline:
    IMPORTANT: Always navigate to repository root before any operation in the following process to avoid path errors.
    - Create requirements.txt: Detect existing dependencies and consolidate into unified requirements.txt at repository root, adding Strands Evals SDK dependencies
    - Set up environment: Use uv to create virtual environment, activate it, and install requirements.txt
    - Implement run_evaluation.py: Create eval/run_evaluation.py using Strands Evals SDK patterns with Case objects, Experiment class, and appropriate evaluators
    - Create agent integration: Implement agent execution logic within the evaluation framework
    - Execute evaluation: Run the experiment to generate evaluation results
    - Save results: Store evaluation results in eval/results/ directory
    - Create documentation: Create eval/README.md with running instructions for users
  - Update Evaluation Progress section in eval-plan.md with completion status
- Report completion with evaluation results summary and suggest next step: "Would you like me to analyze these results and provide recommendations?"
Implementation Guidelines
CRITICAL: Always Create Minimal Working Version: Implement the most basic version that works
Strands Evals SDK Integration
CRITICAL REQUIREMENT - Getting Latest Documentation: Before implementing evaluation code, you MUST retrieve the latest Strands Evals SDK documentation and API usage examples. This is NOT optional. You MUST NOT proceed with implementation without either context7 access or the source code. This ensures you're using the most current patterns and avoiding deprecated APIs.
Step 1: Check Context7 MCP Availability: First, check if context7 MCP server is available by attempting to use it. If you receive an error indicating context7 is not available, proceed to Step 3.
Step 2: Primary Method - Using Context7 (If Available):
- Use context7 to get library documentation: "Get documentation for strands-agents-evals focusing on Case, Experiment, and Evaluator classes"
- Review the latest API patterns and examples
- Implement evaluation code using the current API
Step 3: Fallback Method - REQUIRED If Context7 Is Not Available: If context7 MCP is not installed or doesn't have Strands Evals SDK documentation, you MUST STOP and prompt the user to take one of these actions:
REQUIRED USER ACTION - Choose ONE of the following:
Option 1: Install Context7 MCP Server (Recommended)
Please install the context7 MCP server in your coding assistant to access the latest Strands Evals SDK documentation. Installation steps vary by assistant:
- For your specific coding assistant: Check your assistant's documentation on how to install MCP servers
- Context7 MCP package: @upstash/context7-mcp
- Common installation: Many assistants support adding MCP servers through their settings/configuration
Note: If you're unsure how to install MCP servers in your coding assistant, please consult your assistant's support resources or choose Option 2 below (clone source code).
After installation, you'll be able to query: "Get documentation for strands-agents-evals focusing on Case, Experiment, and Evaluator classes"
Option 2: Clone Strands Evals SDK Source Code
If you cannot install context7 MCP or prefer to work with source code directly:
cd <your-evaluation-project>
git clone https://github.com/strands-agents/evals strands-agents-evals-source
IMPORTANT: You MUST NOT proceed with implementation until the user has completed one of these options. Do NOT attempt to implement evaluation code using only the reference examples in Appendix C, as they may be outdated.
After the user confirms they've completed one of the above options:
If Context7 was installed:
- Use context7 to get the latest Strands Evals SDK documentation
- Review the latest API patterns and examples
- Implement evaluation code using the current API
If source code was cloned:
- Read the source files to understand the current API: strands-agents-evals-source/src/strands_evals/
- Check examples in the repository: strands-agents-evals-source/examples/
- Review API definitions and usage patterns
- Implement evaluation code based on the actual source code
Core Components:
- Case objects: Individual test cases with input, expected output, and metadata
- Experiment class: Collection of cases with evaluator for running evaluations
- Built-in evaluators: OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator
- Direct execution: Agent execution with evaluation, no separate trace collection needed
Environment Setup Guidelines
- Check Existing Requirements: Verify requirements.txt exists in repository root
      # Check if requirements.txt exists
      ls requirements.txt
- Add Strands Evals SDK Dependencies: Update existing requirements.txt with Strands evaluation dependencies
      # Add Strands Evals SDK and related dependencies
      grep -q "strands-agents-evals" requirements.txt || echo "strands-agents-evals" >> requirements.txt
      # Add other evaluation-specific dependencies as needed based on evaluation plan
- Installation: Use uv for dependency management
      uv venv
      source .venv/bin/activate
      uv pip install -r requirements.txt
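The grep-then-append guard above keeps requirements.txt free of duplicate entries. For environments where the step is scripted in Python instead of the shell, an equivalent could look like the sketch below; `ensure_dependency` is a hypothetical helper name and the substring match mirrors what `grep -q` does.

```python
from pathlib import Path


def ensure_dependency(requirements: Path, package: str) -> bool:
    """Append `package` unless a line already mentions it; return True if added."""
    if requirements.exists():
        lines = requirements.read_text(encoding="utf-8").splitlines()
    else:
        lines = []
    # Substring match, like `grep -q "<package>" requirements.txt`.
    if any(package in line for line in lines):
        return False
    lines.append(package)
    requirements.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

Running it twice with the same package leaves the file unchanged the second time, which makes the setup step safe to repeat.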
Common Pitfalls to Avoid
- Over-Engineering: Don't add complexity before the basic version works
- Ignoring the Plan: Follow the established evaluation plan structure and requirements
- Separate Trace Collection: Don't implement separate trace collection - Strands Evals SDK handles this automatically
Constraints:
- You MUST implement evaluation in eval/run_evaluation.py using Strands Evals SDK
- You MUST ensure all evaluation code files are within eval/
- You MUST always create minimal working version first
- You MUST execute evaluation and save results to eval/results/
- You MUST create eval/README.md with running instructions
- You MUST keep all evaluation artifacts within the eval/ folder
- You MUST update Evaluation Progress section in eval-plan.md
- You MUST NOT implement separate trace collection - Strands Evals SDK handles this automatically
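Saving results under eval/results/ can be sketched independently of the SDK. The helper below is an assumption-laden illustration: `save_results`, the `results.json` file name, and the timestamp format are hypothetical choices, and `results` stands for whatever JSON-serializable summary the evaluation run produces.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Optional


def save_results(results: Any, eval_dir: Path, now: Optional[datetime] = None) -> Path:
    """Write results as JSON under eval/results/<timestamp>/ and return the file path."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    run_dir = eval_dir / "results" / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / "results.json"
    out.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out
```

Keeping each run in its own timestamped folder means earlier results stay available for comparison when the evaluation is repeated.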
5. Analysis and Reporting Phase
When to Trigger: User requests results analysis or mentions generating a report
User Intent Recognition:
- Keywords: "analyze results", "generate report", "recommendations", "what should I improve"
- Context: Evaluation results exist
- Goal: Provide actionable insights and recommendations
Execution Flow:
- Parse user request: Extract any specific analysis focus (e.g., "focus on failures", "prioritize critical issues")
- Navigate to evaluation project directory:
      cd <your-evaluation-project> # Navigate to the directory containing both agent folder and eval/
- Load and analyze the evaluation results from eval/results/
- Follow this execution flow:
  - Parse user context from user input (if provided); add entry to User Requirements Log in eval-plan.md
  - Load and validate evaluation results data
  - Perform comprehensive results analysis
  - Identify patterns, strengths, and weaknesses
  - Generate actionable improvement recommendations
  - Create detailed advisory report with evidence
  - Provide prioritized action items for agent enhancement
  - Update Evaluation Progress section in eval-plan.md with completion status
- Results Analysis Process:
a. Data Validation: Ensure results are from real execution:
- Load evaluation results from the specified path
- Validate that results come from actual agent execution (not simulation)
- Verify data completeness and format consistency
b. Results Analysis: Analyze evaluation outcomes:
- Success Rate: Calculate overall success/failure rates
- Quality Scores: Evaluation metric performance across test cases
- Failure Patterns: Common error types and their frequency
- Strengths & Weaknesses: Areas of strong vs. poor performance
c. Insights Generation: Identify key findings:
- Root Causes: Why certain metrics underperform
- Improvement Opportunities: Specific areas for enhancement
- Quality Trends: Patterns in evaluation scores and response quality
- Improvement Recommendations: Generate specific, actionable recommendations:
a. Prioritized Recommendations: Based on evaluation findings:
Critical Issues (Immediate attention required)
- Address high failure rates or low quality scores
- Fix systematic errors in reasoning or response generation
Quality Improvements (Medium-term enhancements)
- Improve consistency across test cases
- Enhance response completeness and accuracy
Enhancement Opportunities (Future improvements)
- Handle edge cases more effectively
- Improve response clarity and formatting
b. Evidence-Based Recommendations: All recommendations must cite specific data:
- Issue: Clear problem statement with evaluation metrics
- Evidence: Specific data points from results
- Recommended Actions: Specific improvement suggestions
- Expected Impact: Predicted improvements in evaluation scores
- Advisory Report Generation: Create focused report using the template structure (see Appendix B: Evaluation Report Template) with:
- Executive summary with key findings
- Evaluation results analysis
- Prioritized improvement recommendations with evidence
- IMPORTANT: Follow all HTML comment instructions (`<!-- ... -->`) in the template when generating content, then remove these comment instructions from the final report - they are template guidance only and should not appear in the generated report.
- Report completion with key findings and ask: "Would you like me to help implement any of these recommendations?"
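The success-rate, quality-score, and failure-pattern calculations in the Results Analysis Process could be sketched as follows. The record layout is an assumption for illustration only: each result is taken to be a dict with a boolean `passed`, an optional numeric `score`, and an optional `error_type`; actual Strands Evals SDK output may be shaped differently.

```python
from collections import Counter
from typing import Dict, List


def summarize(results: List[Dict]) -> Dict:
    """Compute success rate, mean score, and failure-pattern counts."""
    if not results:
        raise ValueError("no results to analyze")
    passed = sum(1 for r in results if r["passed"])
    scores = [r["score"] for r in results if "score" in r]
    failures = Counter(
        r.get("error_type", "unknown") for r in results if not r["passed"]
    )
    return {
        "total": len(results),
        "success_rate": passed / len(results),
        "mean_score": sum(scores) / len(scores) if scores else None,
        # Most frequent failure types first.
        "failure_patterns": dict(failures.most_common()),
    }
```

The resulting numbers are the kind of evidence each recommendation is expected to cite.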
Analysis and Reporting Guidelines
Analysis Principles
- Evidence-Based: All insights must be supported by actual execution data
- Actionable: Recommendations must be specific and implementable
- Prioritized: Focus on high-impact improvements first
- Measurable: Include expected outcomes and success metrics
- Realistic: Consider implementation effort and constraints
Red Flags for Simulation
Always check for these indicators of simulated results:
- Identical metrics across different test cases
- Perfect success rates (100%) with large test sets
- Keywords like "simulated", "mocked", "fake" in results
- Lack of natural variation in evaluation scores
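These indicators lend themselves to an automated pre-check. The sketch below is heuristic and rests on assumptions: results are dicts with optional `passed` and `score` fields, and "large" is taken to mean 20 or more test cases, a threshold the text above does not define.

```python
import json
from typing import Dict, List

SUSPICIOUS_KEYWORDS = ("simulated", "mocked", "fake")


def simulation_red_flags(results: List[Dict], large_set: int = 20) -> List[str]:
    """Return human-readable warnings for results that look simulated."""
    flags = []
    scores = [r["score"] for r in results if "score" in r]
    # Identical metrics / no natural variation across test cases.
    if len(scores) > 1 and len(set(scores)) == 1:
        flags.append("identical scores across test cases")
    # Perfect success rate on a large test set (threshold is an assumption).
    if len(results) >= large_set and all(r.get("passed") for r in results):
        flags.append("perfect success rate on a large test set")
    # Keyword scan over the serialized results.
    text = json.dumps(results).lower()
    for word in SUSPICIOUS_KEYWORDS:
        if word in text:
            flags.append(f"keyword '{word}' found in results")
    return flags
```

An empty list does not prove the results are real; it only means none of these specific indicators fired.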
Quality Standards for Recommendations
Good Recommendations:
- Cite specific evidence from results
- Include expected impact and effort estimates
- Provide concrete implementation steps
- Address root causes, not just symptoms
- Are feasible given current constraints
Poor Recommendations:
- Make vague suggestions without evidence
- Don't quantify expected improvements
- Focus on symptoms rather than causes
- Are too generic or theoretical
- Ignore practical implementation challenges
Report Quality Standards
Ensure your advisory report:
- Uses data from real agent execution (never simulation)
- Provides specific, actionable recommendations with evidence
- Focuses on evaluation results analysis and insights
- Prioritizes recommendations by impact on evaluation performance
Evaluation Report Template: See Appendix B: Evaluation Report Template
Constraints:
- You MUST follow the analysis framework described in the Execution Flow and guidelines above
- You MUST validate results are from real execution (not simulation)
- You MUST generate evidence-based recommendations with specific data
- You MUST prioritize recommendations by impact (Critical/Quality/Enhancement)
- You MUST create eval/eval-report.md with analysis and recommendations
- You MUST ensure the report remains within the eval/ folder
- You MUST update Evaluation Progress section in eval-plan.md
6. Completion and Documentation
Finalize the evaluation and prepare deliverables.
Constraints:
- You MUST ensure eval/README.md exists with complete instructions
- You MUST verify all evaluation artifacts are within eval/
- You MUST confirm the folder structure matches:
<your-evaluation-project>/      # Your chosen project name
├── <your-agent-folder>/        # Your chosen agent folder name
│   └── [agent source code]
└── eval/
    ├── eval-plan.md
    ├── test-cases.jsonl
    ├── results/
    ├── run_evaluation.py
    ├── eval-report.md
    └── README.md
- You MUST verify no evaluation files were created inside the agent folder
- You SHOULD create a summary of key findings and recommendations
- You MAY suggest next steps for continuous evaluation
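The final layout check can be approximated with a short script. This is a heuristic sketch: `check_layout` is a hypothetical helper, and the `STRAY` set is an assumption that leaves out README.md and results because an agent folder may legitimately contain those names.

```python
from pathlib import Path
from typing import List

# Artifact names from the folder structure above.
EXPECTED = ("eval-plan.md", "test-cases.jsonl", "results",
            "run_evaluation.py", "eval-report.md", "README.md")
# Names treated as misplaced when found inside the agent folder (assumption).
STRAY = {"eval-plan.md", "test-cases.jsonl", "run_evaluation.py", "eval-report.md"}


def check_layout(project: Path, agent_folder: str) -> List[str]:
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    eval_dir = project / "eval"
    for name in EXPECTED:
        if not (eval_dir / name).exists():
            problems.append(f"missing eval/{name}")
    agent_dir = project / agent_folder
    for path in sorted(agent_dir.rglob("*")):
        if path.name in STRAY or (path.is_dir() and path.name == "eval"):
            problems.append(
                f"evaluation artifact inside agent folder: {path.relative_to(project)}"
            )
    return problems
```

An empty return value covers both constraints at once: every expected artifact exists under eval/, and nothing that looks like an evaluation file sits inside the agent folder.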
Examples
Example Input
agent_path: "./chatbot-agent"
evaluation_focus: "response quality and tool calling accuracy"
Example Output
Complete Evaluation Flow:
Phase 1 - Planning:
User: "I need to evaluate my customer support chatbot at ./chatbot-agent. Focus on response quality and tool calling accuracy."
Assistant: "I'll create an evaluation plan for your customer support chatbot..."
[Creates eval/eval-plan.md with 2 key metrics and 3 test scenarios]
Phase 2 - Data Generation:
User: "Yes, generate 5 test cases"
Assistant: "I'll generate 5 test cases covering the scenarios..."
[Creates eval/test-cases.jsonl with 2 basic queries, 2 tool-calling scenarios, 1 edge case]
Phase 3 - Evaluation Execution:
User: "Run the evaluation"
Assistant: "I'll implement and execute the evaluation using Strands Evals SDK..."
[Creates eval/run_evaluation.py, runs evaluation]
Results: Overall success rate: 80%, Response Quality: 4.2/5, Tool Call Accuracy: 75%
Phase 4 - Analysis:
User: "Yes, analyze the results"
Assistant: "I'll analyze the evaluation results and generate recommendations..."
[Creates eval/eval-report.md]
Key findings: Strong performance on basic queries (100% success), Tool calling needs improvement (25% failure rate)
Example: Iterative Refinement
User: "Create an evaluation plan for my agent at ./my-agent"
Assistant: [Creates initial plan in eval/eval-plan.md]
User: "Add more focus on error handling"
Assistant: "I'll update the evaluation plan to include error handling metrics..."
[Updates eval/eval-plan.md]
User: "Generate test cases with more edge cases"
Assistant: "I'll generate test cases with additional edge case coverage..."
[Updates eval/test-cases.jsonl]
Example Output Structure
After running all phases, your agent repository will have the following structure:
<your-evaluation-project>/      # Your chosen project name (e.g., my-chatbot-eval)
├── <your-agent-folder>/        # Your chosen agent folder name (e.g., chatbot-agent)
│   └── [existing agent files]
└── eval/                       # All evaluation files (sibling to agent folder)
    ├── eval-plan.md            # Complete evaluation specification and plan
    ├── test-cases.jsonl        # Generated test scenarios
    ├── README.md               # Running instructions and usage examples
    ├── run_evaluation.py       # Strands Evals SDK evaluation implementation
    ├── results/                # Evaluation outputs
    │   └── [timestamp]/        # Timestamped evaluation results
    └── eval-report.md          # Analysis and recommendations
Note:
- All evaluation files are created in the eval/ folder at the same level as your agent folder, keeping evaluation separate from agent code
- The names shown (e.g., <your-evaluation-project>, <your-agent-folder>) are placeholders - use any names that fit your project
Conversation Flow Management
Phase Dependencies
EvalKit automatically manages phase dependencies:
- Planning Phase: No dependencies (can start anytime)
- Data Generation Phase: Requires evaluation plan
- Evaluation Phase: Requires test cases
- Reporting Phase: Requires evaluation results
Handling Missing Prerequisites
If a user requests a phase without prerequisites:
Example: User says "run the evaluation" but no test cases exist
Response: "I don't see any test cases yet. Would you like me to:
- Generate test cases based on the existing evaluation plan, or
- Create a new evaluation plan first?"
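The dependency rules above reduce to checking which artifacts already exist in eval/. A minimal sketch, assuming the phase names `planning`, `data_generation`, `evaluation`, and `reporting` as hypothetical identifiers:

```python
from pathlib import Path
from typing import Optional


def missing_prerequisite(phase: str, eval_dir: Path) -> Optional[str]:
    """Return a description of the missing prerequisite, or None if the phase can run."""
    if phase == "planning":
        return None  # no dependencies
    if phase == "data_generation":
        return None if (eval_dir / "eval-plan.md").is_file() else "evaluation plan"
    if phase == "evaluation":
        return None if (eval_dir / "test-cases.jsonl").is_file() else "test cases"
    if phase == "reporting":
        results = eval_dir / "results"
        has_results = results.is_dir() and any(results.iterdir())
        return None if has_results else "evaluation results"
    raise ValueError(f"unknown phase: {phase}")
```

When the function returns a string, the assistant can use it to phrase the follow-up question, for example offering to generate test cases when "test cases" is reported missing.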
Conversational Guidance
After completing each phase, suggest the logical next step:
- After Planning: "Would you like me to generate test cases?"
- After Data Generation: "Would you like me to run the evaluation?"
- After Evaluation: "Would you like me to analyze the results?"
- After Reporting: "Would you like help implementing these recommendations?"
Troubleshooting
Common Issues and Solutions
Issue: User requests evaluation but no agent path provided
- Solution: Ask for agent path: "Where is your agent located? Please provide the path to your agent directory."
Issue: Evaluation plan doesn't exist when user requests test generation
- Solution: Offer to create plan first: "I don't see an evaluation plan yet. Would you like me to create one first?"
Issue: Test cases don't exist when user requests evaluation
- Solution: Offer to generate test cases: "I don't see any test cases. Would you like me to generate them based on the evaluation plan?"
Issue: Test data generation fails
- Solution: Ensure eval-plan.md contains valid test data requirements
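- Check: Validate JSONL format with python -m json.tool --json-lines eval/test-cases.jsonl (without --json-lines, json.tool rejects a file that contains more than one JSON object)
- Fix: Update evaluation plan with clearer scenario descriptions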
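When the command-line check is not enough to locate a malformed test case, a line-by-line validator helps. The sketch below is illustrative; `jsonl_errors` is a hypothetical helper, and requiring each line to be a JSON object is an assumption about the test case format.

```python
import json
from pathlib import Path
from typing import List


def jsonl_errors(path: Path) -> List[str]:
    """Return one message per invalid line; an empty list means the file is valid."""
    errors = []
    with path.open(encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {number}: {exc.msg}")
                continue
            if not isinstance(record, dict):
                errors.append(f"line {number}: expected a JSON object")
    return errors
```

Reporting the line number makes it straightforward to fix or regenerate only the affected test case.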
Issue: Evaluation implementation fails with Strands Evals SDK errors
- Solution: Verify Strands Evals SDK is properly installed and configured
- Check: Ensure you're using correct Case, Experiment, and Evaluator patterns
- Fix: Review Strands Evals SDK documentation for correct usage
Issue: Import errors for evaluation dependencies
- Solution: Install required dependencies using uv: uv pip install -r requirements.txt
- Check: Verify Python version is 3.11+
- Check: Ensure virtual environment is activated: source .venv/bin/activate
- Fix: Add missing dependencies to requirements.txt
Issue: Agent execution fails during evaluation
- Solution: Verify agent can be imported and executed properly
- Check: Test agent execution independently before running evaluation
- Fix: Resolve any missing dependencies, API keys, or configuration issues
Issue: User is unsure what to do next
- Solution: Provide clear options: "You can:
- Generate test cases (if plan exists)
- Run the evaluation (if test cases exist)
- Analyze results (if evaluation completed)
- Refine the evaluation plan
What would you like to do?"
Appendix A: Evaluation Plan Template
The following template is used for creating eval-plan.md:
# Evaluation Plan for [AGENT NAME]
## 1. Evaluation Requirements
<!--
ACTION REQUIRED: User input and interpreted evaluation requirements. Defaults to 1-2 key metrics if unspecified.
-->
- **User Input:** `"$ARGUMENTS"` or "No Input"
- **Interpreted Evaluation Requirements:** [Parsed from user input - highest priority]
---
## 2. Agent Analysis
| **Attribute** | **Details** |
| :-------------------- | :---------------------------------------------------------- |
| **Agent Name** | [Agent name] |
| **Purpose** | [Primary purpose and use case in 1-2 sentences] |
| **Core Capabilities** | [Key functionalities the agent provides] |
| **Input** | [Short description, Data types, schemas] |
| **Output** | [Short description, Response types, schemas] |
| **Agent Framework** | [e.g., CrewAI, LangGraph, AutoGen, Custom/None] |
| **Technology Stack** | [Programming language, frameworks, libraries, dependencies] |
**Agent Architecture Diagram:**
[Mermaid diagram illustrating:
- Agent components and their relationships
- Data flow between components
- External integrations (APIs, databases, tools)
- User interaction points]
**Key Components:**
- **[Component Name 1]:** [Brief description of purpose and functionality]
- **[Component Name 2]:** [Brief description of purpose and functionality]
- [Additional components as needed]
**Available Tools:**
- **[Tool Name 1]:** [Purpose and usage]
- **[Tool Name 2]:** [Purpose and usage]
- [Additional tools as needed]
**Observability Status**
- **Tracing Framework:** [Fully/Partially/Not Instrumented, Framework name, version]
- **Custom Attributes:** [Yes/No, Key custom attributes if present]
---
## 3. Evaluation Metrics
<!--
ACTION REQUIRED: If no specific user requirements are provided, use a minimal number of metrics (1-2 metrics) focusing on the most critical aspects of agent performance.
-->
### [Metric Name 1]
- **Evaluation Area:** [Final response quality/tool call accuracy/...]
- **Description:** [What is measured and why]
- **Method:** [Code-based | LLM-as-Judge ]
### [Metric Name 2]
[Repeat for each metric]
---
## 4. Test Data Generation
<!--
ACTION REQUIRED: Keep scenarios minimal and focused. Do not propose more than 3 scenarios.
-->
- **[Test Scenario 1]**: [Description and purpose, complexity]
- **[Test scenario 2]**: [Description and purpose, complexity]
- **Total number of test cases**: [SHOULD NOT exceed 3]
---
## 5. Evaluation Implementation Design
### 5.1 Evaluation Code Structure
<!--
ACTION REQUIRED: The code structure below will be adjusted based on your evaluation requirements and existing agent codebase. This is the recommended starting structure. Only adjust it if necessary.
-->
./                          # Repository root directory
├── requirements.txt        # Consolidated dependencies
├── .venv/                  # Python virtual environment (created by uv)
│
└── eval/                   # Evaluation workspace
    ├── README.md           # Running instructions and usage examples (always present)
    ├── run_evaluation.py   # Strands Evals SDK evaluation implementation (always present)
    ├── results/            # Evaluation outputs (always present)
    ├── eval-plan.md        # This evaluation specification and plan (always present)
    └── test-cases.jsonl    # Generated test cases (from evalkit.data)
### 5.2 Recommended Evaluation Technical Stack
| **Component** | **Selection** |
| :----------------------- | :------------------------------------------------------------ |
| **Language/Version** | [e.g., Python 3.11, Node.js 18+] |
| **Evaluation Framework** | [Strands Evals SDK (default)] |
| **Evaluators** | [OutputEvaluator, TrajectoryEvaluator, InteractionsEvaluator] |
| **Agent Integration** | [e.g., Direct import, API] |
| **Results Storage** | [e.g., JSON files (default)] |
---
## 6. Progress Tracking
### 6.1 User Requirements Log
| **Timestamp** | **Phase** | **Requirement** |
| :----------------- | :-------- | :------------------------------------------------------------------- |
| [YYYY-MM-DD HH:MM] | Planning | [User input from $ARGUMENTS, or "No specific requirements provided"] |
### 6.2 Evaluation Progress
| **Timestamp** | **Component** | **Status** | **Notes** |
| :----------------- | :--------------- | :------------------------------ | :--------------------------------------------- |
| [YYYY-MM-DD HH:MM] | [Component name] | [In Progress/Completed/Blocked] | [Technical details, blockers, or achievements] |
Appendix B: Evaluation Report Template
The following template is used for creating eval-report.md:
# Agent Evaluation Report for [AGENT NAME]
## Executive Summary
<!--
ACTION REQUIRED: Provide high-level evaluation results and key findings. Focus on actionable insights for stakeholders.
-->
- **Test Scale**: [N] test cases
- **Success Rate**: [XX.X%]
- **Status**: [Excellent/Good/Poor]
- **Strengths**: [Specific capability or metric] [Performance highlight] [Reliability aspect]
- **Critical Issues**: [Blocking issue + impact] [Performance bottleneck] [Safety/compliance concern]
- **Action Priority**: [Critical fixes] [Improvements] [Enhancements]
---
## Evaluation Results
### Test Case Coverage
<!--
ACTION REQUIRED: List all test scenarios that were evaluated, providing context for the results.
-->
- **[Test Scenario 1]**: [Description and coverage]
- **[Test Scenario 2]**: [Description and coverage]
- [Additional scenarios as needed]
### Results
| **Metric** | **Score** | **Target** | **Status** |
| :-------------- | :-------- | :--------- | :---------- |
| [Metric Name 1] | [XX.X%] | [XX%] | [Pass/Fail] |
| [Metric Name 2] | [X.X/5] | [4.0+] | [Pass/Fail] |
| [Metric Name 3] | [XX.X%] | [95%+] | [Pass/Fail] |
### Results Summary
[Brief description of overall performance and findings across metrics]
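The Success Rate figure and the Score, Target, and Status columns can be derived mechanically from raw results. The following is a minimal sketch under two assumptions: a metric passes when its score meets or exceeds its target, and scores are either percentages or ratings out of 5, as in the placeholder rows above. The function names and the example metric names are hypothetical.

```python
def success_rate(passed: int, total: int) -> float:
    """Percentage of passing test cases, rounded to one decimal (XX.X%)."""
    if total <= 0:
        raise ValueError("total must be a positive number of test cases")
    return round(100.0 * passed / total, 1)


def results_row(metric: str, score: float, target: float, unit: str = "%") -> str:
    """Format one row of the Results table.

    Assumption: score >= target counts as Pass; the template does not say
    how ties are handled.
    """
    status = "Pass" if score >= target else "Fail"
    if unit == "%":
        # Percent metrics: one decimal for the score, compact target (e.g. 95%+).
        return f"| {metric} | {score:.1f}% | {target:g}%+ | {status} |"
    # Rating metrics such as X.X/5: the unit is shown on the score only.
    return f"| {metric} | {score:.1f}{unit} | {target:.1f}+ | {status} |"
```

Usage might look like `results_row("Tool accuracy", 92.5, 95)` for a percentage metric or `results_row("Response quality", 4.2, 4.0, unit="/5")` for a rating; both return a markdown row ready to paste into the table.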
---
## Agent Success Analysis
<!--
ACTION REQUIRED: Focus on what the agent does well. Provide specific evidence and contributing factors for successful performance.
-->
### Strengths
- **[Strength Name 1]**: [What the agent does exceptionally well]
  - **Evidence**: [Specific metrics and examples]
  - **Contributing Factors**: [Why this works well]
- **[Strength Name 2]**: [What the agent does exceptionally well]
  - **Evidence**: [Specific metrics and examples]
  - **Contributing Factors**: [Why this works well]
[Repeat pattern for additional strengths]
### High-Performing Scenarios
- **[Scenario Type 1]**: [Category of tasks where agent excels]
  - **Key Characteristics**: [What makes these scenarios successful]
- **[Scenario Type 2]**: [Category of tasks where agent excels]
  - **Key Characteristics**: [What makes these scenarios successful]
[Repeat pattern for additional scenarios]
---
## Agent Failure Analysis
<!--
ACTION REQUIRED: Analyze failures systematically. Provide root cause analysis and specific improvement recommendations with expected impact.
-->
### Issue 1 - [Priority Level]
- **Issue**: [Clear problem statement with evaluation metrics]
- **Root Cause**: [Technical analysis of why this occurred — path/to/file.py:START-END]
- **Evidence**: [Specific data points from results]
- **Impact**: [Effect on overall performance]
- **Priority Fixes**:
  - P1 — [Fix name]: [One-line solution] → Expected gain: [Metric +X]
  - P2 — [Fix name]: [One-line solution] → Expected gain: [Metric improvement]
### Issue 2 - [Priority Level]
[Repeat structure for additional issues]
---
## Action Items & Recommendations
<!--
ACTION REQUIRED: Provide specific, implementable tasks with clear steps. Prioritize by impact and effort required.
-->
### [Item Name] - Priority [Number] ([Critical/Enhancement])
- **Description**: [Description of this item]
- **Actions**:
  - [ ] [Specific task with implementation steps]
  - [ ] [Specific task with implementation steps]
  - [ ] [Additional tasks as needed]
### [Additional Item Name] - Priority [Number] ([Critical/Enhancement])
[Repeat structure for additional action items]
---
## Artifacts & Reproduction
### Reference Materials
- **Agent Code**: [Path to agent implementation]
- **Test Cases**: [Path to test cases]
- **Traces**: [Path to traces]
- **Results**: [Path to results files]
- **Evaluation Code**: [Path to evaluation implementation]
---
## Evaluation Limitations and Improvement
<!--
ACTION REQUIRED: Identify limitations in the current evaluation approach and suggest improvements for future iterations.
-->
### Test Data Improvement
- **Current Limitations**: [Evaluation scope limitations]
- **Recommended Improvements**: [Specific suggestions for test data enhancement]
### Evaluation Code Enhancement
- **Current Limitations**: [Limitations of evaluation implementation and metrics]
- **Recommended Improvements**: [Specific suggestions for evaluation code improvement]
### [Additional Improvement Area]
[Repeat structure for other evaluation improvement areas]