Agent skill
agent-test
Run automated tests for LLM agents using `tdx agent test`. Covers test.yml format with user_input/criteria, single and multi-round tests, evaluation by judge agent, and criteria development workflow.
Install this agent skill to your Project
npx add-skill https://github.com/treasure-data/td-skills/tree/main/tdx-skills/agent-test
SKILL.md
tdx Agent Test
Run automated tests against agents using YAML test definitions. Tests are evaluated by a judge agent for binary pass/fail results.
Commands
# Run tests from current agent directory
tdx agent test
# Run tests from specific path
tdx agent test ./agents/my-project/my-agent/
# Run specific tests by name
tdx agent test --name "greeting_test" --name "context_test"
# Run tests with specific tags
tdx agent test --tags "smoke,regression"
# Validate test file without running
tdx agent test --dry-run
# Run without evaluation (just execute conversations)
tdx agent test --no-eval
# Re-evaluate last test run with updated criteria
tdx agent test --reeval
Test File Structure
Create test.yml in your agent directory:
agents/{project-name}/{agent-name}/
├── agent.yml
├── prompt.md
└── test.yml
test.yml Format
Single-Round Tests (Flat Format)
tests:
- name: greeting_test
tags: [smoke, core]
user_input: Hello
criteria: Should respond with a friendly greeting
- name: calculation_test
tags: [regression]
user_input: What is 2 + 2?
criteria: Should respond with the correct answer (4)
Multi-Round Tests (Rounds Format)
tests:
- name: context_memory_test
tags: [memory, core]
rounds:
- user_input: My name is Alice
criteria: Should acknowledge the name
- user_input: What's my name?
criteria: Should remember and respond with "Alice"
- name: multi_step_task
rounds:
- user_input: I want to analyze sales data
criteria: Should ask clarifying questions about the data
- user_input: It's in the sales_2024 table
criteria: Should acknowledge and proceed with analysis
Writing Good Criteria
Criteria are evaluated by a judge agent. Be specific and measurable:
# Good - specific and measurable
criteria: Should respond with the number 4
# Good - describes expected behavior
criteria: Should ask for the customer's email address before proceeding
# Good - includes negative constraints
criteria: Should provide help without mentioning competitor products
# Bad - too vague
criteria: Should give a good response
# Bad - subjective
criteria: Should be helpful and friendly
Writing Good Test Cases
Test Core Functionality
tests:
- name: primary_use_case
user_input: Help me with a billing question
criteria: Should ask clarifying questions about the billing issue
Re-evaluation Workflow
Iterate on criteria without re-running conversations:
# 1. Run tests to generate conversations
tdx agent test
# 2. Edit criteria in test.yml
# 3. Re-evaluate with cached conversations
tdx agent test --reeval
Cache is stored in .cache/tdx/last_agent_test_run.json.
Options
| Option | Description |
|---|---|
--name <name> |
Filter to specific test(s) by name (can repeat) |
--tags <tags> |
Filter to tests with specific tags (comma-separated) |
--dry-run |
Parse and validate without running |
--no-eval |
Run conversations without evaluation |
--reeval |
Re-evaluate last run with updated criteria |
Related Skills
- agent - Agent configuration and
pull/pushworkflow - agent-prompt - Writing effective system prompts
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
email-campaign
This skill should be used when the user asks to "create an email", "build an email campaign", "design an email template", "generate an email for a segment", "preview an email", or "push an email to Engage". Generates enterprise-grade HTML email templates with live preview in Treasure Studio and natural language editing, then pushes the final version to Treasure Engage.
action-report
YAML format reference for action reports rendered via preview_action_report. MUST be read before writing any action report YAML — defines the report structure (title, summary, actions array) and action item fields (as_is, to_be, reason, priority, category, impact) with incremental build workflow. Required by seo-analysis and any skill that produces prioritized recommendations.
grid-dashboard
YAML format reference for grid dashboards rendered via preview_grid_dashboard. MUST be read before writing any dashboard YAML — defines the page structure, 6 cell types (kpi, gauge, scores, table, chart, markdown), grid layout rules, cell merging syntax, and incremental build workflow. Required by seo-analysis and any skill that produces visual data dashboards.
seo-analysis
Runs SEO and AEO (Answer Engine Optimization) analysis on websites or specific pages. Use when the user mentions SEO, AEO, search rankings, search optimization, or wants to analyze how their pages perform in search engines and AI answers. Produces a data dashboard and action report with before/after recommendations.
aps-doc-core
Core documentation generation patterns and framework for Treasure Data pipeline layers. Provides shared templates, quality validation, testing framework, and Confluence integration used by all layer-specific documentation skills.
aps-doc-id-unification
Expert documentation generation for ID unification layers. Documents identity resolution algorithms, merge strategies, match rules, entity graphs, and multi-workflow orchestration. Use when documenting ID unification processes.
Didn't find tool you were looking for?