Agent skill

agent-test

Run automated tests for LLM agents using `tdx agent test`. Covers the test.yml format (user_input and criteria), single- and multi-round tests, evaluation by a judge agent, and the criteria development workflow.


Install this agent skill to your Project

npx add-skill https://github.com/treasure-data/td-skills/tree/main/tdx-skills/agent-test

SKILL.md

tdx Agent Test

Run automated tests against agents using YAML test definitions. A judge agent evaluates each response against its criteria and returns a binary pass/fail result.

Commands

bash
# Run tests from current agent directory
tdx agent test

# Run tests from specific path
tdx agent test ./agents/my-project/my-agent/

# Run specific tests by name
tdx agent test --name "greeting_test" --name "context_test"

# Run tests with specific tags
tdx agent test --tags "smoke,regression"

# Validate test file without running
tdx agent test --dry-run

# Run without evaluation (just execute conversations)
tdx agent test --no-eval

# Re-evaluate last test run with updated criteria
tdx agent test --reeval
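
The commands above can be chained into a simple pre-flight gate. The following is a minimal sketch, not part of tdx: `run_smoke_gate` and the `TDX_BIN` variable are hypothetical names, and the sketch assumes that `tdx agent test` exits with a non-zero status when validation or a test fails, which this document does not confirm.

```bash
# Hypothetical helper, not part of tdx: validate test.yml, then run smoke tests.
# Assumption: `tdx agent test` returns a non-zero exit status on failure.

TDX_BIN="${TDX_BIN:-tdx}"   # overridable so the helper can be tried with a stub

run_smoke_gate() {
  # Step 1: parse and validate test.yml without running any conversation.
  "$TDX_BIN" agent test --dry-run || { echo "validation failed" >&2; return 1; }

  # Step 2: run only the tests tagged "smoke".
  "$TDX_BIN" agent test --tags "smoke" || { echo "smoke tests failed" >&2; return 1; }

  echo "smoke gate passed"
}
```

Call `run_smoke_gate` from the agent directory. If the exit-status assumption does not hold for your tdx version, inspect the command output instead.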

Test File Structure

Create test.yml in your agent directory:

agents/{project-name}/{agent-name}/
├── agent.yml
├── prompt.md
└── test.yml

test.yml Format

Single-Round Tests (Flat Format)

yaml
tests:
  - name: greeting_test
    tags: [smoke, core]
    user_input: Hello
    criteria: Should respond with a friendly greeting

  - name: calculation_test
    tags: [regression]
    user_input: What is 2 + 2?
    criteria: Should respond with the correct answer (4)

Multi-Round Tests (Rounds Format)

yaml
tests:
  - name: context_memory_test
    tags: [memory, core]
    rounds:
      - user_input: My name is Alice
        criteria: Should acknowledge the name
      - user_input: What's my name?
        criteria: Should remember and respond with "Alice"

  - name: multi_step_task
    rounds:
      - user_input: I want to analyze sales data
        criteria: Should ask clarifying questions about the data
      - user_input: It's in the sales_2024 table
        criteria: Should acknowledge and proceed with analysis
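
To find the values that `--name` accepts, it can help to list the test names defined in a file. The sketch below is a hypothetical helper (`list_test_names` is not a tdx command) and assumes every test starts with a `- name:` line, as in the examples above. Round entries start with `- user_input:` and are therefore skipped.

```bash
# Hypothetical helper, not part of tdx: print the name of each test in test.yml.
# Assumption: each test entry begins with a "- name:" line, as in the examples.

list_test_names() {
  # Strip the leading "- name:" prefix and print only the lines where it matched.
  sed -n 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p' "${1:-test.yml}"
}
```

Running `list_test_names test.yml` prints one test name per line. This is a text match, not a YAML parser, so unusual layouts (for example, inline mappings) will be missed.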

Writing Good Criteria

Criteria are evaluated by a judge agent. Be specific and measurable:

yaml
# Good - specific and measurable
criteria: Should respond with the number 4

# Good - describes expected behavior
criteria: Should ask for the customer's email address before proceeding

# Good - includes negative constraints
criteria: Should provide help without mentioning competitor products

# Bad - too vague
criteria: Should give a good response

# Bad - subjective
criteria: Should be helpful and friendly
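
A quick text search can surface criteria that resemble the "Bad" examples above. This is a rough heuristic sketch, not a tdx feature: `find_vague_criteria` is a hypothetical name, and the word list (good, helpful, friendly) is an assumption taken from those examples. A match is only a prompt to review the line, not proof that the criterion is vague.

```bash
# Hypothetical lint, not part of tdx: flag criteria lines containing vague words.
# The word list is an illustrative assumption drawn from the "Bad" examples.

find_vague_criteria() {
  # Keep only criteria lines (with line numbers), then match vague whole words.
  grep -n 'criteria:' "${1:-test.yml}" | grep -Eiw 'good|helpful|friendly'
}
```

Running `find_vague_criteria test.yml` prints each matching line prefixed with its line number. The function returns a non-zero status when nothing matches.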

Writing Good Test Cases

Test Core Functionality

yaml
tests:
  - name: primary_use_case
    user_input: Help me with a billing question
    criteria: Should ask clarifying questions about the billing issue

Re-evaluation Workflow

Iterate on criteria without re-running conversations:

bash
# 1. Run tests to generate conversations
tdx agent test

# 2. Edit criteria in test.yml

# 3. Re-evaluate with cached conversations
tdx agent test --reeval

The conversations from the last run are cached in .cache/tdx/last_agent_test_run.json.
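
Since re-evaluation works from cached conversations, a small guard can decide whether a full run is needed first. The sketch below is hypothetical (`choose_test_command` is not a tdx command). It assumes that `--reeval` needs the cache file to exist and that the cache path is relative to the directory where the command is run. Neither assumption is confirmed by this document.

```bash
# Hypothetical helper, not part of tdx: choose between a fresh run and --reeval.
# Assumption: --reeval needs the cache file from a previous run to exist.

CACHE_FILE=".cache/tdx/last_agent_test_run.json"   # path as given in this document

choose_test_command() {
  if [ -f "$CACHE_FILE" ]; then
    # A previous run is cached, so only the evaluation needs to be repeated.
    echo "tdx agent test --reeval"
  else
    # No cached conversations yet, so a full run is required first.
    echo "tdx agent test"
  fi
}
```

The function only prints the command it selects. Run it with, for example, `eval "$(choose_test_command)"`.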

Options

| Option | Description |
| --- | --- |
| `--name <name>` | Filter to specific test(s) by name (can repeat) |
| `--tags <tags>` | Filter to tests with specific tags (comma-separated) |
| `--dry-run` | Parse and validate without running |
| `--no-eval` | Run conversations without evaluation |
| `--reeval` | Re-evaluate last run with updated criteria |

Related Skills

  • agent - Agent configuration and pull/push workflow
  • agent-prompt - Writing effective system prompts
