Agent skill
testing-ai-agents
Use when testing AI agent code with pytest. Covers TDD for agent APIs, mocking LLM calls (NOT evaluating LLM outputs), pytest-asyncio patterns, FastAPI testing with httpx, SQLModel testing, and agent tool testing. NOT for evaluating LLM reasoning quality (use evals skill).
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/testing-ai-agents
SKILL.md
Testing AI Agents: TDD for Agent Code
Test-Driven Development for agent applications. This skill covers testing code correctness (deterministic, pass/fail), NOT measuring LLM reasoning quality (probabilistic, scored; use evals for that).
Critical Distinction: TDD vs Evals
| Aspect | TDD (This Skill) | Evals (Chapter 47) |
|---|---|---|
| Question | Does the code work correctly? | Does the LLM reason well? |
| Nature | Deterministic | Probabilistic |
| Output | Pass/Fail | Scores (0-1) |
| Tests | Functions, APIs, DB operations | Response quality, faithfulness |
| Speed | Fast (mocked LLM) | Slow (real LLM calls) |
| Cost | Zero (no API calls) | High (API calls required) |
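The distinction above can be made concrete with a tiny sketch (`parse_priority` is a hypothetical helper, not part of this skill's codebase): a TDD test makes a deterministic pass/fail assertion about code, while an eval would score a model response on a continuous scale.

```python
# TDD: deterministic -- the same input always gives the same verdict.
def parse_priority(raw: str) -> str:
    """Normalize a priority string (hypothetical helper)."""
    value = raw.strip().lower()
    if value not in {"low", "medium", "high"}:
        raise ValueError(f"invalid priority: {raw!r}")
    return value

# A TDD-style test: binary pass/fail, no LLM involved.
assert parse_priority("  HIGH ") == "high"

# An eval, by contrast, would score a model response, e.g.:
# score = faithfulness(answer, sources)  # float in [0, 1], varies run to run
```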
Quick Start: Project Setup
```bash
# Install testing dependencies
uv add --dev pytest pytest-asyncio httpx respx pytest-cov

# Append pytest config (>> avoids overwriting the pyproject.toml uv just edited)
cat >> pyproject.toml << 'EOF'

[tool.pytest.ini_options]
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"
testpaths = ["tests"]
EOF
```
Core Testing Patterns
Pattern 1: Async Test Setup
```python
# tests/conftest.py
import os

import pytest
from httpx import ASGITransport, AsyncClient
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
from sqlalchemy.pool import StaticPool
from sqlmodel import SQLModel
from sqlmodel.ext.asyncio.session import AsyncSession

# Set environment FIRST, before importing the app
os.environ.setdefault("DATABASE_URL", "sqlite+aiosqlite:///:memory:")
os.environ.setdefault("OPENAI_API_KEY", "test-key-not-used")

from app.main import app
from app.database import get_session
from app.auth import get_current_user

# Test database
TEST_DATABASE_URL = "sqlite+aiosqlite:///:memory:"

test_engine = create_async_engine(
    TEST_DATABASE_URL,
    echo=False,
    poolclass=StaticPool,
    connect_args={"check_same_thread": False},
)

TestAsyncSession = async_sessionmaker(
    test_engine,
    class_=AsyncSession,
    expire_on_commit=False,
)

# Mock user
TEST_USER = {"sub": "test-user-123", "email": "test@example.com"}


@pytest.fixture(scope="session")
def event_loop():
    """Event loop for session-scoped fixtures.

    Only needed for pytest-asyncio < 0.23; newer versions manage the loop
    via asyncio_default_fixture_loop_scope and deprecate this override.
    """
    import asyncio

    loop = asyncio.get_event_loop_policy().new_event_loop()
    yield loop
    loop.close()


@pytest.fixture(autouse=True)
async def setup_database():
    """Create tables before each test, drop them after."""
    async with test_engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)
    yield
    async with test_engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.drop_all)


async def get_test_session():
    async with TestAsyncSession() as session:
        yield session


def get_test_user():
    return TEST_USER


@pytest.fixture
async def client():
    """Async test client with mocked dependencies."""
    app.dependency_overrides[get_session] = get_test_session
    app.dependency_overrides[get_current_user] = get_test_user
    async with AsyncClient(
        transport=ASGITransport(app=app),
        base_url="http://test",
    ) as ac:
        yield ac
    app.dependency_overrides.clear()
```
Pattern 2: Testing FastAPI Endpoints
```python
# tests/test_tasks.py
import pytest
from httpx import AsyncClient


@pytest.mark.asyncio
async def test_create_task(client: AsyncClient):
    """Test creating a task via the API."""
    response = await client.post(
        "/api/tasks",
        json={"title": "Test Task", "priority": "high"},
    )
    assert response.status_code == 201
    data = response.json()
    assert data["title"] == "Test Task"
    assert data["priority"] == "high"
    assert data["status"] == "pending"


@pytest.mark.asyncio
async def test_get_task_not_found(client: AsyncClient):
    """Test 404 for a non-existent task."""
    response = await client.get("/api/tasks/99999")
    assert response.status_code == 404


@pytest.mark.asyncio
async def test_list_tasks_with_filter(client: AsyncClient):
    """Test filtering tasks by status."""
    # Create test data
    await client.post("/api/tasks", json={"title": "Task 1"})
    await client.post("/api/tasks", json={"title": "Task 2"})
    # Filter by status
    response = await client.get("/api/tasks", params={"status": "pending"})
    assert response.status_code == 200
    data = response.json()
    assert len(data) == 2
```
Pattern 3: Testing SQLModel Operations
```python
# tests/test_models.py
import pytest
from sqlmodel.ext.asyncio.session import AsyncSession

from app.models import Task, Project
from tests.conftest import TestAsyncSession  # requires tests/__init__.py


@pytest.fixture
async def session():
    """Direct database session for model testing."""
    async with TestAsyncSession() as session:
        yield session


@pytest.mark.asyncio
async def test_create_task(session: AsyncSession):
    """Test Task model creation."""
    task = Task(title="Test", priority="high")
    session.add(task)
    await session.commit()
    await session.refresh(task)
    assert task.id is not None
    assert task.created_at is not None


@pytest.mark.asyncio
async def test_cascade_delete(session: AsyncSession):
    """Test parent-child cascade deletion."""
    project = Project(name="Test Project")
    session.add(project)
    await session.commit()

    task = Task(title="Test", project_id=project.id)
    session.add(task)
    await session.commit()

    # Delete parent
    await session.delete(project)
    await session.commit()

    # Verify child was deleted
    result = await session.get(Task, task.id)
    assert result is None
```
Pattern 4: Mocking LLM Calls with respx
```python
# tests/test_agent_tools.py
import pytest
import respx
import httpx

# RateLimitError and the agent object are assumed to live in app.agent
from app.agent import agent, call_openai, RateLimitError


@pytest.mark.asyncio
@respx.mock
async def test_openai_completion():
    """Mock the OpenAI API response."""
    # Mock the API endpoint
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "content": "Hello, I can help with that!",
                    }
                }],
                "usage": {"total_tokens": 50},
            },
        )
    )
    # Call your function
    result = await call_openai("Say hello")
    assert "Hello" in result
    assert respx.calls.call_count == 1


@pytest.mark.asyncio
@respx.mock
async def test_openai_rate_limit():
    """Test rate limit handling."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(429, json={"error": "Rate limited"})
    )
    with pytest.raises(RateLimitError):
        await call_openai("Test")


@pytest.mark.asyncio
@respx.mock
async def test_tool_call_parsing():
    """Test that the agent parses tool calls correctly."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "tool_calls": [{
                            "id": "call_123",
                            "function": {
                                "name": "get_weather",
                                "arguments": '{"city": "London"}',
                            },
                        }],
                    }
                }]
            },
        )
    )
    result = await agent.process("What's the weather in London?")
    assert result.tool_calls[0].function.name == "get_weather"
    # The agent is expected to parse the arguments JSON string into a dict
    assert result.tool_calls[0].function.arguments["city"] == "London"
```
Pattern 5: Using pytest-mockllm
```python
# tests/test_with_mockllm.py


def test_anthropic_mock(mock_anthropic):
    """Test with pytest-mockllm for Anthropic."""
    mock_anthropic.add_response("I can help with that task!")

    from anthropic import Anthropic

    client = Anthropic(api_key="fake")
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello!"}],
    )
    assert "help" in response.content[0].text


def test_openai_mock(mock_openai):
    """Test with pytest-mockllm for OpenAI."""
    mock_openai.add_response("Task completed successfully.")

    from openai import OpenAI

    client = OpenAI(api_key="fake")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Complete this task"}],
    )
    assert "completed" in response.choices[0].message.content
```
Pattern 6: Testing Agent Tools in Isolation
```python
# tests/test_tools.py
import pytest

# ValidationError is assumed to be defined alongside the tools in app.tools
from app.tools import (
    ValidationError,
    format_response,
    search_database,
    validate_input,
)


@pytest.mark.asyncio
async def test_search_tool():
    """Test the database search tool function."""
    # This tests the tool logic, NOT the LLM
    results = await search_database(query="python")
    assert isinstance(results, list)
    assert all("python" in r["title"].lower() for r in results)


def test_format_response():
    """Test the response formatting utility."""
    raw = {"items": [1, 2, 3], "count": 3}
    formatted = format_response(raw)
    assert "3 items found" in formatted


def test_validate_input_rejects_injection():
    """Test that input validation blocks SQL injection."""
    malicious = "'; DROP TABLE users; --"
    with pytest.raises(ValidationError):
        validate_input(malicious)
```
Pattern 7: Integration Tests with Mocked LLM
```python
# tests/integration/test_agent_pipeline.py
import pytest
import respx
import httpx
from httpx import AsyncClient


@pytest.mark.asyncio
@respx.mock
async def test_complete_agent_flow(client: AsyncClient):
    """Test the full agent pipeline with a mocked LLM."""
    # Mock the LLM: first a tool call, then a final answer
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        side_effect=[
            # First call: LLM decides to use a tool
            httpx.Response(200, json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "tool_calls": [{
                            "id": "call_1",
                            "function": {
                                "name": "create_task",
                                "arguments": '{"title": "New Task"}',
                            },
                        }],
                    }
                }]
            }),
            # Second call: LLM responds with the result
            httpx.Response(200, json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "content": "I created the task 'New Task' for you.",
                    }
                }]
            }),
        ]
    )
    # Call the agent endpoint
    response = await client.post(
        "/api/agent/chat",
        json={"message": "Create a task called 'New Task'"},
    )
    assert response.status_code == 200
    data = response.json()
    assert "created" in data["response"].lower()
    # Verify the task was actually created in the DB
    tasks = await client.get("/api/tasks")
    assert any(t["title"] == "New Task" for t in tasks.json())
```
Pattern 8: Testing Error Handling
```python
# tests/test_error_handling.py
import pytest
import respx
import httpx
from httpx import AsyncClient

# The error types and agent object are assumed to live in app.agent
from app.agent import agent, AgentTimeoutError, AgentResponseError


@pytest.mark.asyncio
@respx.mock
async def test_llm_timeout_handling():
    """Test graceful handling of an LLM timeout."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        side_effect=httpx.TimeoutException("Connection timed out")
    )
    with pytest.raises(AgentTimeoutError) as exc_info:
        await agent.process("Test query")
    assert "LLM request timed out" in str(exc_info.value)


@pytest.mark.asyncio
@respx.mock
async def test_malformed_response_handling():
    """Test handling of a malformed LLM response."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(200, json={"invalid": "response"})
    )
    with pytest.raises(AgentResponseError):
        await agent.process("Test query")


@pytest.mark.asyncio
async def test_database_error_handling(client: AsyncClient):
    """Test that the API handles database errors gracefully."""
    # Force a unique-constraint violation
    await client.post("/api/tasks", json={"title": "Task 1"})
    response = await client.post("/api/tasks", json={"title": "Task 1"})  # Duplicate
    assert response.status_code == 400
    assert "already exists" in response.json()["error"]
```
Test Organization
```
tests/
├── conftest.py            # Shared fixtures
├── unit/
│   ├── test_models.py     # SQLModel tests
│   ├── test_tools.py      # Agent tool tests
│   └── test_utils.py      # Utility function tests
├── integration/
│   ├── test_api.py        # FastAPI endpoint tests
│   └── test_agent.py      # Agent pipeline tests (mocked LLM)
└── e2e/
    └── test_flows.py      # End-to-end flows (still mocked LLM)
```
Fixtures Reference
| Fixture | Scope | Purpose |
|---|---|---|
| `event_loop` | session | Shared async event loop |
| `setup_database` | function | Fresh DB per test |
| `session` | function | Direct DB access |
| `client` | function | Async HTTP client |
| `mock_user` | function | Test authentication |
Best Practices
DO
- Mock LLM calls at HTTP level (respx, httpx.MockTransport)
- Use in-memory SQLite for fast DB tests
- Test tool logic separately from LLM orchestration
- Override FastAPI dependencies for auth/DB
- Use factories for test data creation
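The factory tip above can be sketched with a plain helper; the names (`TaskData`, `make_task`) are illustrative, not part of this skill's codebase. Each call yields a unique, valid object, so tests state only the fields they actually care about.

```python
# A minimal test-data factory (illustrative names)
from dataclasses import dataclass
from itertools import count

_counter = count(1)  # guarantees unique titles across calls


@dataclass
class TaskData:
    title: str
    priority: str = "medium"
    status: str = "pending"


def make_task(**overrides) -> TaskData:
    """Build a valid task payload, overriding only what the test cares about."""
    fields = {"title": f"Task {next(_counter)}"}
    fields.update(overrides)
    return TaskData(**fields)


task = make_task(priority="high")  # valid object, unique title, one override
```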
DON'T
- Make real LLM API calls in unit tests
- Share state between tests
- Test LLM reasoning quality (that's evals)
- Skip error path testing
- Use production databases
Running Tests
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run a specific test file
pytest tests/test_tasks.py

# Run tests matching a pattern
pytest -k "test_create"

# Run async tests only
pytest -m asyncio

# Verbose output
pytest -v
```
CI/CD Integration
```yaml
# .github/workflows/test.yml
name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv sync --all-extras
      - run: uv run pytest --cov --cov-report=xml
      - uses: codecov/codecov-action@v4
```
References
For detailed patterns, see: