Agent skill
using-braintrust
Enables AI agents to use Braintrust for LLM evaluation, logging, and observability. Includes scripts for querying logs with SQL, running evals, and logging data.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/using-braintrust
SKILL.md
Using Braintrust
Braintrust is a platform for evaluating, logging, and monitoring LLM applications.
Listing projects
Use scripts/list_projects.py to see all available projects:
uv run /path/to/scripts/list_projects.py
Querying logs with SQL
Use the query_logs.py script to run SQL queries against Braintrust logs.
Always share the SQL query you used when reporting results, so the user understands what was executed.
Script location: scripts/query_logs.py (relative to this file)
Run it from the user's project directory (the one containing the .env file with BRAINTRUST_API_KEY):
uv run /path/to/scripts/query_logs.py --project "Project Name" --query "SQL_QUERY"
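SQL strings often contain quotes and parentheses, so it can be safer to invoke the script with an argument list than with a hand-quoted shell string. The sketch below assumes the same command shape as above; `build_query_command` and `run_query` are hypothetical helper names, not part of the skill, and the script's output format is not specified here.

```python
import subprocess


def build_query_command(script_path, project, sql):
    """Return the argv list for the query_logs.py invocation shown above."""
    return ["uv", "run", script_path, "--project", project, "--query", sql]


def run_query(script_path, project, sql):
    """Run the script and return its raw stdout (format depends on the script)."""
    cmd = build_query_command(script_path, project, sql)
    # Passing a list avoids shell quoting problems with SQL text.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Keeping the command as a list also makes it easy to echo the exact SQL back to the user, as recommended above.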
Common queries
Count logs from last 24 hours:
SELECT count(*) as count FROM logs WHERE created > now() - interval 1 day
Get recent logs:
SELECT input, output, created FROM logs ORDER BY created DESC LIMIT 10
Filter by metadata:
SELECT input, output FROM logs WHERE metadata.user_id = 'user123' LIMIT 20
Filter by time range:
SELECT * FROM logs WHERE created > now() - interval 7 day LIMIT 50
Aggregate by field:
SELECT metadata.model, count(*) as count FROM logs GROUP BY metadata.model
Group by hour:
SELECT hour(created) as hr, count(*) as count FROM logs GROUP BY hour(created)
SQL quirks in Braintrust
- Time functions: Use hour(), day(), month(), year() instead of date_trunc()
  - ✅ hour(created)
  - ❌ date_trunc('hour', created)
- Intervals: Use interval 1 day, interval 7 day, interval 1 hour (no quotes, singular unit)
- Nested fields: Use dot notation: metadata.user_id, scores.Factuality, metrics.duration
- Table name: Always use FROM logs (the script handles project scoping)
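The interval and time-function rules above are easy to get wrong when queries are assembled by hand. A minimal sketch of a query builder that follows them; the helper names are hypothetical, only the units shown in this document are allowed, and combining WHERE with GROUP BY in one query is an assumption based on standard SQL rather than something this document demonstrates.

```python
ALLOWED_UNITS = {"hour", "day"}  # interval units that appear in this document


def recent_filter(amount, unit):
    """Build a time filter such as: created > now() - interval 7 day"""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"use a singular unit from {sorted(ALLOWED_UNITS)}, got {unit!r}")
    if not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount must be a positive integer")
    # No quotes around the interval, and the unit stays singular.
    return f"created > now() - interval {amount} {unit}"


def count_by_hour_query(amount, unit):
    """Count logs per hour using hour(created) rather than date_trunc()."""
    return (
        "SELECT hour(created) as hr, count(*) as count FROM logs "
        f"WHERE {recent_filter(amount, unit)} GROUP BY hour(created)"
    )
```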
SQL reference
Operators:
- =, !=, >, <, >=, <=
- IS NULL, IS NOT NULL
- LIKE 'pattern%'
- AND, OR, NOT
Aggregations:
- count(*), count(field)
- avg(field), sum(field)
- min(field), max(field)
Time filters:
- created > now() - interval 1 day
- created > now() - interval 7 day
- created > now() - interval 1 hour
Logging data
Use scripts/log_data.py to log data to a project:
uv run /path/to/scripts/log_data.py --project "Project Name" --input "query" --output "response"
With metadata:
--input "query" --output "response" --metadata '{"user_id": "123"}'
Batch from JSON:
--data '[{"input": "a", "output": "b"}, {"input": "c", "output": "d"}]'
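Hand-writing the JSON for a batch is error-prone once values contain quotes. The sketch below serializes records with `json.dumps` and shell-quotes the result for the `--data` flag. `batch_data_argument` is a hypothetical helper, and requiring both `input` and `output` keys is an assumption that mirrors the example above; the script may accept other fields.

```python
import json
import shlex


def batch_data_argument(records):
    """Serialize records into the JSON array expected by --data."""
    for i, record in enumerate(records):
        # Assumption: each record carries at least "input" and "output".
        missing = {"input", "output"} - set(record)
        if missing:
            raise ValueError(f"record {i} is missing {sorted(missing)}")
    return json.dumps(records)


if __name__ == "__main__":
    payload = batch_data_argument([
        {"input": "a", "output": "b"},
        {"input": "it's quoted", "output": "d"},
    ])
    # shlex.quote makes the payload safe to paste into a POSIX shell command.
    print("--data", shlex.quote(payload))
```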
Running evaluations
Use scripts/run_eval.py to run evaluations:
uv run /path/to/scripts/run_eval.py --project "Project Name" --data '[{"input": "test", "expected": "test"}]'
From file:
--data-file test_cases.json --scorer factuality
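A test-case file can be generated from Python instead of being edited by hand. This sketch assumes the file passed to `--data-file` holds the same JSON array of objects as the inline `--data` example (an assumption; the file format is not described here), and `write_test_cases` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_test_cases(path, cases):
    """Write test cases as a JSON array and return the file path."""
    path = Path(path)
    path.write_text(json.dumps(cases, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    write_test_cases("test_cases.json", [
        {"input": "test", "expected": "test"},
        {"input": "What is 2+2?", "expected": "4"},
    ])
```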
Setup
Create a .env file in your project directory:
BRAINTRUST_API_KEY=your-api-key-here
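Before running any script, it can help to confirm that the key is actually present in the .env file. The following is a minimal sketch of such a check using a simple KEY=VALUE parser; `read_env_value` is a hypothetical helper, and how the skill's scripts load the file themselves is not specified here.

```python
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value for key from a KEY=VALUE file, or None if absent or empty."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'") or None
    return None


if __name__ == "__main__":
    if read_env_value(".env", "BRAINTRUST_API_KEY") is None:
        print("BRAINTRUST_API_KEY is missing from .env")
```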
Writing evaluation code (SDK)
For custom evaluation logic, use the SDK directly.
IMPORTANT: First argument to Eval() is the project name (positional).
import braintrust
from autoevals import Factuality

braintrust.Eval(
    "My Project",  # Project name (required, positional)
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),  # my_llm_call is a placeholder for your own model call
    scores=[Factuality],
)
Common mistakes:
- ❌ Eval(project_name="My Project", ...) - Wrong!
- ❌ Eval(name="My Project", ...) - Wrong!
- ✅ Eval("My Project", data=..., task=..., scores=...) - Correct!
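For custom evaluation logic, the scorer itself can be a small function. The sketch below assumes the SDK accepts a plain function taking `input`, `output`, and `expected` and returning a number in the `scores` list; treat that as an assumption to verify against the SDK documentation. `exact_match` and `run_example_eval` are hypothetical names.

```python
def exact_match(input, output, expected):
    """Score 1.0 when output equals expected after trimming whitespace, else 0.0."""
    return 1.0 if str(output).strip() == str(expected).strip() else 0.0


def run_example_eval(task):
    """Run an eval with the custom scorer; task is your own function of input."""
    import braintrust  # imported lazily so the scorer can be tested without the SDK

    return braintrust.Eval(
        "My Project",  # project name stays the first positional argument
        data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
        task=task,
        scores=[exact_match],  # assumption: plain functions are accepted as scorers
    )
```

Keeping the scorer free of SDK imports means it can be unit-tested on its own before it is used in an evaluation.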
Writing logging code (SDK)
import braintrust
logger = braintrust.init_logger(project="My Project")
logger.log(input="query", output="response", metadata={"user_id": "123"})
logger.flush() # Always flush!
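When logging several records, a try/finally block is one way to make sure the flush still happens if a log call fails partway through. A minimal sketch; `log_records` is a hypothetical helper that works with any object exposing the `log` and `flush` methods used above.

```python
def log_records(logger, records):
    """Log each record as keyword arguments, flushing even if a call raises."""
    count = 0
    try:
        for record in records:
            logger.log(**record)
            count += 1
    finally:
        # Runs on success and on error, so buffered logs are not left behind.
        logger.flush()
    return count
```

With the SDK, this could be called as `log_records(braintrust.init_logger(project="My Project"), [{"input": "query", "output": "response"}])`.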
Common issues
- "Eval() got an unexpected keyword argument 'project_name'": Use positional argument
- Logs not appearing: Call logger.flush() after logging
- Authentication errors: Create .env file with BRAINTRUST_API_KEY=your-key