Agent skill
behavioral-evals
Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.
Install this agent skill to your Project
npx add-skill https://github.com/google-gemini/gemini-cli/tree/main/.gemini/skills/behavioral-evals
SKILL.md
Behavioral Evals
Overview
Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.
🔄 Workflow Decision Tree
- Does a prompt/tool change need validation?
  - No -> Normal integration tests.
  - Yes -> Continue below.
- Is it UI/Interaction heavy?
  - Yes -> Use `appEvalTest` (AppRig). See creating.md.
  - No -> Use `evalTest` (TestRig). See creating.md.
- Is it a new test?
  - Yes -> Set policy to `USUALLY_PASSES`.
  - No -> `ALWAYS_PASSES` (locks in regression).
- Are you fixing a failure or promoting a test?
  - Fixing -> See fixing.md.
  - Promoting -> See promoting.md.
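The first three questions of the decision tree can be read as a small pure function. The sketch below is illustrative only: `ChangeInfo`, `Plan`, and `planEval` are hypothetical names that do not exist in the eval framework; only the string values (`appEvalTest`, `evalTest`, `AppRig`, `TestRig`, `USUALLY_PASSES`, `ALWAYS_PASSES`) come from this skill.

```typescript
// Hypothetical types: they model the decision tree only and are not framework APIs.
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface ChangeInfo {
  needsBehavioralValidation: boolean; // Does a prompt/tool change need validation?
  uiHeavy: boolean; // Is the scenario UI/interaction heavy?
  isNewTest: boolean; // Is this a brand-new eval?
}

type Plan =
  | { kind: "integration-test" }
  | {
      kind: "behavioral-eval";
      helper: "appEvalTest" | "evalTest";
      rig: "AppRig" | "TestRig";
      policy: Policy;
    };

function planEval(change: ChangeInfo): Plan {
  // No behavioral validation needed: fall back to normal integration tests.
  if (!change.needsBehavioralValidation) {
    return { kind: "integration-test" };
  }
  return {
    kind: "behavioral-eval",
    // UI-heavy scenarios use the app-level helper and rig.
    helper: change.uiHeavy ? "appEvalTest" : "evalTest",
    rig: change.uiHeavy ? "AppRig" : "TestRig",
    // New tests start lenient; existing ones lock in the regression.
    policy: change.isNewTest ? "USUALLY_PASSES" : "ALWAYS_PASSES",
  };
}

const plan = planEval({
  needsBehavioralValidation: true,
  uiHeavy: false,
  isNewTest: true,
});
console.log(plan);
```

The fixing and promoting branch is left out because it points to separate guides rather than to a configuration choice.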
📋 Quick Checklist
1. Setup Workspace
Seed the workspace with the files the scenario needs, using the files object, to simulate a realistic project (e.g., a Node.js project with a package.json).
- Details in creating.md
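As a rough illustration of workspace seeding, the sketch below builds a files object for a small Node.js project. It assumes the files object maps workspace-relative paths to file contents; that shape, the project name `demo-app`, and the file contents are assumptions for illustration, so check creating.md for the actual contract.

```typescript
// Assumption: `files` maps workspace-relative paths to file contents.
// The project name and contents below are illustrative only.
const packageJson: string = JSON.stringify(
  {
    name: "demo-app",
    version: "1.0.0",
    scripts: { test: "vitest run" },
  },
  null,
  2,
);

const files: Record<string, string> = {
  "package.json": packageJson,
  "src/index.ts":
    "export const add = (a: number, b: number): number => a + b;\n",
};

// Sanity check: the seeded package.json should parse as valid JSON.
const parsed = JSON.parse(packageJson) as { name: string };
console.log(parsed.name, Object.keys(files));
```

Keeping the seed small but realistic (a manifest plus one source file) gives the agent enough context to make the decision under test without adding noise.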
2. Write Assertions
Audit agent decisions using rig.setBreakpoint() (AppRig only) or by verifying the index (position) of tool calls in the output of rig.readToolLogs().
- Details in creating.md
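One way to picture index verification is to compare the positions of two tool calls in a log array. The sketch below is self-contained and does not call the real rig: `ToolLogEntry`, `indexOfTool`, and `calledBefore` are hypothetical, and the actual shape returned by rig.readToolLogs() may differ, so treat the field name `toolName` and the sample tool names as assumptions.

```typescript
// Hypothetical log entry shape; the real rig.readToolLogs() output may differ.
interface ToolLogEntry {
  toolName: string;
}

// Position of the first call to `name`, or -1 if it was never called.
function indexOfTool(logs: ToolLogEntry[], name: string): number {
  return logs.findIndex((entry) => entry.toolName === name);
}

// True only if both tools were called and `first` came before `second`.
function calledBefore(
  logs: ToolLogEntry[],
  first: string,
  second: string,
): boolean {
  const i = indexOfTool(logs, first);
  const j = indexOfTool(logs, second);
  return i !== -1 && j !== -1 && i < j;
}

// Illustrative sample data standing in for real tool logs.
const sampleLogs: ToolLogEntry[] = [
  { toolName: "read_file" },
  { toolName: "replace" },
];

console.log(calledBefore(sampleLogs, "read_file", "replace"));
```

Asserting on relative order rather than on exact indices tends to be less brittle when the agent makes extra, harmless tool calls in between.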
3. Verify
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
- See evals/README.md for running commands.
📦 Bundled Resources
Detailed procedural guides:
- creating.md: Assertion strategies, Rig selection, Mock MCPs.
- fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
- promoting.md: Candidate identification criteria and threshold guidelines.