Promptfoo

Promptfoo is a CLI tool for testing and comparing LLM outputs.

Config File

The CLI auto-discovers `promptfooconfig.yaml` in the current directory. Pass `-c <path>` to use a config file in another location.

Supported extensions: .yaml, .json, .js
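
Because `.js` is among the supported extensions, the config can also be written as a JavaScript module. The following is a minimal sketch, assuming the file is named `promptfooconfig.js` and that promptfoo reads the object assigned to `module.exports`. The prompt text, variable name, and expected substring are placeholders.

```javascript
// promptfooconfig.js (assumed file name): the same kind of settings a YAML
// config holds, expressed as a plain object.
const config = {
  description: 'What this eval tests',
  prompts: ['Answer briefly: {{question}}'],
  providers: ['anthropic:messages:claude-sonnet-4-5-20250929'],
  tests: [
    {
      description: 'What this case tests',
      vars: { question: 'What is the capital of France?' }, // placeholder input
      assert: [{ type: 'icontains', value: 'paris' }],      // placeholder expectation
    },
  ],
};

module.exports = config;
```

A JavaScript config is mainly useful when tests or prompts need to be built programmatically; for static configs the YAML form is usually easier to read.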

Configuration

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "What this eval tests"

prompts:
  - file://prompt.txt
  - |
    Inline prompt with {{variable}} substitution

providers:
  - anthropic:messages:claude-sonnet-4-5-20250929

defaultTest:
  options:
    provider:
      config:
        temperature: 0.0
        max_tokens: 4096

tests:
  - description: "What this case tests"
    vars:
      variable: "value"
      from_file: file://data/input.txt
    assert:
      - type: contains
        value: "expected substring"

# Alternatively, load tests from a file. Use this in place of the inline
# list above: a YAML document may define the `tests` key only once.
# tests: file://cases/all.yaml

outputPath: ./results.json

evaluateOptions:
  maxConcurrency: 4
```

Provider IDs

| Model | ID |
| --- | --- |
| Opus 4.5 | `anthropic:messages:claude-opus-4-5-20251101` |
| Sonnet 4.5 | `anthropic:messages:claude-sonnet-4-5-20250929` |
| Haiku 4.5 | `anthropic:messages:claude-haiku-4-5-20251001` |

Provider config options include `temperature`, `max_tokens`, `top_p`, `top_k`, `tools`, and `tool_choice`.

Prompts

- `file://path.txt`: load the prompt from a file (path relative to the config)
- Inline string with `{{variable}}` Nunjucks substitution
- Chat format via JSON: `[{"role": "system", "content": "..."}, {"role": "user", "content": "{{input}}"}]`
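
To make the substitution concrete, here is a rough sketch of what happens to a chat-format prompt when a test supplies `vars`. The `render` helper is hypothetical and handles only plain `{{name}}` placeholders; promptfoo itself uses Nunjucks, which supports much more (filters, conditionals, loops) and may treat missing variables differently.

```javascript
// Hypothetical helper: approximates simple {{name}} substitution only.
// It is not Nunjucks and is not part of promptfoo.
function render(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    // Names missing from `vars` are left untouched in this sketch.
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

// Chat-format prompt, as it might appear in a JSON prompt file.
const chatPrompt = [
  { role: 'system', content: 'You are a concise assistant.' },
  { role: 'user', content: '{{input}}' },
];

// Values a test case would supply under `vars` (placeholder text).
const vars = { input: 'Summarize the release notes.' };

const rendered = chatPrompt.map((message) => ({
  ...message,
  content: render(message.content, vars),
}));

console.log(JSON.stringify(rendered, null, 2));
```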

Assertion Types

| Type | Use | Value |
| --- | --- | --- |
| `contains` | Substring match | `"expected text"` |
| `icontains` | Case-insensitive substring | `"expected text"` |
| `equals` | Exact match | `"exact value"` |
| `regex` | Pattern match | `"\\d{4}-\\d{2}-\\d{2}"` |
| `is-json` | Valid JSON output | (none) |
| `contains-json` | Output contains JSON | (none) |
| `starts-with` | Prefix match | `"prefix"` |
| `cost` | Max cost | `threshold: 0.01` |
| `latency` | Max response time (ms) | `threshold: 5000` |
| `javascript` | Custom JS expression | `output.includes('x')` |
| `python` | Custom Python | `file://check.py:fn_name` |
| `llm-rubric` | LLM-as-judge | rubric text |
| `similar` | Semantic similarity | `value: "text"`, `threshold: 0.8` |
| `model-graded-factuality` | Fact checking | (none) |

Prefix any assertion type with `not-` to negate it (e.g., `not-contains`).
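
As a rough mental model of the simplest string assertions and the `not-` prefix, the sketch below re-creates their behavior in a few lines. It is not promptfoo's implementation; the `checks` table and `evaluate` helper are hypothetical names introduced only for illustration.

```javascript
// Rough mental model only; not promptfoo's implementation.
const checks = {
  contains: (output, value) => output.includes(value),
  icontains: (output, value) => output.toLowerCase().includes(value.toLowerCase()),
  'starts-with': (output, value) => output.startsWith(value),
  regex: (output, value) => new RegExp(value).test(output),
};

// Hypothetical helper: a `not-` prefix inverts the result of the base check.
function evaluate(type, output, value) {
  const negated = type.startsWith('not-');
  const base = negated ? type.slice('not-'.length) : type;
  const result = checks[base](output, value);
  return negated ? !result : result;
}

const output = 'Released on 2024-05-01';
console.log(evaluate('contains', output, '2024'));                  // true
console.log(evaluate('not-contains', output, 'beta'));              // true
console.log(evaluate('regex', output, '\\d{4}-\\d{2}-\\d{2}'));     // true
```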

llm-rubric

Uses an LLM to grade output against a rubric:

```yaml
assert:
  - type: llm-rubric
    value: |
      The response should:
      - Mention at least 3 factors
      - Include specific examples
    threshold: 0.7
    provider: anthropic:messages:claude-sonnet-4-5-20250929
```

javascript

Inline expressions or multi-line function bodies. Both can access `output` (a string) and `context` (which includes `vars` and `prompt`):

```yaml
assert:
  - type: javascript
    value: output.length > 100 && output.includes('route')
  - type: javascript
    value: |
      const data = JSON.parse(output);
      return data.calories >= 200 && data.calories <= 300;
```
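
When a check outgrows a few inline lines, it can live in its own file. The sketch below assumes the check is referenced from the config with `value: file://calories.js` and that promptfoo calls the exported function with `(output, context)`. The file name `calories.js` and the `{ pass, score, reason }` return shape are assumptions to verify against the assertion docs. Unlike the inline calorie check, this version guards `JSON.parse` so that non-JSON output fails cleanly.

```javascript
// calories.js (hypothetical file name), assumed to be referenced as:
//   - type: javascript
//     value: file://calories.js
function checkCalories(output, context) {
  let data;
  try {
    data = JSON.parse(output);
  } catch (err) {
    // Non-JSON output fails the assertion instead of throwing.
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }

  const inRange =
    typeof data.calories === 'number' && data.calories >= 200 && data.calories <= 300;

  return {
    pass: inRange,
    score: inRange ? 1 : 0,
    reason: inRange ? 'calories within 200-300' : `calories out of range: ${data.calories}`,
  };
}

module.exports = checkCalories;
```

Returning a reason string alongside the pass/fail result makes failures easier to diagnose than a bare boolean.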

Test Organization

Split cases into separate files and reference them:

```yaml
tests:
  - file://cases/basic.yaml
  - file://cases/edge-cases.yaml
```

Each case file contains a YAML array of test objects.

CLI

```bash
npx promptfoo eval                         # Run with auto-discovered config
npx promptfoo eval -c path/to/config.yaml  # Specific config
npx promptfoo eval --filter-metadata key=v # Filter tests
npx promptfoo view                         # Web UI for results
npx promptfoo cache clear                  # Clear result cache
```

References

Consult the configuration reference and Anthropic provider docs for full details.
