Agent skill

read-eval-logs

View and analyse Inspect evaluation log files using the Python API

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/read-eval-logs

SKILL.md

Analysing Eval Log Files

This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.

Quick Reference

CLI Commands

bash

# List all logs in the default log directory (./logs or INSPECT_LOG_DIR)
uv run inspect log list --json

# List logs with specific status
uv run inspect log list --json --status success
uv run inspect log list --json --status error

# List retryable logs (error/cancelled without subsequent success)
uv run inspect log list --json --retryable

# Dump a log file as JSON (works with any format: .eval or .json)
uv run inspect log dump <log_file_path>

# Convert between log formats
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval

# Get JSON schema for log files
uv run inspect log schema

Interactive Log Viewer

bash

# Start the interactive log viewer (updates automatically)
uv run inspect view

Python API

Key Imports

python

from inspect_ai.log import (
    # Listing and reading
    list_eval_logs,
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    
    # Writing
    write_eval_log,
    
    # Utilities
    retryable_eval_logs,
    recompute_metrics,
    resolve_sample_attachments,
    
    # Types
    EvalLog,
    EvalLogInfo,
    EvalSample,
    EvalSampleSummary,
)

Listing Logs

python

# List all logs in default directory
logs = list_eval_logs()

# List logs in a specific directory
logs = list_eval_logs(log_dir="./experiment-logs")

# Filter by format
logs = list_eval_logs(formats=["eval"])  # Only .eval files

# Filter with a custom function (receives header-only EvalLog)
logs = list_eval_logs(filter=lambda log: log.status == "success")

# Non-recursive listing
logs = list_eval_logs(recursive=False)

Reading Logs

python

# Read full log
log = read_eval_log("path/to/logfile.eval")

# Read header only (fast, excludes samples)
log = read_eval_log("path/to/logfile.eval", header_only=True)

# Read with attachments resolved
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)

Reading Samples

python

# Read a single sample by ID and epoch
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)

# Read a single sample by UUID
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")

# Stream all samples (memory efficient - one at a time)
for sample in read_eval_log_samples("path/to/logfile.eval"):
    process(sample)

# Stream samples from incomplete logs
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
    process(sample)

# Read sample summaries (fast, includes scoring info)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")

Filtering Samples

python

# Read only samples with errors using summaries for filtering
errors = []
for summary in read_eval_log_sample_summaries(log_file):
    if summary.error is not None:
        errors.append(
            read_eval_log_sample(log_file, summary.id, summary.epoch)
        )

EvalLog Structure

The EvalLog object contains:

Field	Type	Description
`version`	`int`	File format version (currently 2)
`status`	`str`	`"started"`, `"success"`, `"cancelled"`, or `"error"`
`eval`	`EvalSpec`	Task, model, creation time, config
`plan`	`EvalPlan`	Solvers and generation config
`results`	`EvalResults`	Aggregate scores and metrics
`stats`	`EvalStats`	Runtime, model usage statistics
`error`	`EvalError`	Error info if `status == "error"`
`samples`	`list[EvalSample]`	Individual samples (if not header_only)
`reductions`	`list[EvalSampleReductions]`	Multi-epoch reductions
`location`	`str`	URI where log was read from

Always Check Status

python

log = read_eval_log("path/to/logfile.eval")
if log.status == "success":
    # Safe to analyse results
    for score in log.results.scores:
        print(f"{score.name}: {score.metrics}")

EvalSample Structure

Each sample contains:

Field	Type	Description
`id`	`int \| str`	Unique sample ID
`epoch`	`int`	Epoch number
`input`	`str \| list[ChatMessage]`	Sample input
`target`	`str \| list[str]`	Expected target(s)
`messages`	`list[ChatMessage]`	Full conversation history
`output`	`ModelOutput`	Model's output
`scores`	`dict[str, Score]`	Scores from scorers
`metadata`	`dict[str, Any]`	Sample metadata
`store`	`dict[str, Any]`	State at end of execution
`events`	`list[Event]`	Transcript events
`error`	`EvalError`	Error if sample failed
`total_time`	`float`	Total sample runtime
`model_usage`	`dict[str, ModelUsage]`	Token usage

Common Analysis Patterns

Get Aggregate Metrics

python

log = read_eval_log(log_file, header_only=True)
if log.results:
    for score in log.results.scores:
        print(f"Scorer: {score.name}")
        for metric_name, metric in score.metrics.items():
            print(f"  {metric_name}: {metric.value}")

Analyse Failed Samples

python

log = read_eval_log(log_file)
if log.samples:
    failed = [s for s in log.samples if s.error is not None]
    for sample in failed:
        print(f"Sample {sample.id}: {sample.error.message}")

Extract Model Usage

python

log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
    print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")

Compare Multiple Runs

python

logs = list_eval_logs(filter=lambda l: l.eval.task == "my_task")
for log_info in logs:
    log = read_eval_log(log_info, header_only=True)
    if log.results and log.results.scores:
        accuracy = log.results.scores[0].metrics.get("accuracy")
        print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")

Find Retryable Logs

python

all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
    print(f"Can retry: {log_info.name}")

Log File Formats

Type	Description
`.eval`	Binary format, ~1/8 size of JSON, fast incremental access
`.json`	Text format, human-readable, slower for large files

Both formats are fully supported by the API and can be intermixed.

Working with Large Logs

For logs too large to fit in memory:

Use .eval format - supports compression and incremental access
Read header only - read_eval_log(log_file, header_only=True)
Stream samples - read_eval_log_samples() yields one at a time
Use summaries - read_eval_log_sample_summaries() for quick overview

Modifying Logs

python

# Read, modify, and write back
log = read_eval_log(log_file)
# ... modify log ...
write_eval_log(log)  # Uses log.location automatically

# Write to a new location
write_eval_log(log, location="new_path.eval")

# Recompute metrics after score edits
recompute_metrics(log)
write_eval_log(log)

Environment Variables

Variable	Description
`INSPECT_LOG_DIR`	Default log directory (default: `./logs`)
`INSPECT_EVAL_LOG_FILE_PATTERN`	Log filename pattern (e.g., `{task}_{model}_{id}`)

Related Tools

Inspect Scout - Transcript analysis
Inspect Viz - Data visualization
Log Dataframes - Extract pandas DataFrames from logs (see inspect_ai.analysis)

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/read-eval-logs
License: MIT License

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Analysing Eval Log Files

Quick Reference

CLI Commands

Interactive Log Viewer

Python API

Key Imports

Listing Logs

Reading Logs

Reading Samples

Filtering Samples

EvalLog Structure

Always Check Status

EvalSample Structure

Common Analysis Patterns

Get Aggregate Metrics

Analyse Failed Samples

Extract Model Usage

Compare Multiple Runs

Find Retryable Logs

Log File Formats

Working with Large Logs

Modifying Logs

Environment Variables

Related Tools

Recommended Agent Skills

agent-ops-spec

agent-ops-state

agent-ops-spec

agent-ops-testing

agent-ops-testing

agent-ops-state