Agent skill

debug-stuck-eval

Debug stuck Hawk/Inspect AI evaluations. Use when user mentions "stuck eval", "eval not progressing", "eval hanging", "samples not completing", "eval set frozen", "runner stuck", "500 errors in eval", "retry loop", "eval timeout", or asks why an evaluation isn't finishing.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/debug-stuck-eval

SKILL.md

Quick Checklist

Verify auth: hawk auth access-token > /dev/null || echo "Run 'hawk login' first"
Get eval-set-id from user
Check status: hawk status <eval-set-id> - JSON report with pod state, logs, metrics
View logs: hawk logs <eval-set-id> or hawk logs -f for follow mode
List samples: hawk list samples <eval-set-id> - see completion status
Look for error patterns (see below)
Test API directly if logs show retries without clear errors
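
The read-only steps of this checklist can be chained into a single triage pass. The following is a minimal sketch, assuming the `hawk` subcommands behave as described above; the function name `triage_eval` is hypothetical and not part of Hawk.

```shell
# Hypothetical helper: runs the read-only checklist steps for one eval set.
# Usage: triage_eval <eval-set-id>
triage_eval() {
  local eval_set_id
  eval_set_id="$1"
  if [ -z "$eval_set_id" ]; then
    echo "usage: triage_eval <eval-set-id>" >&2
    return 2
  fi

  # Step 1: verify auth before anything else.
  if ! hawk auth access-token > /dev/null; then
    echo "Run 'hawk login' first" >&2
    return 1
  fi

  # Steps 3-5: status report, logs, and per-sample completion status.
  hawk status "$eval_set_id"
  hawk logs "$eval_set_id"
  hawk list samples "$eval_set_id"
}
```

Scanning the output for error patterns and testing the API directly remain manual steps, since they depend on what the logs show.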

Error Patterns

| Log Pattern | Meaning | Resolution |
| --- | --- | --- |
| `Retrying request to /responses` | OpenAI SDK hiding actual error | Test API directly with curl to see real error |
| `500 - Internal server error` | API issue | Download buffer, find failing request, test through middleman AND directly to provider |
| `400 - invalid_request_error` | Token/context limit exceeded | Check message count and model context window |
| `Pod UID mismatch` | Sandbox pod was killed and restarted | No fix needed; the sample errored out and Inspect will retry |
| Empty output, `pending: true` | API returned malformed response | Restart eval (buffer resumes) |
| OOMKilled in pod status | Memory exhaustion | Increase pod memory limits |
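
When logs are long, it can help to tag each line with the matching row of the table. The sketch below is one way to do that, assuming the patterns appear verbatim in the log text; `classify_log_line` and its short labels are hypothetical. The `pending: true` row is left out because it describes sample output rather than a log line.

```shell
# Hypothetical helper: map one log line to a label and the resolution
# listed in the Error Patterns table.
classify_log_line() {
  case "$1" in
    *"Retrying request to /responses"*) echo "sdk-retry: test API directly with curl" ;;
    *"500 - Internal server error"*)    echo "api-500: download buffer and replay the failing request" ;;
    *"400 - invalid_request_error"*)    echo "api-400: check message count and context window" ;;
    *"Pod UID mismatch"*)               echo "pod-restart: no fix needed, Inspect will retry" ;;
    *"OOMKilled"*)                      echo "oom: increase pod memory limits" ;;
    *)                                  echo "unknown" ;;
  esac
}

# Possible usage (not run here): count how often each pattern occurs.
#   hawk logs <eval-set-id> | while IFS= read -r line; do
#     classify_log_line "$line"
#   done | sort | uniq -c
```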

Key Techniques

SDK hides errors by design - The OpenAI SDK hides transient errors during retry backoff. "Retrying request" logs don't show the actual error. Use curl to see real errors.
FAIL-OK patterns are fine - Alternating failures and successes mean the eval IS progressing. Only worry about consistent FAIL-FAIL-FAIL patterns.
Use S3 for buffer access - Download .buffer/ from S3 rather than accessing the runner pod directly.
Read .eval files with inspect_ai - Use `from inspect_ai.log import read_eval_log` instead of manually extracting zips.
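
The FAIL-OK rule above can be written as a small check over the most recent request outcomes. This is a sketch under assumptions: the outcomes have already been reduced to the words `OK` and `FAIL`, and the function name `looks_stuck` and the window size are hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) only if the last N outcomes are all FAIL.
# Usage: looks_stuck N outcome [outcome ...]
looks_stuck() {
  local window outcome
  window="$1"
  shift
  # Fewer than N outcomes: not enough evidence to call the eval stuck.
  [ "$#" -ge "$window" ] || return 1
  # Drop everything except the trailing N outcomes.
  shift $(( $# - window ))
  for outcome in "$@"; do
    [ "$outcome" = "FAIL" ] || return 1
  done
  return 0
}
```

With a window of 3, `looks_stuck 3 OK FAIL FAIL FAIL` succeeds, while an alternating sequence such as `FAIL OK FAIL OK FAIL` does not, which matches the guidance that mixed results mean the eval is still progressing.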

Test API Directly

Middleman is the auth proxy. If middleman fails but direct provider calls work, it's a middleman issue.

bash

TOKEN=$(hawk auth access-token)

# Test through middleman
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}]}'

# Test OpenAI-compatible
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100}'
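
To apply the middleman-versus-provider rule, the HTTP status codes of the two calls can be compared. The sketch below assumes the status codes have already been collected (for instance with curl's `-s -o /dev/null -w '%{http_code}'` options). `locate_failure` is a hypothetical name, and only the first branch comes from the text above; the remaining branches are assumptions about how to read the other combinations.

```shell
# Hypothetical helper: compare the HTTP status of the middleman call with the
# status of a direct provider call.
# Usage: locate_failure <middleman_status> <provider_status>
locate_failure() {
  local middleman_ok provider_ok
  middleman_ok=no
  provider_ok=no
  # Treat any 2xx status as success; anything else (including 000) as failure.
  case "$1" in 2??) middleman_ok=yes ;; esac
  case "$2" in 2??) provider_ok=yes ;; esac

  case "$middleman_ok/$provider_ok" in
    no/yes)  echo "middleman issue" ;;                    # stated in the text
    no/no)   echo "likely provider issue" ;;              # assumption
    yes/yes) echo "API path healthy" ;;                   # assumption
    yes/no)  echo "check direct provider credentials" ;;  # assumption
  esac
}
```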

Recovery

bash

# Delete stuck eval and restart
hawk delete <eval-set-id>
hawk eval-set <config.yaml>

The sample buffer in S3 allows Inspect to resume from where it left off (unless you use --no-resume).

HTTP Retry Count

Task progress logs include "HTTP retries: X". High retry counts indicate API instability even while tasks complete.

Severity: Retry count × wait time = stuck duration. For example, 45 retries × 1800 s = 81,000 s, or about 22.5 hours stuck.
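
The severity estimate is simple arithmetic and can be scripted for quick checks. A minimal sketch, assuming every retry waits the same number of seconds; `stuck_hours` is a hypothetical helper name.

```shell
# Hypothetical helper: estimate hours lost to retries.
# Usage: stuck_hours <retry_count> <wait_seconds_per_retry>
stuck_hours() {
  awk -v retries="$1" -v wait_s="$2" \
    'BEGIN { printf "%.1f\n", retries * wait_s / 3600 }'
}

stuck_hours 45 1800   # prints 22.5
```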

More Details

See docs/debugging-stuck-evals.md for:

Sample buffer SQL queries
Detailed API testing examples
Escalation checklist

References

Inspect AI Model Providers - Model configuration
Inspect AI Eval Logs - .eval file format

Filing Issues

Middleman: https://github.com/metr-middleman/middleman-server/issues
Hawk: Linear issue on Evals Execution team
Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai/issues

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/debug-stuck-eval
License: MIT License
