Agent skill

clade-incident-runbook

Respond to Anthropic API incidents — outages, degraded performance, Use when working with incident-runbook patterns. error spikes, and rate limit issues in production. Trigger with "anthropic down", "claude outage", "anthropic incident", "claude not responding", "anthropic 529".

Stars 1,803
Forks 241

Install this agent skill to your Project

npx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/claude-pack/skills/clade-incident-runbook

SKILL.md

Anthropic Incident Runbook

Overview

Respond to Anthropic API incidents in production — outages, sustained 529 errors, authentication failures, and timeouts. Covers status page checking, severity classification, model fallback activation, communication, and post-incident review.

Step 1: Confirm the Issue

bash
# Check Anthropic status
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f\"Status: {d['status']['description']} ({d['status']['indicator']})\")"

# Test API directly
curl -s -w "\nHTTP %{http_code} in %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "claude-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":5,"messages":[{"role":"user","content":"ping"}]}'

Step 2: Classify Severity

Symptom Severity Action
529 overloaded (intermittent) Low SDK auto-retries handle this
529 overloaded (sustained 5+ min) Medium Switch to fallback model
401/403 on all requests High API key issue — check console
All requests timing out High Check status page, activate fallback
Status page shows incident Varies Follow status page updates

Step 3: Activate Fallback

typescript
async function callWithFallback(params: Anthropic.MessageCreateParams) {
  try {
    return await client.messages.create(params);
  } catch (err) {
    if (err instanceof Anthropic.APIError && (err.status === 529 || err.status === 500)) {
      // Try a different model
      if (params.model.includes('opus')) {
        return await client.messages.create({ ...params, model: 'claude-sonnet-4-20250514' });
      }
      if (params.model.includes('sonnet')) {
        return await client.messages.create({ ...params, model: 'claude-haiku-4-5-20251001' });
      }
    }
    throw err;
  }
}

Step 4: Communicate

  • Update your status page if user-facing
  • Note: Anthropic incidents typically resolve in 15-60 minutes

Step 5: Post-Incident

  • Check your error logs for the incident window
  • Calculate impact (failed requests, user impact)
  • Verify all systems recovered

Output

  • Incident confirmed via status page and direct API test
  • Severity classified (Low/Medium/High) based on symptoms
  • Fallback activated if needed (downgrade model or queue requests)
  • Impact assessed and documented post-incident

Error Handling

Error Cause Solution
API Error Check error type and status code See clade-common-errors

Examples

See Step 1 (curl status check and API test), Step 2 (severity classification table), Step 3 (fallback code with model downgrade), and Step 5 (post-incident checklist) above.

Resources

Next Steps

See clade-reliability-patterns for building resilient integrations.

Prerequisites

  • Production Claude integration deployed
  • Fallback model configuration in place (see clade-reliability-patterns)
  • Monitoring/alerting configured (see clade-observability)

Instructions

Step 1: Review the patterns below

Each section contains production-ready code examples. Copy and adapt them to your use case.

Step 2: Apply to your codebase

Integrate the patterns that match your requirements. Test each change individually.

Step 3: Verify

Run your test suite to confirm the integration works correctly.

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results