Agent skill

error-recovery

Use when encountering failures - assess severity, preserve evidence, execute rollback decision tree, and verify post-recovery state

Install this agent skill to your Project

npx add-skill https://github.com/troykelly/codex-skills/tree/main/skills/error-recovery

Error Recovery

Overview

Handle failures gracefully with structured recovery.

Core principle: When things break, don't panic. Assess, preserve, recover, verify, then document.

Announce at start: "I'm using error-recovery to handle this failure."

The Recovery Protocol

Error Detected
      │
      ▼
┌─────────────┐
│ 1. ASSESS   │ ← Severity? Scope? Impact?
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 2. PRESERVE │ ← Capture evidence before it's lost
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 3. RECOVER  │ ← Follow decision tree
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 4. VERIFY   │ ← Confirm clean state
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 5. DOCUMENT │ ← Record what happened
└─────────────┘

Step 1: Assess Severity

Severity Levels

Level	Description	Examples
Critical	System unusable, data at risk	Build completely broken, tests cause data loss
Major	Significant functionality broken	Feature doesn't work, many tests failing
Minor	Isolated issue, workaround exists	Single test flaky, style error
Info	Warning only, not blocking	Deprecation notice, performance hint

Assessment Questions

markdown

## Error Assessment

**Error:** [Description of error]
**Location:** [Where it occurred]

### Severity Checklist
- [ ] Is the system still functional?
- [ ] Is any data at risk?
- [ ] Are other features affected?
- [ ] Is this blocking progress?

### Scope
- Files affected: [list]
- Features affected: [list]
- Users affected: [none/some/all]
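
One way to turn the checklist answers into a severity level is sketched below. The function name `classify_severity` and the exact mapping are assumptions for illustration, not part of the skill; adjust the rules to match how your project reads the severity table.

```shell
# Hypothetical helper: map yes/no checklist answers to a severity level.
# Arguments (each "yes" or "no"):
#   $1 system still functional?  $2 data at risk?
#   $3 other features affected?  $4 blocking progress?
classify_severity() {
  system_functional=$1
  data_at_risk=$2
  others_affected=$3
  blocking=$4

  if [ "$system_functional" = "no" ] || [ "$data_at_risk" = "yes" ]; then
    echo "Critical"   # system unusable or data at risk
  elif [ "$others_affected" = "yes" ]; then
    echo "Major"      # the damage extends beyond one isolated spot
  elif [ "$blocking" = "yes" ]; then
    echo "Minor"      # isolated, but still in the way
  else
    echo "Info"       # warning only
  fi
}
```

For example, `classify_severity yes no no yes` describes an isolated but blocking problem and prints `Minor`.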

Step 2: Preserve Evidence

Capture BEFORE attempting fixes:

Error Logs

bash

# Capture error output
pnpm test 2>&1 | tee error-log.txt

# Or from failed command
./failing-command 2>&1 | tee error-log.txt

Stack Traces

markdown

## Stack Trace

Error: Connection refused
    at Database.connect (src/db/connection.ts:45)
    at UserService.init (src/services/user.ts:23)
    at main (src/index.ts:12)

State Capture

bash

# Git state
git status
git diff

# Environment state
env | grep -E "NODE|NPM|PATH"

# Dependency state
pnpm list

Screenshot (if visual)

For UI errors, capture screenshots before changes.
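
The capture steps above can be bundled into one helper so nothing is forgotten under pressure. This is a minimal sketch: `capture_evidence` and the file names it writes are hypothetical. Unlike piping through `tee`, it also records the exit code of the failing command.

```shell
# Hypothetical helper: run a failing command and keep its output, its exit
# code, and (when inside a Git work tree) the repository state in one directory.
# Usage: capture_evidence EVIDENCE_DIR COMMAND [ARGS...]
capture_evidence() {
  evidence_dir=$1
  shift
  mkdir -p "$evidence_dir" || return 1

  # Combined stdout/stderr of the failing command, plus its exit code.
  cmd_status=0
  "$@" >"$evidence_dir/error-log.txt" 2>&1 || cmd_status=$?
  echo "$cmd_status" >"$evidence_dir/exit-code.txt"

  # Git state, only if the current directory is inside a work tree.
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    git status >"$evidence_dir/git-status.txt" 2>&1
    git diff >"$evidence_dir/git-diff.patch" 2>&1
  fi

  # Propagate the original exit code to the caller.
  return "$cmd_status"
}
```

A typical call would be `capture_evidence evidence/run-1 pnpm test`, where the directory name is an arbitrary choice.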

Step 3: Recover

Decision Tree

What type of failure?
         │
    ┌────┴────┬────────────┬────────────┐
    │         │            │            │
  Code      Build      Environment   External
  Error     Error        Issue       Service
    │         │            │            │
    ▼         ▼            ▼            ▼
 ┌─────┐   ┌─────┐      ┌─────┐      ┌─────┐
 │Git  │   │Clean│      │Re-  │      │Wait/│
 │reco-│   │build│      │init │      │Retry│
 │very │   │     │      │     │      │     │
 └─────┘   └─────┘      └─────┘      └─────┘
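
The same tree can be written as a small dispatch function, which may be handy in scripts. The function name and the action labels it prints are hypothetical; they only mirror the four branches above.

```shell
# Hypothetical helper: map a failure type to the recovery branch of the tree.
recovery_action() {
  case $1 in
    code)        echo "git-recovery" ;;
    build)       echo "clean-build" ;;
    environment) echo "re-init" ;;
    external)    echo "wait-retry" ;;
    *)
      echo "unknown failure type: $1" >&2
      return 1
      ;;
  esac
}
```

Returning a non-zero status for an unrecognized type keeps the caller from guessing a recovery path.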

Code Error Recovery

Single file broken:

bash

# Revert just that file
git checkout HEAD -- path/to/file.ts

Feature broken (multiple files):

bash

# Find last good commit
git log --oneline

# Move the branch back to that commit (soft reset keeps changes staged)
git reset --soft [GOOD_COMMIT]

# Or hard reset (permanently discards uncommitted changes; preserve evidence first)
git reset --hard [GOOD_COMMIT]

Working directory is a mess:

bash

# Stash current changes, including untracked files
git stash --include-untracked

# Verify clean state
git status

# Optionally recover stash later
git stash pop
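
Because a hard reset discards uncommitted work, it can help to leave a trail first. The sketch below is one possible approach, not part of the skill: `safe_hard_reset` and the `backup/before-reset-...` branch naming are assumptions. It marks the current commit with a branch and stashes uncommitted work before resetting.

```shell
# Hypothetical helper: keep a backup branch and a stash before a hard reset.
# Usage: safe_hard_reset GOOD_COMMIT
# Prints the name of the backup branch on success.
safe_hard_reset() {
  good_commit=$1
  backup_branch="backup/before-reset-$(date +%Y%m%d-%H%M%S)"

  # Refuse to continue if the target commit does not exist.
  git rev-parse --verify --quiet "$good_commit^{commit}" >/dev/null || {
    echo "unknown commit: $good_commit" >&2
    return 1
  }

  # Remember where HEAD was.
  git branch "$backup_branch" || return 1

  # Save uncommitted work, including untracked files.
  # With nothing to save, git stash reports that and still exits successfully.
  git stash push --include-untracked \
    --message "before reset to $good_commit" >&2 || return 1

  git reset --hard "$good_commit" >&2 || return 1
  echo "$backup_branch"
}
```

After `safe_hard_reset [GOOD_COMMIT]`, the discarded commits stay reachable through the printed branch name and the uncommitted work through `git stash list`.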

Build Error Recovery

bash

# Clean build artifacts
rm -rf node_modules dist build .cache

# Reinstall dependencies
pnpm install --frozen-lockfile  # Clean install from lock file

# Rebuild
pnpm build

Environment Error Recovery

bash

# Check environment
env | grep -E "NODE|PNPM"

# Reset Node modules
rm -rf node_modules
pnpm install --frozen-lockfile

# If using nvm, switch to the Node version pinned in .nvmrc
nvm use

# Re-run init script
./scripts/init.sh

External Service Error

bash

# Check if service is up
curl -I https://service.example.com/health

# If down, wait and retry
sleep 60
curl -I https://service.example.com/health

# If still down, check status page
# Document as external blocker
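
The wait-and-retry step can be generalized into a retry loop with exponential backoff. This is a sketch under assumptions: `retry_with_backoff` is a hypothetical helper, and the attempt count and delays are yours to choose.

```shell
# Hypothetical helper: retry a command, doubling the wait between attempts.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1

  while :; do
    # Success on any attempt ends the loop.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "still failing after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For the health check above, a call might look like `retry_with_backoff 4 15 curl -fsS -o /dev/null https://service.example.com/health`. The `-f` flag makes curl exit non-zero on HTTP error responses, which `curl -I` alone does not do. If the helper still returns non-zero, document the outage as an external blocker.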

Step 4: Verify

After recovery, verify clean state:

Basic Verification

bash

# Clean working directory
git status
# Expected: "nothing to commit, working tree clean" or known changes

# Tests pass
pnpm test

# Build succeeds
pnpm build

# Types check
pnpm typecheck

Functionality Verification

bash

# Run the specific thing that was broken
# (the filter flag depends on the test runner: --grep for Mocha, -t for Jest or Vitest)
pnpm test --grep "specific test"

# Or verify the feature manually
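
The verification commands can be run as one batch that reports every result instead of stopping at the first failure. A minimal sketch, assuming a hypothetical `run_checks` helper that takes each check as a single command string:

```shell
# Hypothetical helper: run every check, report each result, and return
# non-zero if any check failed. Each argument is one command line run via sh -c.
# Command output is discarded here; rerun a failing check directly to see it.
run_checks() {
  failed=0
  for check in "$@"; do
    if sh -c "$check" >/dev/null 2>&1; then
      echo "PASS  $check"
    else
      echo "FAIL  $check"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

With the commands from this step, that would be `run_checks "pnpm test" "pnpm build" "pnpm typecheck"`.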

Step 5: Document

Issue Comment

bash

gh issue comment [ISSUE_NUMBER] --body "## Error Recovery

**Error encountered:** [Description]

**Severity:** Major

**Evidence:**
\`\`\`
[Error output]
\`\`\`

**Recovery actions:**
1. [Action 1]
2. [Action 2]

**Verification:**
- [x] Tests pass
- [x] Build succeeds

**Root cause:** [If known]

**Prevention:** [If applicable]
"

Knowledge Graph

javascript

// Store for future reference
mcp__memory__add_observations({
  observations: [{
    entityName: "Issue #[NUMBER]",
    contents: [
      "Encountered [error type] on [date]",
      "Caused by: [root cause]",
      "Resolved by: [recovery action]"
    ]
  }]
});
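
For the issue comment, the body can also be generated from the preserved log rather than pasted by hand. The sketch below assumes a hypothetical `render_recovery_comment` helper and an arbitrary choice of the last 20 log lines as evidence. It builds the code fence with `printf` so no backticks need escaping.

```shell
# Hypothetical helper: print a Markdown comment body for the recovery report.
# Usage: render_recovery_comment DESCRIPTION SEVERITY LOG_FILE
render_recovery_comment() {
  error_description=$1
  severity=$2
  log_file=$3
  fence=$(printf '\140\140\140')   # three backticks (octal 140 is a backtick)

  printf '## Error Recovery\n\n'
  printf '**Error encountered:** %s\n\n' "$error_description"
  printf '**Severity:** %s\n\n' "$severity"
  printf '**Evidence:**\n'
  printf '%s\n' "$fence"
  tail -n 20 "$log_file"           # last 20 lines of the preserved log
  printf '%s\n' "$fence"
}
```

The output can be piped straight to the GitHub CLI, which reads the body from standard input when given `--body-file -`: `render_recovery_comment "Connection refused" Major error-log.txt | gh issue comment [ISSUE_NUMBER] --body-file -`.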

Common Recovery Patterns

"Tests were passing, now failing"

bash

# What changed over the last three commits (including uncommitted work)?
git diff HEAD~3

# Did dependencies change?
git diff HEAD~3 pnpm-lock.yaml

# Clean reinstall
rm -rf node_modules && pnpm install --frozen-lockfile

"Works locally, fails in CI"

bash

# Check for environment differences
# - Node version
# - OS differences
# - Env vars

# Run with CI-like settings
CI=true pnpm test
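
A Node version mismatch is one of the quicker differences to rule out. The sketch below compares major versions only. `same_node_major` is a hypothetical helper, and it assumes both inputs are numeric versions such as `20` or `v20.11.0`; nvm aliases such as `lts/*` are not handled.

```shell
# Hypothetical helper: succeed when two Node versions share a major version.
# Usage: same_node_major EXPECTED ACTUAL
same_node_major() {
  expected=${1#v}   # drop a leading "v" if present
  actual=${2#v}
  # Compare everything before the first dot.
  [ "${expected%%.*}" = "${actual%%.*}" ]
}
```

Assuming the project pins its version in `.nvmrc`, a check could read `same_node_major "$(cat .nvmrc)" "$(node --version)" || echo "Node major version differs from .nvmrc" >&2`.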

"Build was working, now broken"

bash

# Check TypeScript errors
pnpm typecheck

# Check for circular dependencies
pnpm dlx madge --circular src/

# Clean build
rm -rf dist && pnpm build

"I broke everything"

bash

# Don't panic
# Find last known good state
git log --oneline

# Reset to that state
git reset --hard [GOOD_COMMIT]

# Verify
pnpm test

# Start again more carefully

Escalation

If recovery fails after 2-3 attempts:

markdown

## Escalation: Unrecoverable Error

**Issue:** #[NUMBER]

**Error:** [Description]

**Recovery attempts:**
1. [Attempt 1] - [Result]
2. [Attempt 2] - [Result]

**Current state:** [Broken/Partially working]

**Evidence preserved:** [Links to logs, screenshots]

**Requesting help with:** [Specific question]

Mark issue as Blocked and await human input.
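
The attempt limit can be enforced mechanically: try each recovery step in turn, verify after each, and stop with a failure status once the list is exhausted. This is a sketch; `recover_or_escalate` is a hypothetical helper, and the steps you pass decide how many attempts are made.

```shell
# Hypothetical helper: apply recovery steps in order and re-run the
# verification command after each one. Succeeds on the first step that
# leaves a verified state; fails when every step has been tried.
# Usage: recover_or_escalate "VERIFY_CMD" "RECOVERY_STEP" ["RECOVERY_STEP"...]
recover_or_escalate() {
  verify_cmd=$1
  shift
  attempt=0

  for step in "$@"; do
    attempt=$((attempt + 1))
    echo "attempt $attempt: $step" >&2
    if sh -c "$step" && sh -c "$verify_cmd"; then
      echo "recovered after attempt $attempt"
      return 0
    fi
  done

  echo "escalate: $attempt recovery attempts failed"
  return 1
}
```

For example, `recover_or_escalate "pnpm test" "pnpm install --frozen-lockfile" "rm -rf dist && pnpm build"` tries two recoveries. A non-zero exit is the signal to fill in the escalation template and stop.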

Checklist

When error occurs:

- [ ] Severity assessed
- [ ] Evidence preserved (logs, state, screenshots)
- [ ] Recovery action selected
- [ ] Recovery executed
- [ ] Clean state verified
- [ ] Tests pass
- [ ] Build succeeds
- [ ] Issue documented

Integration

This skill is called by:

- issue-driven-development - When errors occur
- ci-monitoring - CI failures

This skill may trigger:

- research-after-failure - If cause is unknown
- Issue update via issue-lifecycle

Maintainer

troykelly Core maintainer

Source details

Full Name: troykelly/codex-skills
Branch: main
Path in repo: skills/error-recovery
