Agent skill

checkpoint

Robust workflow checkpoint and resume. Handles session interruption, state recovery, and safe resume across all workflow phases.

Stars 232
Forks 15

Install this agent skill to your Project

npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/clouder0/checkpoint

SKILL.md

Checkpoint & Resume Skill

Pattern for saving workflow state and resuming after interruption.

When to Load This Skill

  • Starting a workflow that might be interrupted
  • Resuming after claude -r
  • Recovering from crashes or timeouts

Core Concept

The dotagent workflow uses file-based state that survives session interruption:

Session crash/exit
        ↓
State files persist on disk:
  - memory/state/phase.json        # Which phase we're in
  - memory/state/execution.json    # Task-level progress
  - memory/reports/*.json          # Completed phase outputs
        ↓
claude -r (resume session)
        ↓
Orchestrator reads state, continues from last checkpoint

Checkpoint Files

Phase Checkpoint: memory/state/phase.json

json
{"workflow_id":"string","started_at":"ISO-8601","last_updated":"ISO-8601","current_phase":"REQUIREMENTS|ARCHITECTURE|IMPLEMENTATION|VERIFICATION|REFLECTION","phase_status":"pending|in_progress|complete|failed","completed_phases":[{"phase":"REQUIREMENTS","completed_at":"ISO-8601","output":"memory/reports/demand.json"}],"user_checkpoints":[{"phase":"REQUIREMENTS","approved_at":"ISO-8601"}],"interruption_safe":true}

Execution Checkpoint: memory/state/execution.json

See executor agent for detailed schema with:

  • Task status tracking
  • Timestamps (started_at, completed_at)
  • Output file paths for verification

Resume Protocol

Step 1: Detect Resume Scenario

ON WORKFLOW START:
  checkpoint = Read("memory/state/phase.json")

  IF checkpoint exists AND checkpoint.phase_status == "in_progress":
    → This is a RESUME
    → Log: "Detected interrupted workflow: {workflow_id}"
    → Go to Step 2
  ELSE:
    → Fresh start, create new checkpoint

Step 2: Validate State Integrity

VALIDATE:
  1. Check all referenced output files exist
  2. Check timestamps are reasonable (not future, not ancient)
  3. Check phase progression is valid
  4. Check for incomplete writes (interruption_safe flag)

IF validation fails:
  → Ask user: "State appears corrupted. Start fresh? [y/N]"
  → Archive corrupted state to memory/state/.archive/

Step 3: Determine Resume Point

RESUME LOGIC by phase:

REQUIREMENTS (in_progress):
  - Check if demand.json exists and is valid
  - If valid: advance to ARCHITECTURE
  - If not: re-spawn PM agent

ARCHITECTURE (in_progress):
  - Check for design files in memory/reports/designs/
  - Check for final_design.json
  - If final exists: advance to IMPLEMENTATION
  - If designs exist but no final: spawn Roundtable
  - If no designs: re-spawn Architects

IMPLEMENTATION (in_progress):
  - Read execution.json
  - Run executor recovery checks
  - Continue execution loop

VERIFICATION (in_progress):
  - Check for verification.json
  - If exists: advance to REFLECTION
  - If not: re-spawn QA

REFLECTION (in_progress):
  - Check for reflection file
  - If exists: workflow complete
  - If not: re-spawn Reflector

Step 4: Inform User and Continue

LOG to user:
  "Resuming workflow {id} from {phase} phase"
  "Last activity: {timestamp}"
  "Completed: {list of completed phases}"

IF current_phase requires user approval (was at checkpoint):
  → Re-confirm with user before proceeding

Safe Checkpoint Writing

Always update checkpoint atomically:

# BAD: Can leave corrupted state
Write(checkpoint_file, new_state)

# GOOD: Atomic update
1. Set interruption_safe = false
2. Write to checkpoint_file.tmp
3. Rename checkpoint_file.tmp → checkpoint_file
4. Set interruption_safe = true

Recovery from Specific Scenarios

Scenario 1: Ctrl-C During Subagent

State: task-001 status="running", no output file

Recovery:
- Detect orphaned task
- Increment attempts
- Reset to "pending"
- Re-spawn on next loop

Scenario 2: Crash After Write, Before State Update

State: task-001 status="running", output file EXISTS

Recovery:
- Detect output file
- Read status from output
- Update state to match

Scenario 3: Interrupted During User Approval

State: phase=ARCHITECTURE, has designs but no final_design

Recovery:
- Detect we're at approval checkpoint
- Re-present options to user
- Don't re-run architects

Scenario 4: Ancient State File

State: started_at is 7 days ago

Recovery:
- Warn user about stale state
- Offer to archive and start fresh
- If continue: proceed with caution

Checkpoint Frequency

Update checkpoint after:

  • Phase completion
  • User approval
  • Each task status change (in executor)
  • Before spawning expensive agents (opus)

Archiving Old State

When starting fresh or after completion:

Archive pattern:
  memory/state/.archive/{workflow_id}_{timestamp}/
    - phase.json
    - execution.json

Keep last 5 archives, delete older

Integration with Workflow

In /develop Command

markdown
## Resume Check

Before starting workflow:
1. Check for existing phase.json
2. If exists and in_progress:
   - Show resume prompt to user
   - "Resume workflow from {phase}? [Y/n]"
3. If user confirms: load checkpoint, continue
4. If user declines: archive old state, start fresh

In Each Phase Agent

markdown
## On Completion

Before returning:
1. Write output file
2. Update phase.json:
   - Add to completed_phases
   - Advance current_phase
   - Set phase_status = complete
3. Log checkpoint saved

Principles

  1. State on disk - Never rely on conversation memory alone
  2. Validate before resume - Don't blindly trust old state
  3. Inform the user - Always tell them what's being resumed
  4. Atomic writes - Prevent half-written state
  5. Archive, don't delete - Keep old state for debugging

Didn't find tool you were looking for?

Be as detailed as possible for better results