Agent skill
checkpoint
Robust workflow checkpoint and resume. Handles session interruption, state recovery, and safe resume across all workflow phases.
Install this agent skill to your Project
npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/clouder0/checkpoint
SKILL.md
Checkpoint & Resume Skill
Pattern for saving workflow state and resuming after interruption.
When to Load This Skill
- Starting a workflow that might be interrupted
- Resuming after claude -r
- Recovering from crashes or timeouts
Core Concept
The dotagent workflow uses file-based state that survives session interruption:
Session crash/exit
↓
State files persist on disk:
- memory/state/phase.json # Which phase we're in
- memory/state/execution.json # Task-level progress
- memory/reports/*.json # Completed phase outputs
↓
claude -r (resume session)
↓
Orchestrator reads state, continues from last checkpoint
Checkpoint Files
Phase Checkpoint: memory/state/phase.json
json
{
  "workflow_id": "string",
  "started_at": "ISO-8601",
  "last_updated": "ISO-8601",
  "current_phase": "REQUIREMENTS|ARCHITECTURE|IMPLEMENTATION|VERIFICATION|REFLECTION",
  "phase_status": "pending|in_progress|complete|failed",
  "completed_phases": [
    {
      "phase": "REQUIREMENTS",
      "completed_at": "ISO-8601",
      "output": "memory/reports/demand.json"
    }
  ],
  "user_checkpoints": [
    {
      "phase": "REQUIREMENTS",
      "approved_at": "ISO-8601"
    }
  ],
  "interruption_safe": true
}
Execution Checkpoint: memory/state/execution.json
See executor agent for detailed schema with:
- Task status tracking
- Timestamps (started_at, completed_at)
- Output file paths for verification
Resume Protocol
Step 1: Detect Resume Scenario
ON WORKFLOW START:
checkpoint = Read("memory/state/phase.json")
IF checkpoint exists AND checkpoint.phase_status == "in_progress":
→ This is a RESUME
→ Log: "Detected interrupted workflow: {workflow_id}"
→ Go to Step 2
ELSE:
→ Fresh start, create new checkpoint
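A minimal Python sketch of the detection step above. The function names `load_checkpoint` and `is_resume` are hypothetical, and treating an unparseable file as "no checkpoint" is an assumption; a caller may prefer to route a corrupt file to validation instead.

```python
import json
from pathlib import Path
from typing import Optional

PHASE_FILE = Path("memory/state/phase.json")


def load_checkpoint(path: Path = PHASE_FILE) -> Optional[dict]:
    """Return the parsed checkpoint, or None if it is missing or unreadable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    # A checkpoint must be a JSON object; anything else is treated as absent.
    return data if isinstance(data, dict) else None


def is_resume(checkpoint: Optional[dict]) -> bool:
    """A resume is an existing checkpoint whose phase is still in progress."""
    return checkpoint is not None and checkpoint.get("phase_status") == "in_progress"
```

If `is_resume(load_checkpoint())` is false, the orchestrator would create a new checkpoint; otherwise it moves on to Step 2.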
Step 2: Validate State Integrity
VALIDATE:
1. Check all referenced output files exist
2. Check timestamps are reasonable (not future, not ancient)
3. Check phase progression is valid
4. Check for incomplete writes (interruption_safe flag)
IF validation fails:
→ Ask user: "State appears corrupted. Start fresh? [y/N]"
→ On confirmation: archive corrupted state to memory/state/.archive/ before starting fresh
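The four checks can be sketched in Python as below. This is an illustration, not the skill's actual validator: `validate_checkpoint` and `parse_ts` are hypothetical names, the 7-day `MAX_AGE` is an assumed reading of "not ancient", and "valid progression" is assumed to mean that completed phases appear in workflow order and the current phase does not skip ahead.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

PHASES = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION", "VERIFICATION", "REFLECTION"]
MAX_AGE = timedelta(days=7)  # assumption: the source only says "not ancient"


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; naive values are treated as UTC."""
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def validate_checkpoint(cp: dict, root: Path, now: datetime) -> list[str]:
    """Return a list of problems; empty means usable. `now` must be timezone-aware."""
    problems = []
    completed = cp.get("completed_phases", [])
    # 1. Every referenced output file must exist.
    for done in completed:
        output = done.get("output")
        if not output or not (root / output).is_file():
            problems.append(f"missing output: {output}")
    # 2. Timestamps must be neither in the future nor older than MAX_AGE.
    for key in ("started_at", "last_updated"):
        try:
            ts = parse_ts(cp[key])
        except (KeyError, ValueError, AttributeError):
            problems.append(f"unreadable timestamp: {key}")
            continue
        if ts > now:
            problems.append(f"{key} is in the future")
        elif now - ts > MAX_AGE:
            problems.append(f"{key} is older than {MAX_AGE.days} days")
    # 3. Completed phases must follow workflow order; current must not skip ahead.
    names = [done.get("phase") for done in completed]
    current = cp.get("current_phase")
    if names != PHASES[:len(names)]:
        problems.append("completed phases are out of order")
    if current not in PHASES or PHASES.index(current) > len(names):
        problems.append(f"current phase {current!r} does not follow completed phases")
    # 4. A false or missing flag suggests an interrupted write.
    if cp.get("interruption_safe") is not True:
        problems.append("interruption_safe is not true (possible partial write)")
    return problems
```

Returning a list of problems rather than a boolean lets the orchestrator show the user why the state looks corrupted before asking whether to start fresh.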
Step 3: Determine Resume Point
RESUME LOGIC by phase:
REQUIREMENTS (in_progress):
- Check if demand.json exists and is valid
- If valid: advance to ARCHITECTURE
- If not: re-spawn PM agent
ARCHITECTURE (in_progress):
- Check for design files in memory/reports/designs/
- Check for final_design.json
- If final exists: advance to IMPLEMENTATION
- If designs exist but no final: spawn Roundtable
- If no designs: re-spawn Architects
IMPLEMENTATION (in_progress):
- Read execution.json
- Run executor recovery checks
- Continue execution loop
VERIFICATION (in_progress):
- Check for verification.json
- If exists: advance to REFLECTION
- If not: re-spawn QA
REFLECTION (in_progress):
- Check for reflection file
- If exists: workflow complete
- If not: re-spawn Reflector
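The per-phase resume logic above could be expressed as a single lookup function. This is a sketch under assumptions: the source does not say where final_design.json and verification.json live or what the reflection file is called, so this example places them directly under the reports directory and uses the hypothetical name `reflection.json`. The returned action strings are also illustrative.

```python
from pathlib import Path


def _valid_json(path: Path) -> bool:
    """True if the file exists and parses as JSON."""
    import json
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):
        return False


def resume_action(phase: str, reports: Path) -> str:
    """Map an in-progress phase to the next orchestrator action."""
    if phase == "REQUIREMENTS":
        ok = _valid_json(reports / "demand.json")
        return "advance:ARCHITECTURE" if ok else "respawn:pm"
    if phase == "ARCHITECTURE":
        # Assumed location: final_design.json sits next to the designs/ folder.
        if (reports / "final_design.json").is_file():
            return "advance:IMPLEMENTATION"
        designs = reports / "designs"
        has_designs = designs.is_dir() and any(designs.iterdir())
        return "spawn:roundtable" if has_designs else "respawn:architects"
    if phase == "IMPLEMENTATION":
        # The executor reads execution.json and runs its own recovery checks.
        return "continue:executor"
    if phase == "VERIFICATION":
        found = (reports / "verification.json").is_file()
        return "advance:REFLECTION" if found else "respawn:qa"
    if phase == "REFLECTION":
        # Hypothetical file name; the source only says "reflection file".
        found = (reports / "reflection.json").is_file()
        return "done" if found else "respawn:reflector"
    raise ValueError(f"unknown phase: {phase}")
```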
Step 4: Inform User and Continue
LOG to user:
"Resuming workflow {id} from {phase} phase"
"Last activity: {timestamp}"
"Completed: {list of completed phases}"
IF current_phase requires user approval (was at checkpoint):
→ Re-confirm with user before proceeding
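For illustration, the three log lines could be built from the phase checkpoint like this. `resume_summary` is a hypothetical helper; it reads only fields that appear in the phase.json schema.

```python
def resume_summary(cp: dict) -> list[str]:
    """Build the user-facing resume messages from a phase checkpoint."""
    done = ", ".join(p["phase"] for p in cp.get("completed_phases", [])) or "none"
    return [
        f"Resuming workflow {cp['workflow_id']} from {cp['current_phase']} phase",
        f"Last activity: {cp['last_updated']}",
        f"Completed: {done}",
    ]
```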
Safe Checkpoint Writing
Always update checkpoint atomically:
# BAD: Can leave corrupted state
Write(checkpoint_file, new_state)
# GOOD: Atomic update
1. Set interruption_safe = false
2. Write to checkpoint_file.tmp
3. Rename checkpoint_file.tmp → checkpoint_file
4. Set interruption_safe = true
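One way to implement the write-then-rename pattern in Python is sketched below. `write_checkpoint` is a hypothetical name. The source does not specify where the flag from steps 1 and 4 is persisted, so this sketch assumes the rename alone provides atomicity and stores `interruption_safe: true` in the file it writes.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file in the same directory, then rename over path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: a fully written file is by definition interruption-safe.
    state = {**state, "interruption_safe": True}
    # The temp file must share a directory (and filesystem) with the target
    # so that the final rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomically swap in the new checkpoint
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A reader of `path` then sees either the old checkpoint or the new one, never a half-written file.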
Recovery from Specific Scenarios
Scenario 1: Ctrl-C During Subagent
State: task-001 status="running", no output file
Recovery:
- Detect orphaned task
- Increment attempts
- Reset to "pending"
- Re-spawn on next loop
Scenario 2: Crash After Write, Before State Update
State: task-001 status="running", output file EXISTS
Recovery:
- Detect output file
- Read status from output
- Update state to match
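Scenarios 1 and 2 differ only in whether the output file exists, so they can share one recovery function. The sketch below assumes a task record with `status`, `attempts`, and `output` fields and an output file that is JSON with a `status` key. The real execution.json schema lives with the executor agent, so these field names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def _output_status(path: Path) -> Optional[str]:
    """Return the status recorded in an output file, or None if unusable."""
    try:
        result = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    status = result.get("status") if isinstance(result, dict) else None
    return status if isinstance(status, str) else None


def recover_task(task: dict, root: Path) -> dict:
    """Return a recovered copy of a task that was left in the 'running' state."""
    if task.get("status") != "running":
        return task
    recovered = dict(task)
    status = _output_status(root / task["output"])
    if status is not None:
        # Scenario 2: the subagent finished; sync state with its output.
        recovered["status"] = status
    else:
        # Scenario 1: orphaned task; count the attempt and queue it again.
        recovered["attempts"] = task.get("attempts", 0) + 1
        recovered["status"] = "pending"
    return recovered
```

An unreadable or partially written output file is treated like a missing one here, which is a conservative choice rather than something the source prescribes.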
Scenario 3: Interrupted During User Approval
State: phase=ARCHITECTURE, has designs but no final_design
Recovery:
- Detect we're at approval checkpoint
- Re-present options to user
- Don't re-run architects
Scenario 4: Ancient State File
State: started_at is 7 days ago
Recovery:
- Warn user about stale state
- Offer to archive and start fresh
- If the user chooses to continue: re-validate state before resuming
Checkpoint Frequency
Update checkpoint after:
- Phase completion
- User approval
- Each task status change (in executor)
- Before spawning expensive agents (opus)
Archiving Old State
When starting fresh or after completion:
Archive pattern:
memory/state/.archive/{workflow_id}_{timestamp}/
- phase.json
- execution.json
Keep last 5 archives, delete older
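A possible implementation of the archive-and-prune pattern follows. `archive_state` is a hypothetical helper, and it assumes the timestamp contains no underscores and sorts lexicographically (for example `20240101T120000Z`), so the oldest archives can be found from the directory name alone.

```python
import shutil
from pathlib import Path


def archive_state(state_dir: Path, workflow_id: str, timestamp: str, keep: int = 5) -> Path:
    """Move state files into .archive/{workflow_id}_{timestamp}/ and prune old ones."""
    archive_root = state_dir / ".archive"
    dest = archive_root / f"{workflow_id}_{timestamp}"
    dest.mkdir(parents=True, exist_ok=False)
    for name in ("phase.json", "execution.json"):
        src = state_dir / name
        if src.is_file():
            shutil.move(str(src), str(dest / name))
    # Order archives by the timestamp suffix of the directory name.
    archives = sorted(
        (p for p in archive_root.iterdir() if p.is_dir()),
        key=lambda p: p.name.rsplit("_", 1)[-1],
    )
    # Delete everything except the newest `keep` archives.
    for old in archives[: max(len(archives) - keep, 0)]:
        shutil.rmtree(old)
    return dest
```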
Integration with Workflow
In /develop Command
markdown
## Resume Check
Before starting workflow:
1. Check for existing phase.json
2. If exists and in_progress:
- Show resume prompt to user
- "Resume workflow from {phase}? [Y/n]"
3. If user confirms: load checkpoint, continue
4. If user declines: archive old state, start fresh
In Each Phase Agent
markdown
## On Completion
Before returning:
1. Write output file
2. Update phase.json:
- Add to completed_phases
- Advance current_phase
- Set phase_status = complete
3. Log checkpoint saved
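The phase.json update in step 2 might look like the following sketch. `complete_phase` is a hypothetical name. The steps above do not say what status the next phase starts with, so this example assumes the checkpoint advances to the next phase with `phase_status` set to "pending", and marks the workflow "complete" only after the final phase.

```python
PHASES = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION", "VERIFICATION", "REFLECTION"]


def complete_phase(cp: dict, output: str, now: str) -> dict:
    """Return a new checkpoint with the current phase recorded as complete."""
    current = cp["current_phase"]
    idx = PHASES.index(current)
    updated = dict(cp)
    # Build a new list so the caller's checkpoint is left untouched.
    updated["completed_phases"] = cp.get("completed_phases", []) + [
        {"phase": current, "completed_at": now, "output": output}
    ]
    updated["last_updated"] = now
    if idx + 1 < len(PHASES):
        # Assumption: the next phase starts as "pending" until an agent picks it up.
        updated["current_phase"] = PHASES[idx + 1]
        updated["phase_status"] = "pending"
    else:
        updated["phase_status"] = "complete"
    return updated
```

The returned dictionary would then be persisted with an atomic write, after the phase's output file is already on disk.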
Principles
- State on disk - Never rely on conversation memory alone
- Validate before resume - Don't blindly trust old state
- Inform the user - Always tell them what's being resumed
- Atomic writes - Prevent half-written state
- Archive, don't delete - Keep old state for debugging