checkpoint
Robust workflow checkpoint and resume. Handles session interruption, state recovery, and safe resume across all workflow phases.
Install this agent skill to your Project
npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/clouder0/checkpoint
Checkpoint & Resume Skill
Pattern for saving workflow state and resuming after interruption.
When to Load This Skill
- Starting a workflow that might be interrupted
- Resuming after claude -r
- Recovering from crashes or timeouts
Core Concept
The dotagent workflow uses file-based state that survives session interruption:
Session crash/exit
↓
State files persist on disk:
- memory/state/phase.json # Which phase we're in
- memory/state/execution.json # Task-level progress
- memory/reports/*.json # Completed phase outputs
↓
claude -r (resume session)
↓
Orchestrator reads state, continues from last checkpoint
Checkpoint Files
Phase Checkpoint: memory/state/phase.json
{
  "workflow_id": "string",
  "started_at": "ISO-8601",
  "last_updated": "ISO-8601",
  "current_phase": "REQUIREMENTS|ARCHITECTURE|IMPLEMENTATION|VERIFICATION|REFLECTION",
  "phase_status": "pending|in_progress|complete|failed",
  "completed_phases": [
    {
      "phase": "REQUIREMENTS",
      "completed_at": "ISO-8601",
      "output": "memory/reports/demand.json"
    }
  ],
  "user_checkpoints": [
    {
      "phase": "REQUIREMENTS",
      "approved_at": "ISO-8601"
    }
  ],
  "interruption_safe": true
}
Execution Checkpoint: memory/state/execution.json
See executor agent for detailed schema with:
- Task status tracking
- Timestamps (started_at, completed_at)
- Output file paths for verification
Resume Protocol
Step 1: Detect Resume Scenario
ON WORKFLOW START:
  checkpoint = Read("memory/state/phase.json")
  IF checkpoint exists AND checkpoint.phase_status == "in_progress":
    → This is a RESUME
    → Log: "Detected interrupted workflow: {workflow_id}"
    → Go to Step 2
  ELSE:
    → Fresh start, create new checkpoint
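The detection step above can be sketched in Python. This is a minimal illustration, not the orchestrator's actual code: the function name detect_resume is hypothetical, and treating an unparseable file as "not resumable" is an assumption (Step 2 is where corrupted state is really handled).

```python
import json
from pathlib import Path
from typing import Optional

# Path taken from the skill text; relative to the project root.
PHASE_FILE = Path("memory/state/phase.json")


def detect_resume(phase_file: Path = PHASE_FILE) -> Optional[dict]:
    """Return the checkpoint dict when an interrupted workflow is found, else None."""
    if not phase_file.exists():
        return None  # no checkpoint on disk: fresh start
    try:
        checkpoint = json.loads(phase_file.read_text(encoding="utf-8"))
    except ValueError:
        # Assumption: an unparseable file is not resumable in this sketch.
        return None
    if isinstance(checkpoint, dict) and checkpoint.get("phase_status") == "in_progress":
        print(f"Detected interrupted workflow: {checkpoint.get('workflow_id')}")
        return checkpoint
    return None
```

A caller would go to Step 2 when a dict comes back and create a new checkpoint when the result is None.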
Step 2: Validate State Integrity
VALIDATE:
  1. Check all referenced output files exist
  2. Check timestamps are reasonable (not future, not ancient)
  3. Check phase progression is valid
  4. Check for incomplete writes (interruption_safe flag)
IF validation fails:
  → Ask user: "State appears corrupted. Start fresh? [y/N]"
  → Archive corrupted state to memory/state/.archive/
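One way to express the four validation checks is a function that returns a list of problems. The sketch below makes several assumptions that are not in the skill text: the 7-day cutoff for "ancient" (borrowed from Scenario 4), the reading of "valid progression" as "completed phases form a prefix of the phase order", and the names validate_checkpoint and parse_ts.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import List, Optional

PHASES = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION", "VERIFICATION", "REFLECTION"]
MAX_AGE = timedelta(days=7)  # assumed cutoff for "ancient" state


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def validate_checkpoint(cp: dict, root: Path, now: Optional[datetime] = None) -> List[str]:
    """Return a list of problems; an empty list means the state looks usable."""
    now = now or datetime.now(timezone.utc)
    problems: List[str] = []
    completed = cp.get("completed_phases", [])

    # 1. Every referenced output file must exist under the project root.
    for entry in completed:
        output = entry.get("output", "")
        if not output or not (root / output).is_file():
            problems.append(f"missing output: {output or '<none>'}")

    # 2. Timestamps must parse, not lie in the future, and not be too old.
    for key in ("started_at", "last_updated"):
        try:
            ts = parse_ts(cp[key])
        except (KeyError, ValueError, AttributeError):
            problems.append(f"bad timestamp: {key}")
            continue
        if ts > now:
            problems.append(f"{key} is in the future")
        elif now - ts > MAX_AGE:
            problems.append(f"{key} is older than {MAX_AGE.days} days")

    # 3. Completed phases must form a prefix of the known phase order.
    names = [entry.get("phase") for entry in completed]
    if names != PHASES[: len(names)] or cp.get("current_phase") not in PHASES:
        problems.append("invalid phase progression")

    # 4. The flag must be true, otherwise a write may have been interrupted.
    if cp.get("interruption_safe") is not True:
        problems.append("interruption_safe is not true")
    return problems
```

Returning every problem rather than failing on the first one lets the orchestrator show the user the full picture before asking whether to start fresh.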
Step 3: Determine Resume Point
RESUME LOGIC by phase:
REQUIREMENTS (in_progress):
- Check if demand.json exists and is valid
- If valid: advance to ARCHITECTURE
- If not: re-spawn PM agent
ARCHITECTURE (in_progress):
- Check for design files in memory/reports/designs/
- Check for final_design.json
- If final exists: advance to IMPLEMENTATION
- If designs exist but no final: spawn Roundtable
- If no designs: re-spawn Architects
IMPLEMENTATION (in_progress):
- Read execution.json
- Run executor recovery checks
- Continue execution loop
VERIFICATION (in_progress):
- Check for verification.json
- If exists: advance to REFLECTION
- If not: re-spawn QA
REFLECTION (in_progress):
- Check for reflection file
- If exists: workflow complete
- If not: re-spawn Reflector
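The per-phase rules above amount to a small decision table. The following sketch returns a string describing the next action; the action labels, the function name resume_action, and the file locations final_design.json, verification.json, and reflection.json directly under the reports directory are all assumptions for illustration (the skill text only names demand.json and the designs/ directory with a full path).

```python
import json
from pathlib import Path


def is_valid_json(path: Path) -> bool:
    """True if the file exists and parses as JSON."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):
        return False


def resume_action(phase: str, reports: Path) -> str:
    """Map an in-progress phase to the next action, based on files on disk."""
    if phase == "REQUIREMENTS":
        return "advance:ARCHITECTURE" if is_valid_json(reports / "demand.json") else "spawn:pm"
    if phase == "ARCHITECTURE":
        if (reports / "final_design.json").exists():  # assumed location
            return "advance:IMPLEMENTATION"
        designs = reports / "designs"
        has_designs = designs.is_dir() and any(designs.iterdir())
        return "spawn:roundtable" if has_designs else "spawn:architects"
    if phase == "IMPLEMENTATION":
        # The executor reads execution.json and runs its own recovery checks.
        return "continue:executor"
    if phase == "VERIFICATION":
        return "advance:REFLECTION" if (reports / "verification.json").exists() else "spawn:qa"
    if phase == "REFLECTION":
        # "reflection.json" is a hypothetical name; the skill only says "reflection file".
        return "done" if (reports / "reflection.json").exists() else "spawn:reflector"
    raise ValueError(f"unknown phase: {phase}")
```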
Step 4: Inform User and Continue
LOG to user:
"Resuming workflow {id} from {phase} phase"
"Last activity: {timestamp}"
"Completed: {list of completed phases}"
IF current_phase requires user approval (was at checkpoint):
→ Re-confirm with user before proceeding
Safe Checkpoint Writing
Always update checkpoint atomically:
# BAD: Can leave corrupted state
Write(checkpoint_file, new_state)
# GOOD: Atomic update
1. Set interruption_safe = false
2. Write to checkpoint_file.tmp
3. Rename checkpoint_file.tmp → checkpoint_file
4. Set interruption_safe = true
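A write-then-rename update can be sketched with the Python standard library. This is one interpretation, not the workflow's actual implementation: it relies on os.replace for atomicity and always stores interruption_safe as true in the file that lands, since the skill text does not say where the flag is toggled during the write. The name write_checkpoint is hypothetical.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint(path: Path, state: dict) -> None:
    """Write state to path so readers see either the old or the new file, never a partial one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = dict(state, interruption_safe=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # replaces the target in a single step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

If the process dies before os.replace, the previous checkpoint is untouched and only a stray .tmp file remains, which a later validation pass can ignore or delete.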
Recovery from Specific Scenarios
Scenario 1: Ctrl-C During Subagent
State: task-001 status="running", no output file
Recovery:
- Detect orphaned task
- Increment attempts
- Reset to "pending"
- Re-spawn on next loop
Scenario 2: Crash After Write, Before State Update
State: task-001 status="running", output file EXISTS
Recovery:
- Detect output file
- Read status from output
- Update state to match
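Scenarios 1 and 2 differ only in whether the task's output file exists, so both can be handled by one reconciliation function. The task shape used here (keys status, attempts, output) and the idea that the output file is JSON with a status field are assumptions; the real schema is defined by the executor agent.

```python
import json
from pathlib import Path


def reconcile_task(task: dict, root: Path) -> dict:
    """Return a corrected copy of a task that was left in the 'running' state."""
    if task.get("status") != "running":
        return task  # nothing to recover
    recovered = dict(task)
    output = task.get("output")
    path = root / output if output else None
    if path is not None and path.exists():
        # Scenario 2: the work finished but the state update was lost.
        # A half-written output file raises here and should be handled by the caller.
        result = json.loads(path.read_text(encoding="utf-8"))
        recovered["status"] = result["status"]
    else:
        # Scenario 1: orphaned task; count the attempt and queue it again.
        recovered["attempts"] = task.get("attempts", 0) + 1
        recovered["status"] = "pending"
    return recovered
```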
Scenario 3: Interrupted During User Approval
State: phase=ARCHITECTURE, has designs but no final_design
Recovery:
- Detect we're at approval checkpoint
- Re-present options to user
- Don't re-run architects
Scenario 4: Ancient State File
State: started_at is 7 days ago
Recovery:
- Warn user about stale state
- Offer to archive and start fresh
- If continue: proceed with caution
Checkpoint Frequency
Update checkpoint after:
- Phase completion
- User approval
- Each task status change (in executor)
- Before spawning expensive agents (opus)
Archiving Old State
When starting fresh or after completion:
Archive pattern:
memory/state/.archive/{workflow_id}_{timestamp}/
- phase.json
- execution.json
Keep last 5 archives, delete older
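The archive pattern could look like the sketch below. The compact UTC timestamp format in the directory name is an assumption (the skill only says {timestamp}), as is the function name archive_state; pruning sorts on that timestamp suffix, which works because the format is fixed-width.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

KEEP = 5  # number of archives to retain, per the skill text


def archive_state(state_dir: Path, workflow_id: str, now: Optional[datetime] = None) -> Path:
    """Move the state files into .archive/{workflow_id}_{timestamp}/ and prune old archives."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    archive_root = state_dir / ".archive"
    target = archive_root / f"{workflow_id}_{stamp}"
    target.mkdir(parents=True, exist_ok=False)
    for name in ("phase.json", "execution.json"):
        src = state_dir / name
        if src.exists():
            shutil.move(str(src), str(target / name))
    # Oldest first, ordered by the timestamp suffix of the directory name.
    archives = sorted(
        (p for p in archive_root.iterdir() if p.is_dir()),
        key=lambda p: p.name.rsplit("_", 1)[-1],
    )
    for old in archives[:-KEEP]:
        shutil.rmtree(old)
    return target
```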
Integration with Workflow
In /develop Command
## Resume Check
Before starting workflow:
1. Check for existing phase.json
2. If exists and in_progress:
- Show resume prompt to user
- "Resume workflow from {phase}? [Y/n]"
3. If user confirms: load checkpoint, continue
4. If user declines: archive old state, start fresh
In Each Phase Agent
## On Completion
Before returning:
1. Write output file
2. Update phase.json:
- Add to completed_phases
- Advance current_phase
- Set phase_status = complete
3. Log checkpoint saved
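As a rough sketch, the phase.json update on completion can be written as a pure function over the checkpoint dict. It follows the three listed sub-steps literally, so phase_status is set to "complete" after current_phase advances; it assumes the next phase agent sets the status to in_progress when it starts. The name complete_phase is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

PHASES = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION", "VERIFICATION", "REFLECTION"]


def complete_phase(cp: dict, output: str, now: Optional[datetime] = None) -> dict:
    """Return a new checkpoint with the current phase recorded as completed."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    phase = cp["current_phase"]
    index = PHASES.index(phase)  # raises ValueError for an unknown phase
    updated = dict(cp)
    updated["completed_phases"] = list(cp.get("completed_phases", [])) + [
        {"phase": phase, "completed_at": stamp, "output": output}
    ]
    if index + 1 < len(PHASES):
        updated["current_phase"] = PHASES[index + 1]  # advance; the last phase stays put
    updated["phase_status"] = "complete"
    updated["last_updated"] = stamp
    return updated
```

The returned dict would then be persisted with whatever atomic write routine the workflow uses, after the phase's output file is already on disk.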
Principles
- State on disk - Never rely on conversation memory alone
- Validate before resume - Don't blindly trust old state
- Inform the user - Always tell them what's being resumed
- Atomic writes - Prevent half-written state
- Archive, don't delete - Keep old state for debugging