Agent skill
incident-runbook-generator
Creates step-by-step incident response runbooks for common outages with actions, owners, rollback procedures, and communication templates. Use for "incident runbook", "outage response", "incident management", or "on-call procedures".
Install this agent skill to your Project
npx add-skill https://github.com/patricio0312rev/skills/tree/main/performance/incident-runbook-generator
# Incident Runbook Generator
Create actionable runbooks for common incidents.
## Runbook Template
# Runbook: Database Connection Pool Exhausted
**Severity:** P1 (Critical)
**Estimated Time to Resolve:** 15-30 minutes
**Owner:** Database Team (On-call)
## Symptoms
- Application errors: "connection pool exhausted"
- Increased API latency (>5s)
- Failed health checks
- CloudWatch alarm: `DatabaseConnectionsHigh`
## Detection
- Alert: DatabaseConnectionPoolExhausted
- Metrics: `active_connections > max_connections * 0.9`
- Logs: "Error: connect ETIMEDOUT"
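
The metric condition above can be expressed as a small predicate. This is a minimal Python sketch, not part of any monitoring tool: the function name `pool_nearly_exhausted` and its `threshold` parameter are hypothetical, and the two counts are assumed to come from the database (for example from `pg_stat_activity` and the configured `max_connections`).

```python
def pool_nearly_exhausted(active_connections: int,
                          max_connections: int,
                          threshold: float = 0.9) -> bool:
    """Return True when active connections exceed threshold * max_connections.

    Mirrors the detection rule `active_connections > max_connections * 0.9`.
    The name and the default threshold are illustrative assumptions.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active_connections > max_connections * threshold


# Example: 95 of 100 connections in use crosses the 90% threshold.
print(pool_nearly_exhausted(95, 100))
print(pool_nearly_exhausted(50, 100))
```

Keeping the threshold as a parameter lets the same check back both a warning alert and the critical alert named in this section.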
## Immediate Actions (5 min)
1. **Verify the issue**
   ```sql
   -- Check current connections
   SELECT count(*) FROM pg_stat_activity;
   ```
2. **Identify long-running queries**
   ```sql
   SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active'
   ORDER BY duration DESC
   LIMIT 10;
   ```
3. **Kill blocking queries** (if safe)
   ```sql
   -- Terminates sessions that have been idle in a transaction for more than 5 minutes
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle in transaction'
     AND now() - state_change > interval '5 minutes';
   ```
## Mitigation (10 min)
1. **Scale up connection pool** (temporary)
   ```bash
   # Update RDS parameter group. max_connections is a static parameter,
   # so the new value takes effect only after the instance reboots.
   aws rds modify-db-parameter-group \
     --db-parameter-group-name prod-params \
     --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"
   ```
2. **Restart application** (if needed)
   ```bash
   kubectl rollout restart deployment/api
   ```
3. **Monitor recovery**
   ```bash
   watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity;"'
   ```
## Root Cause Investigation
Check for:
- Recent deployment (new code with connection leaks)
- Traffic spike (legitimate or DDoS)
- Slow queries holding connections
- Connection pool configuration too small
- Application not releasing connections
## Rollback Steps
If caused by deployment:
```bash
# Rollback to previous version
kubectl rollout undo deployment/api
# Verify
kubectl rollout status deployment/api
```
## Communication Template
**Initial (within 5 min):**
```
🚨 INCIDENT: Database connection pool exhausted
Status: Investigating
Impact: API errors and slowness
ETA: 15-30 min
Next update: 10 min
```
**Update (every 10 min):**
```
UPDATE: Killed long-running queries
Status: Mitigating
Impact: Still degraded, improving
Actions: Scaling connection pool
Next update: 10 min
```
**Resolution:**
```
✅ RESOLVED: Database connections normalized
Duration: 25 minutes
Root cause: Connection leak in v2.3.4
Fix: Rolled back to v2.3.3
Follow-up: Bug fix PR #1234
Postmortem: [link]
```
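
The initial and update messages share a fixed field order, so they can be rendered from structured data to keep updates consistent under pressure. The following is a minimal Python sketch under that assumption; `StatusUpdate` and `render_update` are hypothetical names, not part of this skill or of any chat tool's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusUpdate:
    """Fields of one incident status message (names are illustrative)."""
    headline: str
    status: str
    impact: str
    next_update_min: int
    eta: Optional[str] = None      # the initial message carries an ETA
    actions: Optional[str] = None  # follow-up messages list current actions


def render_update(update: StatusUpdate) -> str:
    """Render the message line by line, skipping fields that are not set."""
    lines = [
        update.headline,
        f"Status: {update.status}",
        f"Impact: {update.impact}",
    ]
    if update.eta is not None:
        lines.append(f"ETA: {update.eta}")
    if update.actions is not None:
        lines.append(f"Actions: {update.actions}")
    lines.append(f"Next update: {update.next_update_min} min")
    return "\n".join(lines)


initial = StatusUpdate(
    headline="INCIDENT: Database connection pool exhausted",
    status="Investigating",
    impact="API errors and slowness",
    next_update_min=10,
    eta="15-30 min",
)
print(render_update(initial))
```

The resolution message has a different shape (duration, root cause, fix) and would need its own structure; it is left out of this sketch.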
## Prevention
- Add connection pool metrics to dashboards
- Implement connection timeout (30s)
- Add connection leak detection in tests
- Set up pre-deployment load testing
- Review connection pool sizing
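
One way to approach "connection leak detection in tests" is to count checked-out connections and assert the count returns to zero after each test. The sketch below uses a toy in-memory pool to show the idea; `TrackedPool` and `assert_no_leaks` are hypothetical names, and a real setup would wrap or query the actual driver's pool instead.

```python
from contextlib import contextmanager


class TrackedPool:
    """Toy pool that only counts checked-out connections (illustrative)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> object:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()  # stand-in for a real connection

    def release(self, conn: object) -> None:
        self.in_use -= 1

    @contextmanager
    def connection(self):
        """Acquire a connection and always release it, even on error."""
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)


def assert_no_leaks(pool: TrackedPool) -> None:
    """Fail if any connection is still checked out."""
    if pool.in_use != 0:
        raise AssertionError(f"{pool.in_use} connection(s) were not released")


demo_pool = TrackedPool(size=2)
with demo_pool.connection():
    pass  # run a query here
assert_no_leaks(demo_pool)  # passes: the context manager released the connection
```

Calling a check like `assert_no_leaks` in test teardown surfaces code paths that call `acquire()` without a matching `release()`.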
## Related Runbooks
- Database High CPU
- Slow Database Queries
- Application OOM
## Output Checklist
- [ ] Symptoms documented
- [ ] Detection criteria
- [ ] Step-by-step actions
- [ ] Owner assigned
- [ ] Rollback procedure
- [ ] Communication templates
- [ ] Prevention measures
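
The checklist can be checked mechanically against a generated runbook. This is a rough heuristic sketch in Python, assuming the runbook follows the section names used in the template above; `REQUIRED_MARKERS` and `missing_checklist_items` are hypothetical names, and a substring match only shows that a section exists, not that its content is adequate.

```python
from typing import Dict, List

# Checklist item -> lowercase text expected somewhere in the runbook.
# The marker strings are assumptions based on the template's headings.
REQUIRED_MARKERS: Dict[str, str] = {
    "Symptoms documented": "symptoms",
    "Detection criteria": "detection",
    "Step-by-step actions": "immediate actions",
    "Owner assigned": "owner:",
    "Rollback procedure": "rollback",
    "Communication templates": "communication template",
    "Prevention measures": "prevention",
}


def missing_checklist_items(runbook_text: str) -> List[str]:
    """Return checklist items whose marker text is absent from the runbook."""
    text = runbook_text.lower()
    return [item for item, marker in REQUIRED_MARKERS.items() if marker not in text]


draft = "## Symptoms\n## Detection\n**Owner:** Database Team (On-call)\n"
print(missing_checklist_items(draft))
```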