Agent skill
# Operational Runbooks
Step-by-step procedures for operating and troubleshooting systems, including incident response, maintenance tasks, and operational workflows.
Current Level: Expert (Enterprise Scale)
Domain: Documentation / Operations
## Overview
Runbooks provide step-by-step procedures for operating and troubleshooting systems. Effective runbooks enable teams to handle incidents, perform maintenance, and operate systems consistently with clear, actionable instructions.
## 1. Executive Summary & Strategic Necessity
- Context: In 2025-2026, operational runbooks built on best practices help operations teams achieve operational excellence at enterprise scale.
- Business Impact: Operational runbooks reduce system downtime by giving responders documented procedures (lower MTTR), cut team management costs (higher operational efficiency), improve gross margin through faster incident resolution, and make operations more consistent.
- Product Thinking: Operational runbooks address a common pain point, that teams need clear operational procedures, by providing standardized procedures.
## 2. Technical Deep Dive (The "How-to")
- Core Logic: Operational runbooks rely on the following best practices:
  - Runbook Structure: Define a runbook structure (Service Overview, Architecture, Deployment, Monitoring, Alerts, Troubleshooting, Emergency Procedures, Maintenance, Tools and Access, Contacts)
  - Runbook Templates: Create templates for writing runbooks (basic template, advanced template)
  - Runbook Testing: Test runbooks (accuracy testing, completeness testing, usability testing, emergency testing)
  - Runbook Maintenance: Keep runbooks current (regular updates, quarterly reviews, testing procedures)
  - Agent-Friendly Runbooks: Write runbooks that AI agents can follow (clear steps, verification criteria, troubleshooting procedures)
- Architecture Diagram Requirements: The architecture diagram for a runbook system must include:
  - Runbook Repository: Stores runbooks (Git repository, documentation platform)
  - Runbook Templates: Templates for writing runbooks (basic template, advanced template)
  - Runbook Testing Framework: Tests runbooks (accuracy testing, completeness testing, usability testing)
  - Runbook Maintenance System: Keeps runbooks current (version control, review process, update notifications)
  - Runbook Search: Finds runbooks (full-text search, tag-based search, category-based search)
  - Runbook Analytics: Analyzes runbook usage (usage metrics, search analytics, feedback collection)
  - Observability: Logging, monitoring, and tracing for debugging and tuning
- Implementation Workflow: Steps for putting operational runbooks into practice:
  - Planning Phase: Define requirements and select a suitable runbook platform
  - Runbook Repository Setup: Set up the repository that stores runbooks
  - Runbook Templates Creation: Create the templates used to write runbooks
  - Runbook Testing Framework Setup: Set up the framework used to test runbooks
  - Runbook Maintenance System Setup: Set up the system used to keep runbooks current
  - Testing Phase: Unit, integration, and E2E tests against real scenarios
  - Deployment: Deploy with a CI/CD pipeline and set up monitoring
  - Optimization: Optimize runbook search, add runbook templates, improve UX
  - Maintenance: Monitor runbook usage, update runbook content, handle edge cases
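The agent-friendly pattern listed above (clear steps, each with a verification criterion) can be sketched in shell. This is a minimal illustration under stated assumptions, not part of the original skill: `run_step` is a hypothetical helper, and the action and verification commands are whatever functions or scripts the runbook author supplies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one runbook step, then check its verification criterion.
# Usage: run_step "description" ACTION_COMMAND VERIFY_COMMAND
# Returns 0 on success, 1 if the action fails, 2 if the verification fails.
run_step() {
  local name="$1" action="$2" verify="$3"
  echo "STEP: ${name}"
  if ! "$action"; then
    echo "FAILED (action): ${name}" >&2
    return 1
  fi
  if ! "$verify"; then
    echo "FAILED (verification): ${name}" >&2
    return 2
  fi
  echo "OK: ${name}"
}
```

A runbook script might call it as `run_step "Restart app" restart_app app_is_healthy`, where both names are shell functions defined by the author. The distinct return codes let a human or an agent tell "the action failed" apart from "the action ran but the system is still unhealthy".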
## 3. Tooling & Tech Stack
- Enterprise Tools: Industry-standard tools for operational runbooks at enterprise scale:
  - Git: Version control system for runbook versioning
  - GitHub/GitLab: Git hosting platforms for runbook collaboration
  - Confluence: Documentation platform for runbook storage
  - Notion: Documentation platform for runbook storage
  - GitBook: Documentation platform for runbook storage
  - Algolia: Search platform for runbook search
  - Elasticsearch: Search platform for runbook search
  - Grafana: Monitoring platform for runbook analytics
  - Prometheus: Metrics platform for runbook analytics
  - PagerDuty: Incident management platform for runbook integration
- Configuration Essentials: Key settings for a stable runbook system:
  - Runbook Templates: Configure runbook templates (basic template, advanced template)
  - Runbook Testing: Configure runbook testing (accuracy testing, completeness testing, usability testing)
  - Runbook Maintenance: Configure runbook maintenance (version control, review process, update notifications)
  - Runbook Search: Configure runbook search (full-text search, tag-based search, category-based search)
  - Monitoring: Configure monitoring to track runbook usage (usage metrics, search analytics, feedback collection)
  - Secret Management: Use environment variables or a secret manager (AWS Secrets Manager, HashiCorp Vault)
  - Rate Limiting: Per-user and per-IP rate limits to prevent abuse (100-1000 requests/hour)
  - Logging Level: INFO for production, DEBUG for development
  - Observability: Track success rate, runbook usage, and search accuracy against targets
  - Documentation: Maintain runbook platform documentation for team reference
## 4. Standards, Compliance & Security
- International Standards: Relevant standards:
  - ISO/IEC 27001: Information Security Management - for managing secrets and access control
  - ISO/IEC 27017: Code of Practice for Information Security Controls for Cloud Services - for secure operations
  - ISO/IEC 22301: Business Continuity Management - for disaster recovery
  - GDPR: General Data Protection Regulation - for handling personal data and user consent
  - SOC 2 Type II: Security Controls - for audit and compliance
- Security Protocol: Protection mechanisms for operational runbooks:
  - Input Validation: Validate and sanitize all input before processing (prevent XSS, SQL injection)
  - Output Sanitization: Filter sensitive information out of runbooks (API keys, secrets, passwords)
  - Access Control: RBAC (Role-Based Access Control) for runbook access; some runbooks are internal only
  - Audit Trail: Log every runbook access with timestamp, user ID, and runbook accessed (for forensics and compliance)
  - Rate Limiting: Per-user and per-IP rate limits to prevent abuse (100-1000 requests/hour)
  - Secure Communication: TLS 1.3 for HTTPS access
  - Secret Management: Use environment variables or a secret manager (AWS Secrets Manager, HashiCorp Vault)
  - Content Security: CSP headers to prevent XSS attacks
  - Authentication: Require authentication for internal runbooks (SSO, OAuth)
  - Data Encryption: Encrypt sensitive data at rest in the database (AES-256 or customer-managed keys)
- Explainability: (for runbooks) Techniques that make a procedure and its outcome easy to explain:
  - Clear Structure: Keep a consistent runbook structure for easy understanding
  - Detailed Steps: Provide detailed steps for complex procedures
  - Verification Criteria: Include verification criteria for each step
  - Troubleshooting Procedures: Provide troubleshooting procedures for common issues
  - Reference Links: Link to external documentation for complex topics
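As one way to approach the output-sanitization point above, a pre-commit or CI check can flag runbook lines that look like embedded credentials. The sketch below is a heuristic only, not a complete secret detector; the function name and the keyword list are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Heuristic scan (an assumption, not a full secret detector): print numbered lines
# that look like "password = value", "api_key: value", "secret=value" or "token: value".
# grep exits with status 1 when nothing matches, so a clean file returns non-zero.
find_secret_candidates() {
  grep -nEi '(password|secret|api[_-]?key|token)[[:space:]]*[:=][[:space:]]*[^[:space:]]+' "$1"
}
```

Prose such as "see the password manager" is not flagged because the pattern requires a `:` or `=` directly after the keyword. Every hit still needs human review, and a dedicated scanning tool is the safer choice for anything beyond a quick check.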
## 5. Unit Economics & Performance Metrics (KPIs)
- Cost Calculation: Formulas for the cost of running operational runbooks:
  - Platform Cost = Number of users × Cost per user/month
    - Confluence: $5-15/user/month
    - Notion: $8-15/user/month
    - GitBook: $8-20/user/month
  - Storage Cost = Runbook storage (GB) × Cost per GB/month
    - GitHub Pages: Free
    - GitLab Pages: Free
    - S3: $0.023/GB/month
  - Search Cost = (Search API calls / 1000) × Cost per 1000 calls
    - Algolia: $1-5/1000 calls
    - Elasticsearch: $0-100/month (self-hosted)
  - Monitoring Cost = Monitoring platform subscription cost per month
    - Grafana Cloud: $50-500/month
    - Datadog: $15-23/host/month
  - Total Monthly Cost = Platform Cost + Storage Cost + Search Cost + Monitoring Cost
  - Infrastructure Costs = Compute ($0/month for static sites) + Storage ($0/month for static sites) + Monitoring ($50-500/month)
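The cost formulas above can be worked through with a small helper. The inputs in the example call are illustrative assumptions picked from the listed price ranges (20 users at $10, 10 GB at $0.023, 10,000 search calls at $1 per 1000, $50 monitoring), not measured figures, and `total_monthly_cost` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Usage: total_monthly_cost USERS COST_PER_USER STORAGE_GB COST_PER_GB \
#                           SEARCH_CALLS COST_PER_1000_CALLS MONITORING_COST
# Prints Platform + Storage + Search + Monitoring, rounded to two decimals.
total_monthly_cost() {
  awk -v users="$1" -v per_user="$2" -v gb="$3" -v per_gb="$4" \
      -v calls="$5" -v per_1000="$6" -v monitoring="$7" 'BEGIN {
    platform = users * per_user
    storage  = gb * per_gb
    search   = (calls / 1000) * per_1000
    printf "%.2f\n", platform + storage + search + monitoring
  }'
}

# 20*10 + 10*0.023 + (10000/1000)*1 + 50
total_monthly_cost 20 10 10 0.023 10000 1 50   # prints 260.23
```

awk is used because shell arithmetic is integer-only and the per-GB price is fractional.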
- Key Performance Indicators: Technical success metrics:
  - Runbook Coverage: Percentage of services that have runbooks (Target: >80%)
  - Runbook Quality Score: Runbook quality score from automated checks (Target: >4.0)
  - Runbook Accuracy: Percentage of runbooks with accurate procedures (Target: >95%)
  - Runbook Completeness: Percentage of runbooks with complete procedures (Target: >90%)
  - Operator Satisfaction Score: 1-5 rating from operator feedback (Target: >4.0)
  - Error Rate: Rate of errors (Target: <1%)
  - MTTR: Mean Time To Resolution (Target: <30 minutes)
  - Runbook Usage: Percentage of operators who use runbooks (Target: >80%)
  - Search Accuracy: Percentage of search results that are relevant (Target: >90%)
  - Knowledge Transfer: Percentage of knowledge transfer (Target: >80%)
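Runbook coverage, the first KPI above, is simple arithmetic: services with a runbook divided by total services. A minimal sketch, assuming both counts are already known (the function names are hypothetical and the result is truncated to a whole percent):

```bash
#!/usr/bin/env bash
# Usage: coverage_pct SERVICES_WITH_RUNBOOKS TOTAL_SERVICES
# Prints the coverage as an integer percentage (0 when there are no services).
coverage_pct() {
  local covered="$1" total="$2"
  if (( total == 0 )); then
    echo 0
    return
  fi
  echo $(( covered * 100 / total ))
}

# Succeeds only when coverage is above the >80% target stated above.
meets_coverage_target() {
  (( $(coverage_pct "$1" "$2") > 80 ))
}
```

For example, 42 documented services out of 50 gives 84, which meets the target, while 40 out of 50 gives exactly 80 and does not.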
## 6. Strategic Recommendations (CTO Insights)
- Phase Rollout: Recommendations for introducing operational runbooks gradually to reduce risk:
  - Phase 1: MVP (1-2 months): Deploy simple operational runbooks with basic templates and manual review for the internal team before opening them to the public
    - Goal: Validate the runbook architecture and gather feedback
    - Success Criteria: >80% runbook coverage, <30s search time
    - Risk Mitigation: Internal-only access, manual review before public release
  - Phase 2: Beta (2-3 months): Expand with advanced templates and the runbook testing framework for selected customers
    - Goal: Test scalability and runbook reliability
    - Success Criteria: >90% runbook coverage, <15s search time
    - Risk Mitigation: Canary deployment, feature flags, gradual rollout
  - Phase 3: GA (3-6 months): Full rollout with advanced features (Runbook Search, Runbook Analytics, Agent-Friendly Runbooks)
    - Goal: Enterprise-grade runbook quality and performance
    - Success Criteria: >95% runbook coverage, <10s search time, 99.9% uptime
    - Risk Mitigation: Load testing, disaster recovery, blue-green deployment
- Pitfalls to Avoid: Common mistakes at enterprise scale:
  - Over-engineering: Building runbooks that are too complex (too many sections, complex templates) → start simple and iterate
  - No Runbook Templates: Without templates, consistency suffers → implement runbook templates for common patterns
  - Outdated Runbooks: Runbooks drift out of sync with the systems they describe → implement automated runbook quality checks
  - Missing Runbook Testing: Without testing, runbooks are unreliable → implement a runbook testing framework
  - No Runbook Maintenance: Without maintenance, runbooks become outdated → implement a runbook maintenance system
  - No Agent-Friendly Runbooks: Without them, AI agents get confused → implement agent-friendly runbook guidelines
  - Poor Search: With poor search, operators cannot find runbooks → implement advanced search (Algolia, Elasticsearch)
  - No Runbook Analytics: Without analytics, runbook usage is unknown → implement runbook analytics
  - No Runbook Documentation: Without platform documentation, the team has nothing to refer to → maintain runbook platform documentation
  - Single Point of Failure: No redundancy or fallback → deploy multiple instances behind a CDN
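The "outdated runbooks" pitfall above can be partly caught by an automated staleness check. The sketch below lists Markdown runbooks whose file modification time is older than a cutoff; the 90-day default is an assumption that mirrors a quarterly review cadence, and `stale_runbooks` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Usage: stale_runbooks DIRECTORY [MAX_AGE_DAYS]
# Lists *.md files under DIRECTORY last modified more than MAX_AGE_DAYS days ago
# (default 90), sorted by path.
stale_runbooks() {
  local dir="$1" max_age_days="${2:-90}"
  find "$dir" -type f -name '*.md' -mtime +"$max_age_days" | sort
}
```

File modification times are reset by a fresh Git checkout, so in CI the last commit date of each file is likely a more reliable signal than `-mtime`.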
## Core Concepts
## 1. Runbook Purpose
### Why Runbooks Matter
# Runbook Purpose
## Benefits
### 1. Operational Excellence
- Standardized procedures
- Consistent operations
- Reduced errors
- Improved reliability
### 2. Knowledge Management
- Preserves institutional knowledge
- Onboards new team members
- Reduces dependency on individuals
- Enables knowledge sharing
### 3. Incident Response
- Faster resolution
- Reduced MTTR (Mean Time To Resolution)
- Clear communication
- Better coordination
### 4. Compliance and Auditing
- Documented procedures
- Audit trails
- Regulatory compliance
- Process improvement
## Consequences of Poor Runbooks
### 1. Operational Issues
- Inconsistent procedures
- Human errors
- Longer resolution times
- Increased downtime
### 2. Knowledge Loss
- When team members leave
- Over time
- During incidents
- Under pressure
### 3. Increased Costs
- More incidents
- Longer resolution times
- Higher support costs
- Lost revenue
### 4. Team Stress
- Uncertainty during incidents
- Lack of confidence
- Burnout
- Turnover
## 2. Runbook Structure
### Standard Runbook Structure
# [Service Name] Runbook
## Service Overview
- Service description
- Purpose and scope
- Criticality level
- Dependencies
## Architecture
- System architecture
- Components and their roles
- Data flow
- Integration points
## Deployment
- Deployment process
- Deployment environments
- Rollback procedures
- Deployment verification
## Monitoring
- Key metrics
- Alert thresholds
- Monitoring dashboards
- Log sources
## Alerts
- Alert definitions
- Severity levels
- Notification channels
- Escalation paths
## Troubleshooting
- Common issues
- Diagnostic steps
- Resolution procedures
- Known workarounds
## Emergency Procedures
- Critical incidents
- Emergency contacts
- Escalation procedures
- Communication templates
## Maintenance
- Maintenance procedures
- Backup and restore
- Health checks
- Performance tuning
## Tools and Access
- Required tools
- Access requirements
- Command references
- Useful scripts
## Contacts
- On-call rotation
- Subject matter experts
- Escalation contacts
- Support channels
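A completeness test for this structure can be automated. The sketch below assumes each runbook is a Markdown file that uses the ten `## ` section headings from the template above verbatim; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Required sections, taken from the standard runbook structure above.
REQUIRED_SECTIONS=(
  "Service Overview" "Architecture" "Deployment" "Monitoring" "Alerts"
  "Troubleshooting" "Emergency Procedures" "Maintenance" "Tools and Access" "Contacts"
)

# Usage: missing_sections RUNBOOK_FILE
# Prints one line for every required "## " heading that the file lacks.
missing_sections() {
  local file="$1" section
  for section in "${REQUIRED_SECTIONS[@]}"; do
    # -x matches the whole line, -F treats the heading as a fixed string
    grep -qxF "## ${section}" "$file" || printf '%s\n' "$section"
  done
}
```

Empty output means the runbook has every section. A CI job could fail the build whenever the output is non-empty.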
## 3. Service Overview
### Service Overview Template
# Service Overview
## Service Name
[Service Name]
## Description
[Brief description of what the service does and its purpose]
## Criticality
- **Level**: [Critical | High | Medium | Low]
- **SLA**: [Service Level Agreement details]
- **RTO**: [Recovery Time Objective]
- **RPO**: [Recovery Point Objective]
## Purpose and Scope
- **Primary Purpose**: [Main function]
- **Key Features**: [List of key features]
- **Target Users**: [Who uses this service]
- **Service Boundaries**: [What's in scope and what's not]
## Dependencies
### Upstream Dependencies
- [Dependency 1] - [Purpose]
- [Dependency 2] - [Purpose]
- [Dependency 3] - [Purpose]
### Downstream Dependencies
- [Dependency 1] - [Purpose]
- [Dependency 2] - [Purpose]
- [Dependency 3] - [Purpose]
## Technology Stack
- **Language**: [Programming language]
- **Framework**: [Framework name]
- **Database**: [Database type and version]
- **Infrastructure**: [Cloud provider, hosting]
- **Other**: [Other relevant technologies]
## Key Metrics
- **Throughput**: [Requests per second]
- **Latency**: [P50, P95, P99]
- **Error Rate**: [Target error rate]
- **Availability**: [Target uptime percentage]
## Known Limitations
- [Limitation 1]
- [Limitation 2]
- [Limitation 3]
## 4. Architecture
### Architecture Documentation
# Architecture
## System Architecture
### High-Level Architecture
```mermaid
graph TB
    subgraph "Service"
        Component1[Component 1]
        Component2[Component 2]
        Component3[Component 3]
    end
    subgraph "Data Layer"
        DB[(Database)]
        Cache[(Cache)]
    end
    subgraph "External"
        API[External API]
        Queue[Message Queue]
    end
    Component1 --> Component2
    Component2 --> Component3
    Component2 --> DB
    Component3 --> Cache
    Component1 --> API
    Component3 --> Queue
```
## Components
### Component 1: [Component Name]
- **Purpose**: [What it does]
- **Technology**: [Technology stack]
- **Dependencies**: [What it depends on]
- **Interfaces**: [APIs, protocols]
- **Scaling**: [Horizontal/Vertical, limits]
### Component 2: [Component Name]
- **Purpose**: [What it does]
- **Technology**: [Technology stack]
- **Dependencies**: [What it depends on]
- **Interfaces**: [APIs, protocols]
- **Scaling**: [Horizontal/Vertical, limits]
### Component 3: [Component Name]
- **Purpose**: [What it does]
- **Technology**: [Technology stack]
- **Dependencies**: [What it depends on]
- **Interfaces**: [APIs, protocols]
- **Scaling**: [Horizontal/Vertical, limits]
## Data Flow
### Request Flow
1. Client → Load Balancer
2. Load Balancer → API Gateway
3. API Gateway → Service
4. Service → Database
5. Database → Service
6. Service → Client
### Data Flow Diagram
```mermaid
sequenceDiagram
    participant Client
    participant LB
    participant API
    participant DB
    Client->>LB: Request
    LB->>API: Forward Request
    API->>DB: Query
    DB-->>API: Data
    API-->>LB: Response
    LB-->>Client: Response
```
## Integration Points
### External Integrations
- [Integration 1]: [Description, API docs link]
- [Integration 2]: [Description, API docs link]
- [Integration 3]: [Description, API docs link]
### Internal Integrations
- [Integration 1]: [Description, API docs link]
- [Integration 2]: [Description, API docs link]
- [Integration 3]: [Description, API docs link]
---
## 5. Deployment
### Deployment Procedures
# Deployment
## Deployment Process
### Prerequisites
- [ ] Access to deployment environment
- [ ] Valid deployment credentials
- [ ] Deployment tools installed
- [ ] Database migrations prepared
- [ ] Feature flags configured
- [ ] Monitoring configured
### Deployment Steps
#### 1. Pre-Deployment Checks
```bash
# Run health checks
curl https://api.example.com/health
# Check database connectivity
./scripts/check-db.sh
# Verify configuration
./scripts/verify-config.sh
```
#### 2. Build and Test
```bash
# Build application
npm run build
# Run tests
npm test
# Run integration tests
npm run test:integration
```
#### 3. Deploy
```bash
# Deploy to staging
./scripts/deploy.sh staging
# Verify staging deployment
./scripts/verify.sh staging
# Deploy to production
./scripts/deploy.sh production
```
#### 4. Post-Deployment Verification
```bash
# Run smoke tests
./scripts/smoke-tests.sh
# Check metrics
./scripts/check-metrics.sh
# Verify logs
./scripts/check-logs.sh
```
## Rollback Procedures
### When to Rollback
- Deployment fails
- Critical errors detected
- Performance degradation
- Data corruption
### Rollback Steps
```bash
# Stop current deployment
./scripts/stop.sh
# Rollback to previous version
./scripts/rollback.sh
# Start services
./scripts/start.sh
# Verify rollback (run once services are back up)
./scripts/verify.sh
```
## Deployment Environments
### Development
- **URL**: https://dev.example.com
- **Database**: dev-db.example.com
- **Purpose**: Development and testing
- **Data**: Test data, can be reset
### Staging
- **URL**: https://staging.example.com
- **Database**: staging-db.example.com
- **Purpose**: Pre-production testing
- **Data**: Production-like data
### Production
- **URL**: https://example.com
- **Database**: prod-db.example.com
- **Purpose**: Live production
- **Data**: Real user data
## Deployment Verification
### Health Check Endpoint
```bash
# Health check
curl https://api.example.com/health
# Expected response:
# {
#   "status": "healthy",
#   "version": "1.0.0",
#   "timestamp": "2024-01-15T10:00:00Z"
# }
```
### Smoke Tests
```bash
# Run smoke tests
npm run smoke-tests
# Expected output:
# ✓ API is responding
# ✓ Database is accessible
# ✓ Cache is working
# ✓ External integrations are functional
```
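Right after a deployment the health endpoint may need a few seconds before it reports healthy, so a single `curl` can give a false failure. One way to handle that is a small retry wrapper; this is a sketch under stated assumptions, with `retry` as a hypothetical helper and the attempt count and delay chosen arbitrarily.

```bash
#!/usr/bin/env bash
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds; gives up with status 1 after MAX_ATTEMPTS tries.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "check failed after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# Example (not executed here): poll the health endpoint up to 5 times, 3 seconds apart.
# curl -f makes HTTP error responses count as failures.
#   retry 5 3 curl -fsS https://api.example.com/health
```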
---
## 6. Monitoring
### Monitoring Setup
# Monitoring
## Key Metrics
### Application Metrics
- **Request Rate**: Requests per second
- **Response Time**: P50, P95, P99 latency
- **Error Rate**: Percentage of failed requests
- **Throughput**: Requests handled per second
### System Metrics
- **CPU Usage**: Percentage of CPU utilization
- **Memory Usage**: Percentage of memory utilization
- **Disk Usage**: Percentage of disk utilization
- **Network I/O**: Network traffic in/out
### Business Metrics
- **Active Users**: Number of active users
- **Transaction Volume**: Number of transactions
- **Conversion Rate**: Percentage of conversions
- **Revenue**: Revenue generated
## Alert Thresholds
### Warning Thresholds
- CPU > 70%
- Memory > 80%
- Error rate > 1%
- Response time P95 > 500ms
### Critical Thresholds
- CPU > 90%
- Memory > 95%
- Error rate > 5%
- Response time P95 > 2000ms
## Monitoring Dashboards
### Main Dashboard
- **URL**: https://grafana.example.com/d/main
- **Metrics**: All key metrics
- **Refresh**: 30 seconds
### Error Dashboard
- **URL**: https://grafana.example.com/d/errors
- **Metrics**: Error rates, error types
- **Refresh**: 10 seconds
### Performance Dashboard
- **URL**: https://grafana.example.com/d/performance
- **Metrics**: Response times, throughput
- **Refresh**: 30 seconds
## Log Sources
### Application Logs
- **Location**: /var/log/app/
- **Format**: JSON
- **Retention**: 30 days
- **Access**: SSH to server
### Access Logs
- **Location**: /var/log/nginx/access.log
- **Format**: Combined
- **Retention**: 90 days
- **Access**: SSH to server
### Error Logs
- **Location**: /var/log/app/error.log
- **Format**: JSON
- **Retention**: 90 days
- **Access**: SSH to server
## Monitoring Tools
### Application Monitoring
- **Tool**: [APM tool name]
- **URL**: [Dashboard URL]
- **Access**: [Access requirements]
### Infrastructure Monitoring
- **Tool**: [Infrastructure tool name]
- **URL**: [Dashboard URL]
- **Access**: [Access requirements]
### Log Aggregation
- **Tool**: [Log aggregation tool name]
- **URL**: [Dashboard URL]
- **Access**: [Access requirements]
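The warning and critical thresholds in this section translate directly into a classification rule. A minimal sketch for integer metrics such as CPU or memory percentage (the function name is an assumption, and a real alerting system would also apply the "for 5 minutes" duration condition):

```bash
#!/usr/bin/env bash
# Usage: classify_metric VALUE WARNING_THRESHOLD CRITICAL_THRESHOLD
# Prints "critical", "warning", or "ok". Values must be integers.
classify_metric() {
  local value="$1" warn="$2" crit="$3"
  if (( value > crit )); then
    echo critical
  elif (( value > warn )); then
    echo warning
  else
    echo ok
  fi
}
```

With the CPU thresholds above, `classify_metric 75 70 90` prints `warning` and `classify_metric 95 70 90` prints `critical`.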
## 7. Alerts
### Alert Configuration
# Alerts
## Alert Definitions
### Alert 1: High CPU Usage
- **Name**: High CPU Usage
- **Condition**: CPU > 90% for 5 minutes
- **Severity**: Critical
- **Notification**: PagerDuty, Slack
- **Escalation**: 15 minutes
### Alert 2: High Error Rate
- **Name**: High Error Rate
- **Condition**: Error rate > 5% for 5 minutes
- **Severity**: Critical
- **Notification**: PagerDuty, Slack
- **Escalation**: 10 minutes
### Alert 3: High Memory Usage
- **Name**: High Memory Usage
- **Condition**: Memory > 80% for 5 minutes
- **Severity**: Warning
- **Notification**: Slack
- **Escalation**: 30 minutes
### Alert 4: Slow Response Time
- **Name**: Slow Response Time
- **Condition**: P95 > 500ms for 5 minutes
- **Severity**: Warning
- **Notification**: Slack
- **Escalation**: 30 minutes
## Severity Levels
### Critical
- **Definition**: Service is down or severely degraded
- **Response Time**: < 15 minutes
- **Notification**: PagerDuty, Slack, Email
- **Escalation**: 15 minutes
### High
- **Definition**: Service is degraded but functional
- **Response Time**: < 30 minutes
- **Notification**: Slack, Email
- **Escalation**: 30 minutes
### Medium
- **Definition**: Potential issue detected
- **Response Time**: < 1 hour
- **Notification**: Slack
- **Escalation**: 1 hour
### Low
- **Definition**: Informational alert
- **Response Time**: < 4 hours
- **Notification**: Email
- **Escalation**: None
## Notification Channels
### PagerDuty
- **Service**: [Service name]
- **Escalation Policy**: [Policy name]
- **On-Call Schedule**: [Schedule name]
### Slack
- **Channel**: #alerts-[service-name]
- **Webhook**: [Webhook URL]
### Email
- **Recipients**: [Email list]
- **Subject**: [Alert prefix]
## Escalation Paths
### Level 1: On-Call Engineer
- **Response Time**: 15 minutes
- **Contact**: [Contact method]
- **Capabilities**: Full access
### Level 2: Team Lead
- **Response Time**: 30 minutes
- **Contact**: [Contact method]
- **Capabilities**: Full access
### Level 3: Engineering Manager
- **Response Time**: 1 hour
- **Contact**: [Contact method]
- **Capabilities**: Full access
### Level 4: VP of Engineering
- **Response Time**: 2 hours
- **Contact**: [Contact method]
- **Capabilities**: Full access
## 8. Troubleshooting
### Troubleshooting Procedures
# Troubleshooting
## Common Issues
### Issue 1: Service Not Responding
**Symptoms**
- 503 Service Unavailable errors
- Connection timeouts
- Health check failing
**Diagnostic Steps**
```bash
# Check if service is running
systemctl status [service-name]
# Check the most recent errors (use tail -f instead to follow the log live)
tail -n 100 /var/log/app/error.log
# Check CPU and memory (single non-interactive snapshot)
top -bn1 | head -n 20
# Check network connectivity
ping -c 4 [service-hostname]
```
**Resolution Steps**
```bash
# Restart service
systemctl restart [service-name]
# Verify service is running
systemctl status [service-name]
# Check health endpoint
curl https://api.example.com/health
```
**If Issue Persists**
- Check recent deployments
- Review recent code changes
- Check database connectivity
- Check external dependencies
- Escalate to team lead
### Issue 2: High Error Rate
**Symptoms**
- Increased error rate in monitoring
- User complaints
- Error logs increasing
**Diagnostic Steps**
```bash
# Check the most recent errors (use tail -f instead to follow the log live)
tail -n 100 /var/log/app/error.log
# Check error rate in monitoring
# Visit monitoring dashboard
# Check recent deployments
git log --oneline -10
# Check database status
./scripts/check-db.sh
```
**Resolution Steps**
```bash
# If recent deployment, rollback
./scripts/rollback.sh
# If database issue, restart database
systemctl restart postgresql
# If external dependency issue, check status
curl https://external-api.com/health
```
**If Issue Persists**
- Check external dependencies
- Review recent code changes
- Check for configuration changes
- Escalate to team lead
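When the monitoring dashboard is unavailable, the error rate can be estimated straight from the access log. The sketch below assumes the combined log format, where the HTTP status is the ninth whitespace-separated field, and counts 5xx responses only; `error_rate_pct` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Usage: error_rate_pct ACCESS_LOG
# Prints the percentage of requests with a 5xx status, to two decimals.
# Assumes combined log format (status code in field 9).
error_rate_pct() {
  awk '$9 ~ /^5[0-9][0-9]$/ { errors++ }
       END {
         if (NR == 0) print "0.00"
         else printf "%.2f\n", errors * 100 / NR
       }' "$1"
}
```

For a quick look at recent traffic only, the log can be trimmed first, for example `tail -n 10000 /var/log/nginx/access.log > /tmp/recent.log` followed by `error_rate_pct /tmp/recent.log`.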
### Issue 3: Slow Response Time
**Symptoms**
- Increased response times
- User complaints
- Performance degradation
**Diagnostic Steps**
```bash
# Check CPU and memory (single non-interactive snapshot)
top -bn1 | head -n 20
# Check database queries
./scripts/check-db-queries.sh
# Check network latency
ping -c 4 [database-hostname]
# Check disk I/O (five one-second samples)
iostat -x 1 5
```
**Resolution Steps**
```bash
# Restart service
systemctl restart [service-name]
# Clear cache
./scripts/clear-cache.sh
# If database issue, optimize queries
./scripts/optimize-db.sh
```
**If Issue Persists**
- Check for resource contention
- Review recent code changes
- Check for external factors
- Escalate to team lead
### Issue 4: Database Connection Issues
**Symptoms**
- Database connection errors
- Slow queries
- Timeouts
**Diagnostic Steps**
```bash
# Check database status
systemctl status postgresql
# Check the most recent database log entries
tail -n 100 /var/log/postgresql/postgresql.log
# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Test database connectivity
psql -h [hostname] -U [username] -d [database]
```
**Resolution Steps**
```bash
# Restart database
systemctl restart postgresql
# Check connection pool
./scripts/check-connection-pool.sh
# If connection limit reached, increase max_connections
# (inspect the current limit first; changing it requires a PostgreSQL restart)
psql -c "SHOW max_connections;"
```
**If Issue Persists**
- Check database configuration
- Review recent schema changes
- Check for long-running queries
- Escalate to DBA
---
## 9. Emergency Procedures
### Emergency Response
# Emergency Procedures
## Critical Incident Response
### Step 1: Acknowledge Alert
- [ ] Acknowledge in monitoring system
- [ ] Update incident status
- [ ] Notify team
### Step 2: Assess Impact
- [ ] Determine scope of issue
- [ ] Identify affected users
- [ ] Estimate business impact
### Step 3: Initial Mitigation
- [ ] Implement quick fix if possible
- [ ] Disable affected features
- [ ] Route traffic to healthy instances
### Step 4: Root Cause Analysis
- [ ] Review logs
- [ ] Check metrics
- [ ] Identify root cause
### Step 5: Permanent Fix
- [ ] Implement fix
- [ ] Test fix
- [ ] Deploy fix
### Step 6: Verification
- [ ] Verify issue is resolved
- [ ] Monitor for recurrence
- [ ] Document incident
## Emergency Contacts
### On-Call Engineer
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
### Team Lead
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
### Engineering Manager
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
### VP of Engineering
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
## Communication Templates
### Incident Announcement
🚨 INCIDENT ANNOUNCEMENT 🚨
- Service: [Service Name]
- Severity: [Critical/High/Medium/Low]
- Status: [Investigating/Identified/Monitoring/Resolved]
- Started: [Timestamp]
- Description: [Brief description of the issue]
- Impact: [Description of impact]
- Next Update: [Time]
### Incident Update
📊 INCIDENT UPDATE 📊
- Service: [Service Name]
- Status: [Investigating/Identified/Monitoring/Resolved]
- Update: [Description of update]
- Next Update: [Time]
### Incident Resolution
✅ INCIDENT RESOLVED ✅
- Service: [Service Name]
- Duration: [Duration]
- Root Cause: [Root cause]
- Resolution: [Resolution applied]
- Post-Incident Review: [Date and time of post-incident review]
## Escalation Procedures
### When to Escalate
- Issue not resolved in [X] minutes
- Severity level increases
- Multiple services affected
- Business impact critical
### Escalation Process
1. Notify next level in escalation path
2. Provide incident summary
3. Hand off incident ownership
4. Update incident status
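During an incident, filling in the announcement template by hand is easy to get wrong. A small formatter, sketched here with a hypothetical function name and positional arguments, produces the announcement text from the same fields as the template above; how the text is then posted (Slack, email, status page) is left open.

```bash
#!/usr/bin/env bash
# Usage: incident_announcement SERVICE SEVERITY STATUS STARTED DESCRIPTION IMPACT NEXT_UPDATE
# Prints the incident announcement with one field per line.
incident_announcement() {
  local service="$1" severity="$2" status="$3" started="$4"
  local description="$5" impact="$6" next_update="$7"
  cat <<EOF
🚨 INCIDENT ANNOUNCEMENT 🚨
Service: ${service}
Severity: ${severity}
Status: ${status}
Started: ${started}
Description: ${description}
Impact: ${impact}
Next Update: ${next_update}
EOF
}
```

Example call: `incident_announcement "checkout-api" "Critical" "Investigating" "2024-01-15T10:00:00Z" "Elevated 5xx errors" "Checkout unavailable" "10:30 UTC"`.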
## 10. Disaster Recovery
### Disaster Recovery Plan
# Disaster Recovery
## Recovery Objectives
### Recovery Time Objective (RTO)
- **Critical Services**: 4 hours
- **Important Services**: 8 hours
- **Non-Critical Services**: 24 hours
### Recovery Point Objective (RPO)
- **Critical Services**: 15 minutes
- **Important Services**: 1 hour
- **Non-Critical Services**: 24 hours
## Backup Strategy
### Database Backups
- **Frequency**: Every 15 minutes
- **Retention**: 30 days
- **Location**: [Backup location]
- **Type**: Incremental + Full daily
### Application Backups
- **Frequency**: Daily
- **Retention**: 90 days
- **Location**: [Backup location]
- **Type**: Full
### Configuration Backups
- **Frequency**: On change
- **Retention**: 365 days
- **Location**: [Backup location]
- **Type**: Version control
## Recovery Procedures
### Database Recovery
```bash
# Stop application
systemctl stop [service-name]
# Restore from backup (pg_restore reads custom, directory, or tar format dumps;
# plain SQL dumps are restored with psql instead)
pg_restore -d [database] [backup-file]
# Verify data
psql -d [database] -c "SELECT COUNT(*) FROM [table];"
# Start application
systemctl start [service-name]
# Verify application
curl https://api.example.com/health
```
### Application Recovery
```bash
# Restore application backup
./scripts/restore-app.sh [backup-id]
# Verify configuration
./scripts/verify-config.sh
# Start services
systemctl start [service-name]
# Verify health
curl https://api.example.com/health
```
## Failover Procedures
### Primary to Secondary
```bash
# Promote secondary to primary
./scripts/promote-secondary.sh
# Update DNS
./scripts/update-dns.sh
# Verify failover
curl https://api.example.com/health
```
### Secondary to Primary
```bash
# Restore primary
./scripts/restore-primary.sh
# Update DNS
./scripts/update-dns.sh
# Verify restoration
curl https://api.example.com/health
```
## Testing
### Backup Testing
- **Frequency**: Monthly
- **Scope**: Random backup
- **Verification**: Data integrity
- **Documentation**: Test results
### Recovery Testing
- **Frequency**: Quarterly
- **Scope**: Full recovery drill
- **Verification**: Service functionality
- **Documentation**: Test results
### Failover Testing
- **Frequency**: Semi-annually
- **Scope**: Full failover
- **Verification**: Service continuity
- **Documentation**: Test results
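A recovery point objective only holds if the newest backup is recent enough. The check below is a sketch with a hypothetical function name: it compares the age of the latest backup with the RPO in minutes, taking timestamps as Unix epoch seconds so the logic can be tested without touching real backups.

```bash
#!/usr/bin/env bash
# Usage: backup_within_rpo BACKUP_EPOCH NOW_EPOCH RPO_MINUTES
# Succeeds (exit 0) when the backup is no older than the RPO, fails otherwise.
backup_within_rpo() {
  local backup_epoch="$1" now_epoch="$2" rpo_minutes="$3"
  local age_seconds=$(( now_epoch - backup_epoch ))
  (( age_seconds <= rpo_minutes * 60 ))
}

# Example (not executed here; the path is hypothetical and `stat -c %Y` is GNU coreutils):
#   backup_within_rpo "$(stat -c %Y /backups/latest.dump)" "$(date +%s)" 15 \
#     || echo "Latest backup is older than the 15-minute RPO" >&2
```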
---
## 11. Maintenance Procedures
### Maintenance Tasks
# Maintenance
## Routine Maintenance
### Daily Tasks
- [ ] Review logs for errors
- [ ] Check disk space
- [ ] Verify backup completion
- [ ] Review monitoring alerts
### Weekly Tasks
- [ ] Review performance metrics
- [ ] Check for security updates
- [ ] Review system logs
- [ ] Clean up temporary files
### Monthly Tasks
- [ ] Review and rotate logs
- [ ] Update documentation
- [ ] Review capacity planning
- [ ] Test backup restoration
### Quarterly Tasks
- [ ] Review and update runbooks
- [ ] Conduct disaster recovery test
- [ ] Review security posture
- [ ] Update software versions
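The daily "check disk space" task above lends itself to automation. The sketch below filters POSIX-format `df -P` output for filesystems above a usage threshold; the function name and the 80% figure in the usage line are assumptions, and mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash
# Usage: df -P | disks_over_threshold THRESHOLD_PERCENT
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# filesystem whose usage is above THRESHOLD_PERCENT.
disks_over_threshold() {
  awk -v threshold="$1" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)          # strip the trailing percent sign
    if (use + 0 > threshold + 0) print $6, use "%"
  }'
}
```

Typical use is `df -P | disks_over_threshold 80`. Reading from stdin rather than calling `df` inside the function keeps it easy to test with canned input.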
## Backup and Restore
### Backup Procedures
```bash
# Database backup
pg_dump -U [username] -d [database] > [backup-file].sql
# Application backup
tar -czf [backup-file].tar.gz /var/www/app
# Configuration backup
cp /etc/app/config.yaml /backups/config-$(date +%Y%m%d).yaml
```
### Restore Procedures
```bash
# Database restore
psql -U [username] -d [database] < [backup-file].sql
# Application restore (tar stores members as var/www/app/..., so extract relative to /)
tar -xzf [backup-file].tar.gz -C /
# Configuration restore
cp /backups/config-[date].yaml /etc/app/config.yaml
```
Health Checks
Application Health Check
# Run health check
curl https://api.example.com/health
# Expected response
{
"status": "healthy",
"version": "1.0.0",
"timestamp": "2024-01-15T10:00:00Z"
}
Database Health Check
# Check database status
systemctl status postgresql
# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
# Check database size
psql -c "SELECT pg_size_pretty(pg_database_size('[database]'));"
### System Health Check
# Check CPU
top -bn1 | grep "Cpu(s)"
# Check memory
free -h
# Check disk
df -h
# Check network
ping -c 4 [hostname]
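The system checks above print raw numbers that someone has to interpret. A small wrapper can turn the disk check into a pass/fail result. The sketch below assumes POSIX-format `df -P` output and mount points without spaces; the function name `disks_over_threshold` and the default of 80 percent are illustrative choices, not values from the original runbook.

```shell
# Read `df -P` output on stdin and print "<mount point> <usage>" for every
# filesystem whose usage exceeds the given percentage (default: 80).
# Returns 1 if at least one filesystem is over the limit, 0 otherwise.
disks_over_threshold() {
    local threshold="${1:-80}"
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 > limit) { print $6 " " $5; found = 1 }
        }
        END { exit (found ? 1 : 0) }
    '
}

# Example (not run here): df -P | disks_over_threshold 90
```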
## Performance Tuning
### Database Tuning
# Analyze query performance
./scripts/analyze-queries.sh
# Update statistics
psql -c "ANALYZE;"
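# Reindex if needed (blocks writes to the affected tables; run in a maintenance window)
# REINDEX DATABASE only works on the database the session is connected to
psql -d [database] -c "REINDEX DATABASE [database];"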
### Application Tuning
# Review and update configuration
vim /etc/app/config.yaml
# Restart service
systemctl restart [service-name]
# Monitor performance
./scripts/monitor-performance.sh
### System Tuning
# Update system parameters
vim /etc/sysctl.conf
# Apply changes
sysctl -p
# Reboot if needed
reboot
---
## 12. Tools and Access
### Required Tools
```markdown
# Tools and Access
## Required Tools
### Command Line Tools
- **SSH**: For server access
- **curl**: For API testing
- **jq**: For JSON parsing
- **psql**: For database access
### Monitoring Tools
- **Grafana**: For dashboards
- **Prometheus**: For metrics
- **Kibana**: For log analysis
### Development Tools
- **Git**: For version control
- **Docker**: For container management
- **Kubernetes**: For orchestration
## Access Requirements
### Server Access
- **SSH Key**: Required for SSH access
- **VPN**: Required for remote access
- **MFA**: Required for authentication
### Database Access
- **Username**: [Database username]
- **Password**: [Password manager reference]
- **Connection String**: [Connection string]
### API Access
- **API Key**: [API key reference]
- **Authentication**: OAuth token
- **Base URL**: https://api.example.com
## Command References
### Service Management
```bash
# Start service
systemctl start [service-name]
# Stop service
systemctl stop [service-name]
# Restart service
systemctl restart [service-name]
# Check status
systemctl status [service-name]
# View logs
journalctl -u [service-name] -f
### Database Commands
# Connect to database
psql -h [hostname] -U [username] -d [database]
# Run query
psql -c "SELECT * FROM [table] LIMIT 10;"
# Import data
psql -d [database] < [file].sql
# Export data
pg_dump -U [username] -d [database] > [file].sql
### Monitoring Commands
# Check CPU
top
# Check memory
free -h
# Check disk
df -h
# Check processes
ps aux
## Useful Scripts
### Health Check Script
#!/bin/bash
# health-check.sh: verify that the service is running and its health endpoint responds
echo "Checking service health..."
# Check service status
if systemctl is-active --quiet [service-name]; then
    echo "✓ Service is running"
else
    echo "✗ Service is not running"
    exit 1
fi
# Check health endpoint (-f fails on HTTP errors, -sS hides progress but still shows errors)
if curl -fsS -o /dev/null https://api.example.com/health; then
    echo "✓ Health check passed"
else
    echo "✗ Health check failed"
    exit 1
fi
echo "All checks passed!"
### Backup Script
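#!/bin/bash
# backup.sh: back up the database and the application files
# Stop on the first failure so a partial backup is never reported as complete
set -euo pipefail
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups"
# Database backup
pg_dump -U [username] -d [database] > "$BACKUP_DIR/db_$DATE.sql"
# Application backup
tar -czf "$BACKUP_DIR/app_$DATE.tar.gz" /var/www/app
echo "Backup completed: $DATE"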
### Restore Script
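#!/bin/bash
# restore.sh: restore the database from a SQL backup file
set -euo pipefail
BACKUP_FILE="${1:-}"
if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup-file>" >&2
    exit 1
fi
if [ ! -r "$BACKUP_FILE" ]; then
    echo "Backup file not found or not readable: $BACKUP_FILE" >&2
    exit 1
fi
# Restore database (ON_ERROR_STOP makes psql exit non-zero on the first SQL error)
psql -v ON_ERROR_STOP=1 -U [username] -d [database] < "$BACKUP_FILE"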
echo "Restore completed"
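Before running a restore like the one above, it can help to confirm that the backup file is usable. The sketch below is one possible pre-flight check, assuming the backup is a plain-text dump written by `pg_dump` in its default format (such dumps normally carry the comment `PostgreSQL database dump` near the top). The function name `check_backup_file` is hypothetical.

```shell
# Return 0 if the file exists, is non-empty, and looks like a plain pg_dump file.
check_backup_file() {
    local backup_file="$1"
    if [ ! -r "$backup_file" ]; then
        echo "Backup file not found or not readable: $backup_file" >&2
        return 1
    fi
    if [ ! -s "$backup_file" ]; then
        echo "Backup file is empty: $backup_file" >&2
        return 1
    fi
    # Plain-format pg_dump output normally announces itself in a header comment.
    if ! head -n 5 "$backup_file" | grep -q 'PostgreSQL database dump'; then
        echo "File does not look like a plain pg_dump file: $backup_file" >&2
        return 1
    fi
    return 0
}

# Example (not run here): check_backup_file /backups/db_[date].sql && ./restore.sh /backups/db_[date].sql
```

This only catches missing, empty, or obviously wrong files. A truncated dump can still pass, so it complements a restore drill and does not replace one.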
---
## 13. Contacts
### Contact Information
```markdown
# Contacts
## On-Call Rotation
### Current On-Call
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
- **Time Zone**: [Time zone]
### Upcoming Schedule
| Date | Name | Phone | Slack |
|------|------|-------|-------|
| [Date] | [Name] | [Phone] | [@username] |
| [Date] | [Name] | [Phone] | [@username] |
| [Date] | [Name] | [Phone] | [@username] |
## Subject Matter Experts
### Database
- **Name**: [Name]
- **Email**: [Email]
- **Slack**: [@username]
- **Expertise**: Database administration, optimization
### Security
- **Name**: [Name]
- **Email**: [Email]
- **Slack**: [@username]
- **Expertise**: Security, compliance
### Infrastructure
- **Name**: [Name]
- **Email**: [Email]
- **Slack**: [@username]
- **Expertise**: Infrastructure, networking
## Escalation Contacts
### Team Lead
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
### Engineering Manager
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
### VP of Engineering
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
## Support Channels
### Internal Support
- **Slack Channel**: #support-[service-name]
- **Email**: support@example.com
- **Ticket System**: https://support.example.com
### External Support
- **Vendor**: [Vendor name]
- **Support Email**: [Email]
- **Support Phone**: [Phone]
- **Support Portal**: [URL]
### Emergency Contacts
- **IT Security**: [Phone]
- **Data Center**: [Phone]
- **Cloud Provider**: [Phone]
## 14. Testing Runbooks
### Runbook Testing Checklist
# Testing Runbooks
## Testing Checklist
### Accuracy Testing
- [ ] All commands work as documented
- [ ] All links are valid
- [ ] All contact information is current
- [ ] All procedures are complete
### Completeness Testing
- [ ] All scenarios are covered
- [ ] All edge cases are addressed
- [ ] All dependencies are documented
- [ ] All tools are listed
### Usability Testing
- [ ] Procedures are clear and concise
- [ ] Steps are in logical order
- [ ] Examples are helpful
- [ ] Troubleshooting is comprehensive
### Emergency Testing
- [ ] Emergency procedures work
- [ ] Escalation paths are clear
- [ ] Contact information is accurate
- [ ] Communication templates are usable
## Testing Schedule
### Regular Testing
- **Frequency**: Quarterly
- **Scope**: All procedures
- **Participants**: On-call team
- **Documentation**: Test results
### Emergency Testing
- **Frequency**: Semi-annually
- **Scope**: Emergency procedures
- **Participants**: All stakeholders
- **Documentation**: Test results and improvements
## Test Documentation
### Test Report Template
```markdown
# Runbook Test Report
**Date**: [Date]
**Tester**: [Name]
**Runbook**: [Runbook Name]
## Test Results
### Accuracy
- [ ] Commands tested
- [ ] Links verified
- [ ] Contacts validated
### Completeness
- [ ] Scenarios covered
- [ ] Edge cases addressed
- [ ] Dependencies documented
### Usability
- [ ] Procedures clear
- [ ] Steps logical
- [ ] Examples helpful
### Emergency
- [ ] Procedures work
- [ ] Escalation clear
- [ ] Contacts accurate
## Issues Found
1. [Issue description]
2. [Issue description]
3. [Issue description]
## Recommendations
1. [Recommendation]
2. [Recommendation]
3. [Recommendation]
## Next Review Date
[Date]
---
## 15. Templates
### Runbook Templates
```markdown
# Runbook Template
# [Service Name] Runbook
## Service Overview
[Fill in service overview]
## Architecture
[Fill in architecture details]
## Deployment
[Fill in deployment procedures]
## Monitoring
[Fill in monitoring details]
## Alerts
[Fill in alert configuration]
## Troubleshooting
[Fill in troubleshooting procedures]
## Emergency Procedures
[Fill in emergency procedures]
## Disaster Recovery
[Fill in disaster recovery plan]
## Maintenance
[Fill in maintenance procedures]
## Tools and Access
[Fill in tools and access details]
## Contacts
[Fill in contact information]
## Version History
| Version | Date | Author | Changes |
|---------|------|---------|---------|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |
## 16. Best Practices
### Runbook Best Practices
# Best Practices
## 1. Keep Current
- Update regularly
- Review quarterly
- Test procedures
- Archive old versions
## 2. Be Clear
- Use simple language
- Avoid jargon
- Provide context
- Explain decisions
## 3. Be Complete
- Cover all scenarios
- Include edge cases
- Document dependencies
- Provide examples
## 4. Be Accessible
- Use standard formats
- Support search
- Provide navigation
- Include glossary
## 5. Be Practical
- Focus on what works
- Provide real solutions
- Include workarounds
- Share lessons learned
## 6. Be Collaborative
- Involve team
- Get feedback
- Update regularly
- Share knowledge
## Quick Start
### Basic Runbook Template
# Runbook: [Task Name]
## Overview
[Brief description of what this runbook covers]
## Prerequisites
- [ ] Requirement 1
- [ ] Requirement 2
## Steps
### Step 1: [Action]
1. Do this
2. Then this
3. Verify result
### Step 2: [Action]
[Detailed steps]
## Troubleshooting
### Problem: [Issue]
**Symptoms:** [What you see]
**Solution:** [How to fix]
## Rollback
If something goes wrong:
1. Step 1
2. Step 2
## Related Runbooks
- [Related runbook 1]
- [Related runbook 2]
## Production Checklist
- Clear Purpose: Each runbook has clear purpose and scope
- Step-by-Step: Instructions are numbered and sequential
- Prerequisites: All prerequisites listed and verified
- Verification: Each step includes verification criteria
- Troubleshooting: Common issues and solutions documented
- Rollback: Rollback procedures included for risky operations
- Testing: Runbooks tested by someone other than author
- Updates: Runbooks reviewed and updated regularly
- Accessibility: Runbooks easily accessible to team
- Version Control: Runbooks versioned and tracked
- Feedback: Mechanism for team to provide feedback
- Links: Related runbooks and documentation linked
## Anti-patterns
### ❌ Don't: Vague Instructions
# ❌ Bad - Unclear
## Step 1: Restart the service
[No details on how]
# ✅ Good - Specific
## Step 1: Restart the service
1. SSH into server: `ssh user@server`
2. Check service status: `systemctl status myservice`
3. Restart service: `sudo systemctl restart myservice`
4. Verify: `systemctl status myservice` (should show "active")
### ❌ Don't: Missing Prerequisites
# ❌ Bad - No prerequisites
## Steps
1. Run deployment script
# ✅ Good - Prerequisites listed
## Prerequisites
- [ ] Access to production environment
- [ ] Database backup completed
- [ ] Team notified of maintenance window
## Steps
1. Run deployment script
### ❌ Don't: No Troubleshooting
# ❌ Bad - No help when things go wrong
## Steps
1. Deploy application
[What if it fails?]
# ✅ Good - Troubleshooting included
## Steps
1. Deploy application
## Troubleshooting
### Problem: Deployment fails
**Symptoms:** Error message "Connection timeout"
**Solution:**
1. Check network connectivity
2. Verify credentials
3. Retry with increased timeout
## Integration Points
- Technical Writing (`21-documentation/technical-writing/`) - Clear documentation
- System Architecture (`21-documentation/system-architecture-docs/`) - System context
- Incident Management (`41-incident-management/`) - Incident response runbooks
## Further Reading
- Google SRE Runbook
- Runbook Best Practices
- Get peer reviews
- Share knowledge
- Use version control
- Build consensus
## 7. Be Tested
- Test all procedures
- Verify all commands
- Check all links
- Validate all contacts
## 8. Be Useful
- Focus on audience
- Answer questions
- Solve problems
- Enable decisions