Agent skills
Operational Runbooks

Agent skill

Operational Runbooks

Step-by-step procedures for operating and troubleshooting systems, including incident response, maintenance tasks, and operational workflows.

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/runbooks

SKILL.md

Operational Runbooks

Current Level: Expert (Enterprise Scale) Domain: Documentation / Operations

Overview

Runbooks provide step-by-step procedures for operating and troubleshooting systems. Effective runbooks enable teams to handle incidents, perform maintenance, and operate systems consistently with clear, actionable instructions.

1. Executive Summary & Strategic Necessity

Context: ในปี 2025-2026 Operational Runbooks ด้วย Best Practices ช่วย Operations ที่มีอัตโนมาติการทำงานอัตโนมาติ (Operational Excellence) ใน Enterprise Scale
Business Impact: Operational Runbooks ช่วยลด Downtime ของระบบ Operations ผ่านการตอบคำถามอัตโนมาติการเขียนเอกสาร (Reduce MTTR), ลดต้นทุนการจัดการทีม (Increase operational efficiency), เพิ่มอัตรากำไร Gross Margin ผ่านการทำงานอัตโนมาติ (Faster incident resolution), และปรับประสบทการทำงาน (Consistent operations)
Product Thinking: Operational Runbooks ช่วยแก้ปัญหา (Pain Point) ความต้องการมีการเขียนเอกสารที่ชัดเจน (Teams need clear operational procedures) ผ่านการทำงานอัตโนมาติ (Standardized procedures)

2. Technical Deep Dive (The "How-to")

Core Logic: Operational Runbooks ใช้ Best Practices ช่วย Operations ทำงานอัตโนมาติ:
1. Runbook Structure: กำหนด Runbook Structure (Service Overview, Architecture, Deployment, Monitoring, Alerts, Troubleshooting, Emergency Procedures, Maintenance, Tools and Access, Contacts)
2. Runbook Templates: สร้าง Runbook Templates สำหรับการเขียน runbooks (Basic template, Advanced template)
3. Runbook Testing: จัดการ Runbook Testing (Accuracy testing, Completeness testing, Usability testing, Emergency testing)
4. Runbook Maintenance: สร้าง Runbook Maintenance สำหรับการดูแล runbooks (Regular updates, Quarterly reviews, Testing procedures)
5. Agent-Friendly Runbooks: สร้าง Agent-Friendly Runbooks สำหรับ AI Agents (Clear steps, Verification criteria, Troubleshooting procedures)
Architecture Diagram Requirements: แผนผังระบบ Operational Runbooks ต้องมีองค์ประกอบ:
1. Runbook Repository: Runbook Repository สำหรับการจัดเก็บ runbooks (Git repository, Documentation platform)
2. Runbook Templates: Runbook Templates สำหรับการเขียน runbooks (Basic template, Advanced template)
3. Runbook Testing Framework: Runbook Testing Framework สำหรับการทดสอบ runbooks (Accuracy testing, Completeness testing, Usability testing)
4. Runbook Maintenance System: Runbook Maintenance System สำหรับการดูแล runbooks (Version control, Review process, Update notifications)
5. Runbook Search: Runbook Search สำหรับการค้นหา runbooks (Full-text search, Tag-based search, Category-based search)
6. Runbook Analytics: Runbook Analytics สำหรับการวิเคราะห์ runbook usage (Usage metrics, Search analytics, Feedback collection)
7. Observability: Logging, Monitoring, Tracing สำหรับการ debug และปรับสิทท
Implementation Workflow: ขั้นตอนการนำ Operational Runbooks ไปใช้งานจริง:
1. Planning Phase: กำหนด Requirement และเลือก Runbook Platform ที่เหมาะสม
2. Runbook Repository Setup: ตั้งค่า Runbook Repository สำหรับการจัดเก็บ runbooks
3. Runbook Templates Creation: สร้าง Runbook Templates สำหรับการเขียน runbooks
4. Runbook Testing Framework Setup: ตั้งค่า Runbook Testing Framework สำหรับการทดสอบ runbooks
5. Runbook Maintenance System Setup: ตั้งค่า Runbook Maintenance System สำหรับการดูแล runbooks
6. Testing Phase: Unit test, Integration test, E2E test ด้วยจริง Scenario
7. Deployment: Deploy ด้วย CI/CD pipeline, Set up Monitoring
8. Optimization: Optimize runbook search, Add runbook templates, Improve UX
9. Maintenance: Monitor runbook usage, Update runbook content, Handle edge cases

3. Tooling & Tech Stack

Enterprise Tools: เครื่องมือระดับอุตสาหกรรมที่เลือกใช้สำหรับ Operational Runbooks ใน Enterprise Scale:
1. Git: Version control system สำหรับ runbook versioning
2. GitHub/GitLab: Git hosting platforms สำหรับ runbook collaboration
3. Confluence: Documentation platform สำหรับ runbook storage
4. Notion: Documentation platform สำหรับ runbook storage
5. GitBook: Documentation platform สำหรับ runbook storage
6. Algolia: Search platform สำหรับ runbook search
7. Elasticsearch: Search platform สำหรับ runbook search
8. Grafana: Monitoring platform สำหรับ runbook analytics
9. Prometheus: Metrics platform สำหรับ runbook analytics
10. PagerDuty: Incident management platform สำหรับ runbook integration
Configuration Essentials: การตั้งค่าสำคัญสำหรับให้ระบบเสถียร Operational Runbooks:
1. Runbook Templates: ตั้งค่า Runbook Templates (Basic template, Advanced template)
2. Runbook Testing: ตั้งค่า Runbook Testing (Accuracy testing, Completeness testing, Usability testing)
3. Runbook Maintenance: ตั้งค่า Runbook Maintenance (Version control, Review process, Update notifications)
4. Runbook Search: ตั้งค่า Runbook Search (Full-text search, Tag-based search, Category-based search)
5. Monitoring: ตั้งค่า Monitoring สำหรับ tracking runbook usage (Usage metrics, Search analytics, Feedback collection)
6. Secret Management: Use Environment variables หรือ Secret Manager (AWS Secrets Manager, HashiCorp Vault)
7. Rate Limiting: Per-user และ Per-IP rate limits สำหรับป้องกัน Abuse (100-1000 requests/hour)
8. Logging Level: INFO สำหรับ Production, DEBUG สำหรับ Development
9. Observability: Track success rate, runbook usage, search accuracy ต่อเป้าหลาย
10. Documentation: Maintain Runbook Platform documentation สำหรับ team reference

4. Standards, Compliance & Security

International Standards: มาตรฐานที่เกี่ยวข้อง:
1. ISO/IEC 27001: Information Security Management - สำหรับการจัดการ Secrets และ Access Control
2. ISO/IEC 27017: Code of Practice for Information Security Controls - สำหรับ Secure Operations
3. ISO/IEC 22301: Business Continuity Management - สำหรับ Disaster Recovery
4. GDPR: General Data Protection Regulation - สำหรับการจัดการ Personal Data และ User Consent
5. SOC 2 Type II: Security Controls - สำหรับการ Audit และ Compliance
Security Protocol: กลไกการป้องกัน Operational Runbooks:
1. Input Validation: Validate และ Sanitize ทุก Input ก่อน processing (Prevent XSS, SQL injection)
2. Output Sanitization: Filter sensitive information จาก runbooks (API keys, Secrets, Passwords)
3. Access Control: RBAC (Role-Based Access Control) สำหรับ runbook access - บาง runbooks internal only
4. Audit Trail: Log ทุก runbook access ด้วย Timestamp, User ID, และ Runbook accessed (สำหรับ Forensics และ Compliance)
5. Rate Limiting: Per-user และ Per-IP rate limits สำหรับป้องกัน Abuse (100-1000 requests/hour)
6. Secure Communication: TLS 1.3 สำหรับ HTTPS access
7. Secret Management: Use Environment variables หรือ Secret Manager (AWS Secrets Manager, HashiCorp Vault)
8. Content Security: CSP headers สำหรับ preventing XSS attacks
9. Authentication: Implement authentication สำหรับ internal runbooks (SSO, OAuth)
10. Data Encryption: Encrypt sensitive data ที่ rest ใน Database (AES-256 หรือ Customer-managed keys)
Explainability: (สำหรับ Runbooks) ความสามารถในการอธิบายผลลัพธ์ผ่านเทคนิค:
1. Clear Structure: เก็บ runbook structure สำหรับ easy understanding
2. Detailed Steps: Provide detailed steps สำหรับ complex procedures
3. Verification Criteria: Include verification criteria สำหรับ each step
4. Troubleshooting Procedures: Provide troubleshooting procedures สำหรับ common issues
5. Reference Links: Link to external documentation สำหรับ complex topics

5. Unit Economics & Performance Metrics (KPIs)

Cost Calculation: สูตรการคำนวณต้นทุกต่อหน่วย Operational Runbooks:
1. Platform Cost = Platform subscription × Cost per user/month
  - Confluence: $5-15/user/month
  - Notion: $8-15/user/month
  - GitBook: $8-20/user/month
2. Storage Cost = Runbook storage × Cost per GB/month
  - GitHub Pages: Free
  - GitLab Pages: Free
  - S3: $0.023/GB/month
3. Search Cost = Search API calls × Cost per 1000 calls
  - Algolia: $1-5/1000 calls
  - Elasticsearch: $0-100/month (self-hosted)
4. Monitoring Cost = Monitoring platform subscription × Cost per month
  - Grafana Cloud: $50-500/month
  - Datadog: $15-23/host/month
5. Total Monthly Cost = Platform Cost + Storage Cost + Search Cost + Monitoring Cost
6. Infrastructure Costs = Compute ($0/month for static sites) + Storage ($0/month for static sites) + Monitoring ($50-500/month)
Key Performance Indicators: ตัวชี้วัดความสำเร็จทางเทคนิค:
1. Runbook Coverage: เปอร์เซ็นต์ของ services ที่มี runbooks (Target: >80%)
2. Runbook Quality Score: คะแนน runbook quality จาก automated checks (Target: >4.0)
3. Runbook Accuracy: เปอร์เซ็นต์ของ runbooks ที่มี accurate procedures (Target: >95%)
4. Runbook Completeness: เปอร์เซ็นต์ของ runbooks ที่มี complete procedures (Target: >90%)
5. Operator Satisfaction Score: 1-5 rating จาก Operator feedback (Target: >4.0)
6. Error Rate: อัตราการ Error (Target: <1%)
7. MTTR: Mean Time To Resolution (Target: <30 minutes)
8. Runbook Usage: เปอร์เซ็นต์ของ operators ที่ใช้ runbooks (Target: >80%)
9. Search Accuracy: เปอร์เซ็นต์ของ search results ที่ relevant (Target: >90%)
10. Knowledge Transfer: เปอร์เซ็นต์ของ knowledge transfer (Target: >80%)

6. Strategic Recommendations (CTO Insights)

Phase Rollout: คำแนะนำในการทยอยเริ่มใช้งาน Operational Runbooks เพื่อลดความเสี่ยง:
1. Phase 1: MVP (1-2 เดือน): Deploy Simple Operational Runbooks ด้วย Basic Templates และ Manual review สำหรับ Internal team ก่อนเปิดให้ Public
  - Goal: Validate Operational Runbooks architecture และ gather feedback
  - Success Criteria: >80% runbook coverage, <30s search time
  - Risk Mitigation: Internal-only access, Manual review ก่อน Public
2. Phase 2: Beta (2-3 เดือน): Expand ด้วย Advanced Templates และ Runbook Testing Framework สำหรับ Selected customers
  - Goal: Test scalability และ Runbook reliability
  - Success Criteria: >90% runbook coverage, <15s search time
  - Risk Mitigation: Canary deployment, Feature flags, Gradual rollout
3. Phase 3: GA (3-6 เดือน): Full rollout ด้วย Advanced features (Runbook Search, Runbook Analytics, Agent-Friendly Runbooks)
  - Goal: Enterprise-grade runbook quality และ Performance
  - Success Criteria: >95% runbook coverage, <10s search time, 99.9% uptime
  - Risk Mitigation: Load testing, Disaster recovery, Blue-green deployment
Pitfalls to Avoid: ข้อควรระวังที่มักจะผิดพลาดในระดับ Enterprise Scale:
1. Over-engineering: สร้าง Operational Runbooks ที่ซ้อนเกินไป (Too many sections, Complex templates) → เริ่มจาก Simple และ iterate
2. No Runbook Templates: ไม่มี Runbook Templates ทำให้ consistency ลด → Implement Runbook Templates สำหรับ common patterns
3. Outdated Runbooks: Runbooks ไม่ sync กับ systems → Implement automated runbook quality checks
4. Missing Runbook Testing: ไม่มี Runbook Testing ทำให้ runbooks ไม่ reliable → Implement Runbook Testing Framework
5. No Runbook Maintenance: ไม่มี Runbook Maintenance ทำให้ runbooks outdated → Implement Runbook Maintenance System
6. No Agent-Friendly Runbooks: ไม่มี Agent-Friendly Runbooks ทำให้ AI Agents สับสนใจ → Implement Agent-Friendly runbook guidelines
7. Poor Search: Search ไม่ดีทำให้ operators หา runbooks ไม่ได้ → Implement advanced search (Algolia, Elasticsearch)
8. No Runbook Analytics: ไม่มี Runbook Analytics ทำให้ไม่รู้ runbook usage → Implement runbook analytics
9. No Runbook Documentation: ไม่มี Runbook Documentation ทำให้ team reference → Maintain Runbook Platform documentation
10. Single Point of Failure: ไม่มี Redundancy หรือ Fallback → Deploy multiple instances ด้วย CDN

Core Concepts

1. Runbook Purpose

Why Runbooks Matter

markdown

# Runbook Purpose

## Benefits

### 1. Operational Excellence
- Standardized procedures
- Consistent operations
- Reduced errors
- Improved reliability

### 2. Knowledge Management
- Preserves institutional knowledge
- Onboards new team members
- Reduces dependency on individuals
- Enables knowledge sharing

### 3. Incident Response
- Faster resolution
- Reduced MTTR (Mean Time To Resolution)
- Clear communication
- Better coordination

### 4. Compliance and Auditing
- Documented procedures
- Audit trails
- Regulatory compliance
- Process improvement

## Consequences of Poor Runbooks

### 1. Operational Issues
- Inconsistent procedures
- Human errors
- Longer resolution times
- Increased downtime

### 2. Knowledge Loss
- When team members leave
- Over time
- During incidents
- Under pressure

### 3. Increased Costs
- More incidents
- Longer resolution times
- Higher support costs
- Lost revenue

### 4. Team Stress
- Uncertainty during incidents
- Lack of confidence
- Burnout
- Turnover

2. Runbook Structure

Standard Runbook Structure

markdown

# [Service Name] Runbook

## Service Overview
- Service description
- Purpose and scope
- Criticality level
- Dependencies

## Architecture
- System architecture
- Components and their roles
- Data flow
- Integration points

## Deployment
- Deployment process
- Deployment environments
- Rollback procedures
- Deployment verification

## Monitoring
- Key metrics
- Alert thresholds
- Monitoring dashboards
- Log sources

## Alerts
- Alert definitions
- Severity levels
- Notification channels
- Escalation paths

## Troubleshooting
- Common issues
- Diagnostic steps
- Resolution procedures
- Known workarounds

## Emergency Procedures
- Critical incidents
- Emergency contacts
- Escalation procedures
- Communication templates

## Maintenance
- Maintenance procedures
- Backup and restore
- Health checks
- Performance tuning

## Tools and Access
- Required tools
- Access requirements
- Command references
- Useful scripts

## Contacts
- On-call rotation
- Subject matter experts
- Escalation contacts
- Support channels

3. Service Overview

Service Overview Template

markdown

# Service Overview

## Service Name
[Service Name]

## Description
[Brief description of what the service does and its purpose]

## Criticality
- **Level**: [Critical | High | Medium | Low]
- **SLA**: [Service Level Agreement details]
- **RTO**: [Recovery Time Objective]
- **RPO**: [Recovery Point Objective]

## Purpose and Scope
- **Primary Purpose**: [Main function]
- **Key Features**: [List of key features]
- **Target Users**: [Who uses this service]
- **Service Boundaries**: [What's in scope and what's not]

## Dependencies

### Upstream Dependencies
- [Dependency 1] - [Purpose]
- [Dependency 2] - [Purpose]
- [Dependency 3] - [Purpose]

### Downstream Dependencies
- [Dependency 1] - [Purpose]
- [Dependency 2] - [Purpose]
- [Dependency 3] - [Purpose]

## Technology Stack
- **Language**: [Programming language]
- **Framework**: [Framework name]
- **Database**: [Database type and version]
- **Infrastructure**: [Cloud provider, hosting]
- **Other**: [Other relevant technologies]

## Key Metrics
- **Throughput**: [Requests per second]
- **Latency**: [P50, P95, P99]
- **Error Rate**: [Target error rate]
- **Availability**: [Target uptime percentage]

## Known Limitations
- [Limitation 1]
- [Limitation 2]
- [Limitation 3]

4. Architecture

Architecture Documentation

markdown

# Architecture

## System Architecture

### High-Level Architecture
```mermaid
graph TB
    subgraph "Service"
        Component1[Component 1]
        Component2[Component 2]
        Component3[Component 3]
    end
    
    subgraph "Data Layer"
        DB[(Database)]
        Cache[(Cache)]
    end
    
    subgraph "External"
        API[External API]
        Queue[Message Queue]
    end
    
    Component1 --> Component2
    Component2 --> Component3
    Component2 --> DB
    Component3 --> Cache
    Component1 --> API
    Component3 --> Queue

Components

Component 1: [Component Name]

Purpose: [What it does]
Technology: [Technology stack]
Dependencies: [What it depends on]
Interfaces: [APIs, protocols]
Scaling: [Horizontal/Vertical, limits]

Component 2: [Component Name]

Purpose: [What it does]
Technology: [Technology stack]
Dependencies: [What it depends on]
Interfaces: [APIs, protocols]
Scaling: [Horizontal/Vertical, limits]

Component 3: [Component Name]

Purpose: [What it does]
Technology: [Technology stack]
Dependencies: [What it depends on]
Interfaces: [APIs, protocols]
Scaling: [Horizontal/Vertical, limits]

Data Flow

Request Flow

Client → Load Balancer
Load Balancer → API Gateway
API Gateway → Service
Service → Database
Database → Service
Service → Client

Data Flow Diagram

mermaid

sequenceDiagram
    participant Client
    participant LB
    participant API
    participant DB
    
    Client->>LB: Request
    LB->>API: Forward Request
    API->>DB: Query
    DB-->>API: Data
    API-->>LB: Response
    LB-->>Client: Response

Integration Points

External Integrations

[Integration 1]: [Description, API docs link]
[Integration 2]: [Description, API docs link]
[Integration 3]: [Description, API docs link]

Internal Integrations

[Integration 1]: [Description, API docs link]
[Integration 2]: [Description, API docs link]
[Integration 3]: [Description, API docs link]


---

## 5. Deployment

### Deployment Procedures

```markdown
# Deployment

## Deployment Process

### Prerequisites
- [ ] Access to deployment environment
- [ ] Valid deployment credentials
- [ ] Deployment tools installed
- [ ] Database migrations prepared
- [ ] Feature flags configured
- [ ] Monitoring configured

### Deployment Steps

#### 1. Pre-Deployment Checks
```bash
# Run health checks
curl https://api.example.com/health

# Check database connectivity
./scripts/check-db.sh

# Verify configuration
./scripts/verify-config.sh

2. Build and Test

bash

# Build application
npm run build

# Run tests
npm test

# Run integration tests
npm run test:integration

3. Deploy

bash

# Deploy to staging
./scripts/deploy.sh staging

# Verify staging deployment
./scripts/verify.sh staging

# Deploy to production
./scripts/deploy.sh production

4. Post-Deployment Verification

bash

# Run smoke tests
./scripts/smoke-tests.sh

# Check metrics
./scripts/check-metrics.sh

# Verify logs
./scripts/check-logs.sh

Rollback Procedures

When to Rollback

Deployment fails
Critical errors detected
Performance degradation
Data corruption

Rollback Steps

bash

# Stop current deployment
./scripts/stop.sh

# Rollback to previous version
./scripts/rollback.sh

# Verify rollback
./scripts/verify.sh

# Start services
./scripts/start.sh

Deployment Environments

Development

URL: https://dev.example.com
Database: dev-db.example.com
Purpose: Development and testing
Data: Test data, can be reset

Staging

URL: https://staging.example.com
Database: staging-db.example.com
Purpose: Pre-production testing
Data: Production-like data

Production

URL: https://example.com
Database: prod-db.example.com
Purpose: Live production
Data: Real user data

Deployment Verification

Health Check Endpoint

bash

# Health check
curl https://api.example.com/health

# Expected response
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2024-01-15T10:00:00Z"
}

Smoke Tests

bash

# Run smoke tests
npm run smoke-tests

# Expected output
✓ API is responding
✓ Database is accessible
✓ Cache is working
✓ External integrations are functional


---

## 6. Monitoring

### Monitoring Setup

```markdown
# Monitoring

## Key Metrics

### Application Metrics
- **Request Rate**: Requests per second
- **Response Time**: P50, P95, P99 latency
- **Error Rate**: Percentage of failed requests
- **Throughput**: Requests handled per second

### System Metrics
- **CPU Usage**: Percentage of CPU utilization
- **Memory Usage**: Percentage of memory utilization
- **Disk Usage**: Percentage of disk utilization
- **Network I/O**: Network traffic in/out

### Business Metrics
- **Active Users**: Number of active users
- **Transaction Volume**: Number of transactions
- **Conversion Rate**: Percentage of conversions
- **Revenue**: Revenue generated

## Alert Thresholds

### Warning Thresholds
- CPU > 70%
- Memory > 80%
- Error rate > 1%
- Response time P95 > 500ms

### Critical Thresholds
- CPU > 90%
- Memory > 95%
- Error rate > 5%
- Response time P95 > 2000ms

## Monitoring Dashboards

### Main Dashboard
- **URL**: https://grafana.example.com/d/main
- **Metrics**: All key metrics
- **Refresh**: 30 seconds

### Error Dashboard
- **URL**: https://grafana.example.com/d/errors
- **Metrics**: Error rates, error types
- **Refresh**: 10 seconds

### Performance Dashboard
- **URL**: https://grafana.example.com/d/performance
- **Metrics**: Response times, throughput
- **Refresh**: 30 seconds

## Log Sources

### Application Logs
- **Location**: /var/log/app/
- **Format**: JSON
- **Retention**: 30 days
- **Access**: SSH to server

### Access Logs
- **Location**: /var/log/nginx/access.log
- **Format**: Combined
- **Retention**: 90 days
- **Access**: SSH to server

### Error Logs
- **Location**: /var/log/app/error.log
- **Format**: JSON
- **Retention**: 90 days
- **Access**: SSH to server

## Monitoring Tools

### Application Monitoring
- **Tool**: [APM tool name]
- **URL**: [Dashboard URL]
- **Access**: [Access requirements]

### Infrastructure Monitoring
- **Tool**: [Infrastructure tool name]
- **URL**: [Dashboard URL]
- **Access**: [Access requirements]

### Log Aggregation
- **Tool**: [Log aggregation tool name]
- **URL**: [Dashboard URL]
- **Access**: [Access requirements]

7. Alerts

Alert Configuration

markdown

# Alerts

## Alert Definitions

### Alert 1: High CPU Usage
- **Name**: High CPU Usage
- **Condition**: CPU > 90% for 5 minutes
- **Severity**: Critical
- **Notification**: PagerDuty, Slack
- **Escalation**: 15 minutes

### Alert 2: High Error Rate
- **Name**: High Error Rate
- **Condition**: Error rate > 5% for 5 minutes
- **Severity**: Critical
- **Notification**: PagerDuty, Slack
- **Escalation**: 10 minutes

### Alert 3: High Memory Usage
- **Name**: High Memory Usage
- **Condition**: Memory > 95% for 5 minutes
- **Severity**: Warning
- **Notification**: Slack
- **Escalation**: 30 minutes

### Alert 4: Slow Response Time
- **Name**: Slow Response Time
- **Condition**: P95 > 2000ms for 5 minutes
- **Severity**: Warning
- **Notification**: Slack
- **Escalation**: 30 minutes

## Severity Levels

### Critical
- **Definition**: Service is down or severely degraded
- **Response Time**: < 15 minutes
- **Notification**: PagerDuty, Slack, Email
- **Escalation**: 15 minutes

### High
- **Definition**: Service is degraded but functional
- **Response Time**: < 30 minutes
- **Notification**: Slack, Email
- **Escalation**: 30 minutes

### Medium
- **Definition**: Potential issue detected
- **Response Time**: < 1 hour
- **Notification**: Slack
- **Escalation**: 1 hour

### Low
- **Definition**: Informational alert
- **Response Time**: < 4 hours
- **Notification**: Email
- **Escalation**: None

## Notification Channels

### PagerDuty
- **Service**: [Service name]
- **Escalation Policy**: [Policy name]
- **On-Call Schedule**: [Schedule name]

### Slack
- **Channel**: #alerts-[service-name]
- **Webhook**: [Webhook URL]

### Email
- **Recipients**: [Email list]
- **Subject**: [Alert prefix]

## Escalation Paths

### Level 1: On-Call Engineer
- **Response Time**: 15 minutes
- **Contact**: [Contact method]
- **Capabilities**: Full access

### Level 2: Team Lead
- **Response Time**: 30 minutes
- **Contact**: [Contact method]
- **Capabilities**: Full access

### Level 3: Engineering Manager
- **Response Time**: 1 hour
- **Contact**: [Contact method]
- **Capabilities**: Full access

### Level 4: VP of Engineering
- **Response Time**: 2 hours
- **Contact**: [Contact method]
- **Capabilities**: Full access

8. Troubleshooting

Troubleshooting Procedures

markdown

# Troubleshooting

## Common Issues

### Issue 1: Service Not Responding

**Symptoms**
- 503 Service Unavailable errors
- Connection timeouts
- Health check failing

**Diagnostic Steps**
```bash
# Check if service is running
systemctl status [service-name]

# Check logs
tail -f /var/log/app/error.log

# Check CPU and memory
top

# Check network connectivity
ping [service-hostname]

Resolution Steps

bash

# Restart service
systemctl restart [service-name]

# Verify service is running
systemctl status [service-name]

# Check health endpoint
curl https://api.example.com/health

If Issue Persists

Check recent deployments
Review recent code changes
Check database connectivity
Check external dependencies
Escalate to team lead

Issue 2: High Error Rate

Symptoms

Increased error rate in monitoring
User complaints
Error logs increasing

Diagnostic Steps

bash

# Check error logs
tail -f /var/log/app/error.log

# Check error rate in monitoring
# Visit monitoring dashboard

# Check recent deployments
git log --oneline -10

# Check database status
./scripts/check-db.sh

Resolution Steps

bash

# If recent deployment, rollback
./scripts/rollback.sh

# If database issue, restart database
systemctl restart postgresql

# If external dependency issue, check status
curl https://external-api.com/health

If Issue Persists

Check external dependencies
Review recent code changes
Check for configuration changes
Escalate to team lead

Issue 3: Slow Response Time

Symptoms

Increased response times
User complaints
Performance degradation

Diagnostic Steps

bash

# Check CPU and memory
top

# Check database queries
./scripts/check-db-queries.sh

# Check network latency
ping [database-hostname]

# Check disk I/O
iostat -x 1

Resolution Steps

bash

# Restart service
systemctl restart [service-name]

# Clear cache
./scripts/clear-cache.sh

# If database issue, optimize queries
./scripts/optimize-db.sh

If Issue Persists

Check for resource contention
Review recent code changes
Check for external factors
Escalate to team lead

Issue 4: Database Connection Issues

Symptoms

Database connection errors
Slow queries
Timeouts

Diagnostic Steps

bash

# Check database status
systemctl status postgresql

# Check database logs
tail -f /var/log/postgresql/postgresql.log

# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Test database connectivity
psql -h [hostname] -U [username] -d [database]

Resolution Steps

bash

# Restart database
systemctl restart postgresql

# Check connection pool
./scripts/check-connection-pool.sh

# If connection limit reached, increase max_connections

If Issue Persists

Check database configuration
Review recent schema changes
Check for long-running queries
Escalate to DBA


---

## 9. Emergency Procedures

### Emergency Response

```markdown
# Emergency Procedures

## Critical Incident Response

### Step 1: Acknowledge Alert
- [ ] Acknowledge in monitoring system
- [ ] Update incident status
- [ ] Notify team

### Step 2: Assess Impact
- [ ] Determine scope of issue
- [ ] Identify affected users
- [ ] Estimate business impact

### Step 3: Initial Mitigation
- [ ] Implement quick fix if possible
- [ ] Disable affected features
- [ ] Route traffic to healthy instances

### Step 4: Root Cause Analysis
- [ ] Review logs
- [ ] Check metrics
- [ ] Identify root cause

### Step 5: Permanent Fix
- [ ] Implement fix
- [ ] Test fix
- [ ] Deploy fix

### Step 6: Verification
- [ ] Verify issue is resolved
- [ ] Monitor for recurrence
- [ ] Document incident

## Emergency Contacts

### On-Call Engineer
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]

### Team Lead
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]

### Engineering Manager
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]

### VP of Engineering
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]

## Communication Templates

### Incident Announcement

🚨 INCIDENT ANNOUNCEMENT 🚨

Service: [Service Name] Severity: [Critical/High/Medium/Low] Status: [Investigating/Identified/Monitoring/Resolved] Started: [Timestamp]

Description: [Brief description of the issue]

Impact: [Description of impact]

Next Update: [Time]


### Incident Update

📊 INCIDENT UPDATE 📊

Service: [Service Name] Status: [Investigating/Identified/Monitoring/Resolved]

Update: [Description of update]

Next Update: [Time]


### Incident Resolution

✅ INCIDENT RESOLVED ✅

Service: [Service Name] Duration: [Duration] Root Cause: [Root cause] Resolution: [Resolution applied]

Post-Incident Review: [Date and time of post-incident review]


## Escalation Procedures

### When to Escalate
- Issue not resolved in [X] minutes
- Severity level increases
- Multiple services affected
- Business impact critical

### Escalation Process
1. Notify next level in escalation path
2. Provide incident summary
3. Hand off incident ownership
4. Update incident status

10. Disaster Recovery

Disaster Recovery Plan

markdown

# Disaster Recovery

## Recovery Objectives

### Recovery Time Objective (RTO)
- **Critical Services**: 4 hours
- **Important Services**: 8 hours
- **Non-Critical Services**: 24 hours

### Recovery Point Objective (RPO)
- **Critical Services**: 15 minutes
- **Important Services**: 1 hour
- **Non-Critical Services**: 24 hours

## Backup Strategy

### Database Backups
- **Frequency**: Every 15 minutes
- **Retention**: 30 days
- **Location**: [Backup location]
- **Type**: Incremental + Full daily

### Application Backups
- **Frequency**: Daily
- **Retention**: 90 days
- **Location**: [Backup location]
- **Type**: Full

### Configuration Backups
- **Frequency**: On change
- **Retention**: 365 days
- **Location**: [Backup location]
- **Type**: Version control

## Recovery Procedures

### Database Recovery
```bash
# Stop application
systemctl stop [service-name]

# Restore from backup
pg_restore -d [database] [backup-file]

# Verify data
psql -d [database] -c "SELECT COUNT(*) FROM [table];"

# Start application
systemctl start [service-name]

# Verify application
curl https://api.example.com/health

Application Recovery

bash

# Restore application backup
./scripts/restore-app.sh [backup-id]

# Verify configuration
./scripts/verify-config.sh

# Start services
systemctl start [service-name]

# Verify health
curl https://api.example.com/health

Failover Procedures

Primary to Secondary

bash

# Promote secondary to primary
./scripts/promote-secondary.sh

# Update DNS
./scripts/update-dns.sh

# Verify failover
curl https://api.example.com/health

Secondary to Primary

bash

# Restore primary
./scripts/restore-primary.sh

# Update DNS
./scripts/update-dns.sh

# Verify restoration
curl https://api.example.com/health

Testing

Backup Testing

Frequency: Monthly
Scope: Random backup
Verification: Data integrity
Documentation: Test results

Recovery Testing

Frequency: Quarterly
Scope: Full recovery drill
Verification: Service functionality
Documentation: Test results

Failover Testing

Frequency: Semi-annually
Scope: Full failover
Verification: Service continuity
Documentation: Test results


---

## 11. Maintenance Procedures

### Maintenance Tasks

```markdown
# Maintenance

## Routine Maintenance

### Daily Tasks
- [ ] Review logs for errors
- [ ] Check disk space
- [ ] Verify backup completion
- [ ] Review monitoring alerts

### Weekly Tasks
- [ ] Review performance metrics
- [ ] Check for security updates
- [ ] Review system logs
- [ ] Clean up temporary files

### Monthly Tasks
- [ ] Review and rotate logs
- [ ] Update documentation
- [ ] Review capacity planning
- [ ] Test backup restoration

### Quarterly Tasks
- [ ] Review and update runbooks
- [ ] Conduct disaster recovery test
- [ ] Review security posture
- [ ] Update software versions

## Backup and Restore

### Backup Procedures
```bash
# Database backup
pg_dump -U [username] -d [database] > [backup-file].sql

# Application backup
tar -czf [backup-file].tar.gz /var/www/app

# Configuration backup
cp /etc/app/config.yaml /backups/config-$(date +%Y%m%d).yaml

Restore Procedures

bash

# Database restore
psql -U [username] -d [database] < [backup-file].sql

# Application restore
tar -xzf [backup-file].tar.gz -C /var/www/

# Configuration restore
cp /backups/config-[date].yaml /etc/app/config.yaml

Health Checks

Application Health Check

bash

# Run health check
curl https://api.example.com/health

# Expected response
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2024-01-15T10:00:00Z"
}

Database Health Check

bash

# Check database status
systemctl status postgresql

# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check database size
psql -c "SELECT pg_size_pretty(pg_database_size('[database]'));"

System Health Check

bash

# Check CPU
top -bn1 | grep "Cpu(s)"

# Check memory
free -h

# Check disk
df -h

# Check network
ping -c 4 [hostname]

Performance Tuning

Database Tuning

bash

# Analyze query performance
./scripts/analyze-queries.sh

# Update statistics
psql -c "ANALYZE;"

# Reindex if needed
psql -c "REINDEX DATABASE [database];"

Application Tuning

bash

# Review and update configuration
vim /etc/app/config.yaml

# Restart service
systemctl restart [service-name]

# Monitor performance
./scripts/monitor-performance.sh

System Tuning

bash

# Update system parameters
vim /etc/sysctl.conf

# Apply changes
sysctl -p

# Reboot if needed
reboot


---

## 12. Tools and Access

### Required Tools

```markdown
# Tools and Access

## Required Tools

### Command Line Tools
- **SSH**: For server access
- **curl**: For API testing
- **jq**: For JSON parsing
- **psql**: For database access

### Monitoring Tools
- **Grafana**: For dashboards
- **Prometheus**: For metrics
- **Kibana**: For log analysis

### Development Tools
- **Git**: For version control
- **Docker**: For container management
- **Kubernetes**: For orchestration

## Access Requirements

### Server Access
- **SSH Key**: Required for SSH access
- **VPN**: Required for remote access
- **MFA**: Required for authentication

### Database Access
- **Username**: [Database username]
- **Password**: [Password manager reference]
- **Connection String**: [Connection string]

### API Access
- **API Key**: [API key reference]
- **Authentication**: OAuth token
- **Base URL**: https://api.example.com

## Command References

### Service Management
```bash
# Start service
systemctl start [service-name]

# Stop service
systemctl stop [service-name]

# Restart service
systemctl restart [service-name]

# Check status
systemctl status [service-name]

# View logs
journalctl -u [service-name] -f

Database Commands

bash

# Connect to database
psql -h [hostname] -U [username] -d [database]

# Run query
psql -c "SELECT * FROM [table] LIMIT 10;"

# Import data
psql -d [database] < [file].sql

# Export data
pg_dump -U [username] -d [database] > [file].sql

Monitoring Commands

bash

# Check CPU
top

# Check memory
free -h

# Check disk
df -h

# Check processes
ps aux

Useful Scripts

Health Check Script

bash

#!/bin/bash
# health-check.sh

echo "Checking service health..."

# Check service status
systemctl is-active --quiet [service-name]
if [ $? -eq 0 ]; then
    echo "✓ Service is running"
else
    echo "✗ Service is not running"
    exit 1
fi

# Check health endpoint
curl -f https://api.example.com/health
if [ $? -eq 0 ]; then
    echo "✓ Health check passed"
else
    echo "✗ Health check failed"
    exit 1
fi

echo "All checks passed!"

Backup Script

bash

#!/bin/bash
# backup.sh

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups"

# Database backup
pg_dump -U [username] -d [database] > $BACKUP_DIR/db_$DATE.sql

# Application backup
tar -czf $BACKUP_DIR/app_$DATE.tar.gz /var/www/app

echo "Backup completed: $DATE"

Restore Script

bash

#!/bin/bash
# restore.sh

BACKUP_FILE=$1

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup-file>"
    exit 1
fi

# Restore database
psql -U [username] -d [database] < $BACKUP_FILE

echo "Restore completed"


---

## 13. Contacts

### Contact Information

```markdown
# Contacts

## On-Call Rotation

### Current On-Call
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]
- **Time Zone**: [Time zone]

### Upcoming Schedule
| Date | Name | Phone | Slack |
|------|------|-------|-------|
| [Date] | [Name] | [Phone] | [@username] |
| [Date] | [Name] | [Phone] | [@username] |
| [Date] | [Name] | [Phone] | [@username] |

## Subject Matter Experts

### Database
- **Name**: [Name]
- **Email**: [Email]
- **Slack**: [@username]
- **Expertise**: Database administration, optimization

### Security
- **Name**: [Name]
- **Email**: [Email]
- **Slack**: [@username]
- **Expertise**: Security, compliance

### Infrastructure
- **Name**: [Name]
- **Email**: [Email]
- **Slack**: [@username]
- **Expertise**: Infrastructure, networking

## Escalation Contacts

### Team Lead
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]

### Engineering Manager
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]

### VP of Engineering
- **Name**: [Name]
- **Phone**: [Phone number]
- **Email**: [Email]
- **Slack**: [@username]

## Support Channels

### Internal Support
- **Slack Channel**: #support-[service-name]
- **Email**: support@example.com
- **Ticket System**: https://support.example.com

### External Support
- **Vendor**: [Vendor name]
- **Support Email**: [Email]
- **Support Phone**: [Phone]
- **Support Portal**: [URL]

### Emergency Contacts
- **IT Security**: [Phone]
- **Data Center**: [Phone]
- **Cloud Provider**: [Phone]

14. Testing Runbooks

Runbook Testing Checklist

markdown

# Testing Runbooks

## Testing Checklist

### Accuracy Testing
- [ ] All commands work as documented
- [ ] All links are valid
- [ ] All contact information is current
- [ ] All procedures are complete

### Completeness Testing
- [ ] All scenarios are covered
- [ ] All edge cases are addressed
- [ ] All dependencies are documented
- [ ] All tools are listed

### Usability Testing
- [ ] Procedures are clear and concise
- [ ] Steps are in logical order
- [ ] Examples are helpful
- [ ] Troubleshooting is comprehensive

### Emergency Testing
- [ ] Emergency procedures work
- [ ] Escalation paths are clear
- [ ] Contact information is accurate
- [ ] Communication templates are usable

## Testing Schedule

### Regular Testing
- **Frequency**: Quarterly
- **Scope**: All procedures
- **Participants**: On-call team
- **Documentation**: Test results

### Emergency Testing
- **Frequency**: Semi-annually
- **Scope**: Emergency procedures
- **Participants**: All stakeholders
- **Documentation**: Test results and improvements

## Test Documentation

### Test Report Template
```markdown
# Runbook Test Report

**Date**: [Date]
**Tester**: [Name]
**Runbook**: [Runbook Name]

## Test Results

### Accuracy
- [ ] Commands tested
- [ ] Links verified
- [ ] Contacts validated

### Completeness
- [ ] Scenarios covered
- [ ] Edge cases addressed
- [ ] Dependencies documented

### Usability
- [ ] Procedures clear
- [ ] Steps logical
- [ ] Examples helpful

### Emergency
- [ ] Procedures work
- [ ] Escalation clear
- [ ] Contacts accurate

## Issues Found

1. [Issue description]
2. [Issue description]
3. [Issue description]

## Recommendations

1. [Recommendation]
2. [Recommendation]
3. [Recommendation]

## Next Review Date

[Date]


---

## 15. Templates

### Runbook Templates

```markdown
# Runbook Template

# [Service Name] Runbook

## Service Overview
[Fill in service overview]

## Architecture
[Fill in architecture details]

## Deployment
[Fill in deployment procedures]

## Monitoring
[Fill in monitoring details]

## Alerts
[Fill in alert configuration]

## Troubleshooting
[Fill in troubleshooting procedures]

## Emergency Procedures
[Fill in emergency procedures]

## Disaster Recovery
[Fill in disaster recovery plan]

## Maintenance
[Fill in maintenance procedures]

## Tools and Access
[Fill in tools and access details]

## Contacts
[Fill in contact information]

## Version History
| Version | Date | Author | Changes |
|---------|------|---------|---------|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |

16. Best Practices

Runbook Best Practices

markdown

# Best Practices

## 1. Keep Current
- Update regularly
- Review quarterly
- Test procedures
- Archive old versions

## 2. Be Clear
- Use simple language
- Avoid jargon
- Provide context
- Explain decisions

## 3. Be Complete
- Cover all scenarios
- Include edge cases
- Document dependencies
- Provide examples

## 4. Be Accessible
- Use standard formats
- Support search
- Provide navigation
- Include glossary

## 5. Be Practical
- Focus on what works
- Provide real solutions
- Include workarounds
- Share lessons learned

## 6. Be Collaborative
- Involve team
- Get feedback
- Update regularly
- Share knowledge

Quick Start

Basic Runbook Template

markdown

# Runbook: [Task Name]

## Overview
[Brief description of what this runbook covers]

## Prerequisites
- [ ] Requirement 1
- [ ] Requirement 2

## Steps

### Step 1: [Action]
1. Do this
2. Then this
3. Verify result

### Step 2: [Action]
[Detailed steps]

## Troubleshooting

### Problem: [Issue]
**Symptoms:** [What you see]
**Solution:** [How to fix]

## Rollback

If something goes wrong:
1. Step 1
2. Step 2

## Related Runbooks
- [Related runbook 1]
- [Related runbook 2]

Production Checklist

Clear Purpose: Each runbook has clear purpose and scope
Step-by-Step: Instructions are numbered and sequential
Prerequisites: All prerequisites listed and verified
Verification: Each step includes verification criteria
Troubleshooting: Common issues and solutions documented
Rollback: Rollback procedures included for risky operations
Testing: Runbooks tested by someone other than author
Updates: Runbooks reviewed and updated regularly
Accessibility: Runbooks easily accessible to team
Version Control: Runbooks versioned and tracked
Feedback: Mechanism for team to provide feedback
Links: Related runbooks and documentation linked

Anti-patterns

❌ Don't: Vague Instructions

markdown

# ❌ Bad - Unclear
## Step 1: Restart the service
[No details on how]

markdown

# ✅ Good - Specific
## Step 1: Restart the service
1. SSH into server: `ssh user@server`
2. Check service status: `systemctl status myservice`
3. Restart service: `sudo systemctl restart myservice`
4. Verify: `systemctl status myservice` (should show "active")

❌ Don't: Missing Prerequisites

markdown

# ❌ Bad - No prerequisites
## Steps
1. Run deployment script

markdown

# ✅ Good - Prerequisites listed
## Prerequisites
- [ ] Access to production environment
- [ ] Database backup completed
- [ ] Team notified of maintenance window

## Steps
1. Run deployment script

❌ Don't: No Troubleshooting

markdown

# ❌ Bad - No help when things go wrong
## Steps
1. Deploy application
[What if it fails?]

markdown

# ✅ Good - Troubleshooting included
## Steps
1. Deploy application

## Troubleshooting

### Problem: Deployment fails
**Symptoms:** Error message "Connection timeout"
**Solution:** 
1. Check network connectivity
2. Verify credentials
3. Retry with increased timeout

Integration Points

Technical Writing (21-documentation/technical-writing/) - Clear documentation
System Architecture (21-documentation/system-architecture-docs/) - System context
Incident Management (41-incident-management/) - Incident response runbooks

7. Be Tested

Test all procedures
Verify all commands
Check all links
Validate all contacts

8. Be Useful

Focus on audience
Answer questions
Solve problems
Enable decisions

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/runbooks
License: MIT License

Featured Tools

Join Our Newsletter

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Operational Runbooks

Overview

1. Executive Summary & Strategic Necessity

2. Technical Deep Dive (The "How-to")

3. Tooling & Tech Stack

4. Standards, Compliance & Security

5. Unit Economics & Performance Metrics (KPIs)

6. Strategic Recommendations (CTO Insights)

Core Concepts

1. Runbook Purpose

Why Runbooks Matter

2. Runbook Structure

Standard Runbook Structure

3. Service Overview

Service Overview Template

4. Architecture

Architecture Documentation

Components

Component 1: [Component Name]

Component 2: [Component Name]

Component 3: [Component Name]

Data Flow

Request Flow

Data Flow Diagram

Integration Points

External Integrations

Internal Integrations

2. Build and Test

3. Deploy

4. Post-Deployment Verification

Rollback Procedures

When to Rollback

Rollback Steps

Deployment Environments

Development

Staging

Production

Deployment Verification

Health Check Endpoint

Smoke Tests

7. Alerts

Alert Configuration

8. Troubleshooting

Troubleshooting Procedures

Issue 2: High Error Rate

Issue 3: Slow Response Time

Issue 4: Database Connection Issues

10. Disaster Recovery

Disaster Recovery Plan

Application Recovery

Failover Procedures

Primary to Secondary

Secondary to Primary

Testing

Backup Testing

Recovery Testing

Failover Testing

Restore Procedures

Health Checks

Application Health Check

Database Health Check

System Health Check

Performance Tuning

Database Tuning

Application Tuning

System Tuning

Database Commands

Monitoring Commands

Useful Scripts

Health Check Script

Backup Script

Restore Script

14. Testing Runbooks

Runbook Testing Checklist

16. Best Practices

Runbook Best Practices

Quick Start