operating-production-services
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
Install this agent skill to your Project
npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/asmayaseen/operating-production-services
Operating Production Services
Production reliability patterns: measure what matters, learn from failures, improve systematically.
Quick Reference
| Need | Go To |
|---|---|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |
SLOs & Error Budgets
The Hierarchy
SLA (Contract) → SLO (Target) → SLI (Measurement)
Common SLIs
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
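The two ratios above can also be computed from raw counters outside Prometheus, for example in a report script. The following is a minimal sketch; the function names and the choice to treat a zero-traffic window as fully compliant are illustrative assumptions, not part of this skill.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return (total_requests - server_errors) / total_requests


def latency_sli(requests_under_threshold: int, total_requests: int) -> float:
    """Fraction of requests served below the latency threshold (e.g. 0.5 s)."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    return requests_under_threshold / total_requests


if __name__ == "__main__":
    print(f"availability: {availability_sli(1_000_000, 800):.4%}")
    print(f"latency:      {latency_sli(985_000, 1_000_000):.4%}")
```

Both functions return a value between 0 and 1, which can be compared directly against an SLO target such as 0.999.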
SLO Targets Reality Check
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |
Don't aim for 100%. Each additional nine cuts the allowed downtime tenfold, and the cost of achieving it typically rises steeply.
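The downtime figures in the table follow from simple arithmetic: allowed downtime is the window length multiplied by (1 - SLO). A minimal sketch, assuming a 30-day month and a 365-day year; the helper name `allowed_downtime` is illustrative.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime an SLO permits over the given window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    month = timedelta(days=30)   # assumption: 30-day month
    year = timedelta(days=365)   # assumption: 365-day year
    for slo in (99.0, 99.9, 99.95, 99.99):
        per_month = allowed_downtime(slo, month).total_seconds() / 60
        per_year = allowed_downtime(slo, year).total_seconds() / 3600
        print(f"{slo}%: {per_month:.1f} min/month, {per_year:.2f} h/year")
```

Running it for other windows (for example a 28-day rolling window) only requires passing a different `timedelta`.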
Error Budget
Error Budget = 1 - SLO Target
Example: a 99.9% SLO leaves a 0.1% error budget, which is about 43 minutes of downtime in a 30-day month.
Policy:
| Budget Remaining | Action |
|---|---|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% (exhausted) | Feature freeze, fix reliability |
See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
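The error budget policy can be expressed as a small lookup, which is convenient for deploy gates or dashboards. This is a sketch under assumptions: the function names are hypothetical, the budget is measured in events (requests) rather than minutes, and boundary values of exactly 10% and 50% are assigned to the "10-50%" row.

```python
def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    if total_events == 0:
        return 1.0  # assumption: no traffic means no budget was spent
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map remaining budget (0.0 to 1.0) to the action from the policy table."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:
        return "Postpone risky changes"
    return "Normal velocity"


if __name__ == "__main__":
    remaining = budget_remaining(0.999, good_events=999_400, total_events=1_000_000)
    print(f"{remaining:.0%} of budget left: {policy_action(remaining)}")
```

Returning a negative value when the budget is overspent keeps the size of the overrun visible instead of clamping it to zero.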
Postmortem Templates
The Blameless Principle
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
When to Write Postmortems
- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
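The criteria above can be encoded as a simple check, for instance in incident tooling that opens a postmortem ticket automatically. A minimal sketch; the `Incident` fields are hypothetical names chosen for illustration, and severity is assumed to be an integer where 1 means SEV1.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                      # 1 = SEV1, 2 = SEV2, 3 = SEV3, ...
    customer_facing_minutes: int = 0   # length of any customer-facing outage
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False     # near-miss that could have been severe
    novel_failure_mode: bool = False


def needs_postmortem(incident: Incident) -> bool:
    """Return True if any criterion from the list above applies."""
    return (
        incident.severity <= 2
        or incident.customer_facing_minutes > 15
        or incident.data_loss
        or incident.security_incident
        or incident.severe_near_miss
        or incident.novel_failure_mode
    )


if __name__ == "__main__":
    print(needs_postmortem(Incident(severity=3, customer_facing_minutes=20)))
    print(needs_postmortem(Incident(severity=3, customer_facing_minutes=5)))
```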
Standard Template
# Postmortem: [Incident Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX
## Executive Summary
One paragraph: what happened, impact, root cause, resolution.
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |
## Root Cause Analysis
### 5 Whys
1. Why did the service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]
## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X
## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
Quick Template (Minor Incidents)
# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3
## What Happened
One-sentence description.
## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution
## Root Cause
One sentence.
## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
Postmortem Meeting Guide
Structure (60 min)
- Opening (5 min) - Remind: "We're here to learn, not blame"
- Timeline (15 min) - Walk through events chronologically
- Analysis (20 min) - What failed? Why? What allowed it?
- Action Items (15 min) - Prioritize, assign owners, set dates
- Closing (5 min) - Summarize learnings, confirm owners
Facilitation Tips
- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants
Anti-Patterns
| Don't | Do Instead |
|---|---|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
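The "owner + date + ticket" rule is easy to check mechanically before a postmortem is closed. A minimal sketch, assuming action items are available as simple records; the `ActionItem` class and its field names are hypothetical and mirror the columns of the Action Items table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> List[str]:
    """List which of owner, due, and ticket are empty for this item."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


def orphaned_items(items: List[ActionItem]) -> List[ActionItem]:
    """Return the action items that lack an owner, a due date, or a ticket."""
    return [item for item in items if missing_fields(item)]


if __name__ == "__main__":
    items = [
        ActionItem("P0", "Roll back config", "@name", "2024-01-15", "XXX-123"),
        ActionItem("P1", "Add canary stage", owner="@name"),
    ]
    for item in orphaned_items(items):
        print(f"{item.action}: missing {', '.join(missing_fields(item))}")
```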
Verification
Run: python scripts/verify.py
References
- references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards