Agent skill
slo-implementation
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
Install this agent skill to your Project
npx add-skill https://github.com/Microck/ordinary-claude-skills/tree/main/skills_categorized/containers/slo-implementation
SKILL.md
SLO Implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
When to Use
- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals
SLI/SLO/SLA Hierarchy
SLA (Service Level Agreement)
↓ Contract with customers
SLO (Service Level Objective)
↓ Internal reliability target
SLI (Service Level Indicator)
↓ Actual measurement
Defining SLIs
Common SLI Types
1. Availability SLI
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
2. Latency SLI
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
3. Durability SLI
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
Reference: See references/slo-definitions.md
Setting SLO Targets
Availability SLO Examples
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
Choose Appropriate SLOs
Consider:
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks
Example SLOs:
slos:
- name: api_availability
target: 99.9
window: 28d
sli: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
- name: api_latency_p95
target: 99
window: 28d
sli: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
Error Budget Calculation
Error Budget Formula
Error Budget = 1 - SLO Target
Example:
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%
Error Budget Policy
error_budget_policy:
- remaining_budget: 100%
action: Normal development velocity
- remaining_budget: 50%
action: Consider postponing risky changes
- remaining_budget: 10%
action: Freeze non-critical changes
- remaining_budget: 0%
action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
SLO Implementation
Prometheus Recording Rules
# SLI Recording Rules
groups:
- name: sli_rules
interval: 30s
rules:
# Availability SLI
- record: sli:http_availability:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency SLI (requests < 500ms)
- record: sli:http_latency:ratio
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
- name: slo_rules
interval: 5m
rules:
# SLO compliance (1 = meeting SLO, 0 = violating)
- record: slo:http_availability:compliance
expr: sli:http_availability:ratio >= bool 0.999
- record: slo:http_latency:compliance
expr: sli:http_latency:ratio >= bool 0.99
# Error budget remaining (percentage)
- record: slo:http_availability:error_budget_remaining
expr: |
(sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
# Error budget burn rate
- record: slo:http_availability:burn_rate_5m
expr: |
(1 - (
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)) / (1 - 0.999)
SLO Alerting Rules
groups:
- name: slo_alerts
interval: 1m
rules:
# Fast burn: 14.4x rate, 1 hour window
# Consumes 2% error budget in 1 hour
- alert: SLOErrorBudgetBurnFast
expr: |
slo:http_availability:burn_rate_1h > 14.4
and
slo:http_availability:burn_rate_5m > 14.4
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error budget burning at {{ $value }}x rate"
# Slow burn: 6x rate, 6 hour window
# Consumes 5% error budget in 6 hours
- alert: SLOErrorBudgetBurnSlow
expr: |
slo:http_availability:burn_rate_6h > 6
and
slo:http_availability:burn_rate_30m > 6
for: 15m
labels:
severity: warning
annotations:
summary: "Slow error budget burn detected"
description: "Error budget burning at {{ $value }}x rate"
# Error budget exhausted
- alert: SLOErrorBudgetExhausted
expr: slo:http_availability:error_budget_remaining < 0
for: 5m
labels:
severity: critical
annotations:
summary: "SLO error budget exhausted"
description: "Error budget remaining: {{ $value }}%"
SLO Dashboard
Grafana Dashboard Structure:
┌────────────────────────────────────┐
│ SLO Compliance (Current) │
│ ✓ 99.95% (Target: 99.9%) │
├────────────────────────────────────┤
│ Error Budget Remaining: 65% │
│ ████████░░ 65% │
├────────────────────────────────────┤
│ SLI Trend (28 days) │
│ [Time series graph] │
├────────────────────────────────────┤
│ Burn Rate Analysis │
│ [Burn rate by time window] │
└────────────────────────────────────┘
Example Queries:
# Current SLO compliance
sli:http_availability:ratio * 100
# Error budget remaining
slo:http_availability:error_budget_remaining
# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
(1 - sli:http_availability:ratio) * (1 - 0.999)
Multi-Window Burn Rate Alerts
# Combination of short and long windows reduces false positives
rules:
- alert: SLOBurnRateHigh
expr: |
(
slo:http_availability:burn_rate_1h > 14.4
and
slo:http_availability:burn_rate_5m > 14.4
)
or
(
slo:http_availability:burn_rate_6h > 6
and
slo:http_availability:burn_rate_30m > 6
)
labels:
severity: critical
SLO Review Process
Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact
Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments
Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements
Best Practices
- Start with user-facing services
- Use multiple SLIs (availability, latency, etc.)
- Set achievable SLOs (don't aim for 100%)
- Implement multi-window alerts to reduce noise
- Track error budget consistently
- Review SLOs regularly
- Document SLO decisions
- Align with business goals
- Automate SLO reporting
- Use SLOs for prioritization
Reference Files
assets/slo-template.md- SLO definition templatereferences/slo-definitions.md- SLO definition patternsreferences/error-budget.md- Error budget calculations
Related Skills
prometheus-configuration- For metric collectiongrafana-dashboards- For SLO visualization
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
nondominium-holochain-dna-dev
Specialized skill for nondominium Holochain DNA development, focusing on zome creation, entry patterns, integrity/coordinator architecture, ValueFlows compliance, and WASM optimization. Use when creating new zomes, implementing entry types, or modifying Holochain DNA code.
fluidsim
Framework for computational fluid dynamics simulations using Python. Use when running fluid dynamics simulations including Navier-Stokes equations (2D/3D), shallow water equations, stratified flows, or when analyzing turbulence, vortex dynamics, or geophysical flows. Provides pseudospectral methods with FFT, HPC support, and comprehensive output analysis.
metabolomics-workbench-database
Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.
run-tests
Validate code changes by intelligently selecting and running the appropriate test suites. Use this when editing code to verify changes work correctly, run tests, validate functionality, or check for regressions. Automatically discovers affected test suites, selects the minimal set of venvs needed for validation, and handles test execution with Docker services as needed.
skill-navigator
The 100th skill! Your intelligent guide to all 99 other skills. Recommends the perfect skill for any task, creates skill combinations, and helps you discover capabilities you didn't know you had.
AgentDB Advanced Features
Master advanced AgentDB features including QUIC synchronization, multi-database management, custom distance metrics, hybrid search, and distributed systems integration. Use when building distributed AI systems, multi-agent coordination, or advanced vector search applications.
Didn't find tool you were looking for?