Agent skill

slo-alerting

Define SLIs, SLOs, and implement burn-rate alerting

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/slo-alerting

SKILL.md

SLO Alerting

Define SLIs, set SLO targets, and alert on burn rate rather than raw error rate.

Concepts

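| Term | Definition | Example |
|------|------------|---------|
| SLI | Quantitative measure of service behavior | % successful requests |
| SLO | Target for an SLI | 99.9% success |
| Error Budget | Allowed failure (1 - SLO) | 0.1% ≈ 43 min per 30-day month |
| Burn Rate | Budget consumption speed relative to the SLO | 10x = 30-day budget exhausted in 3 days |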
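The two example figures in the table follow from simple arithmetic. The sketch below reproduces them, assuming a 30-day SLO window; the function names are illustrative and not part of this skill.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total outage the SLO tolerates within the window."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / burn_rate


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # 43.2
    print(days_to_exhaustion(10))                 # 3.0
```

A burn rate of 1x means the budget lasts exactly the length of the window; anything above 1x runs out early.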

Common SLIs

Availability: successful_requests / total_requests
Latency:      requests_under_threshold / total_requests
Error Rate:   error_requests / total_requests
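As a rough illustration, these ratios can be computed from raw request counts. The `RequestCounts` type and its fields are hypothetical, and the sketch assumes every request counts as either a success or an error, and that a window with no traffic counts as fully healthy.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    total: int                # all requests in the window
    successful: int           # requests that returned a non-error response
    under_threshold: int      # requests faster than the latency threshold


def availability(c: RequestCounts) -> float:
    # Assumption: no traffic means no failures, so report 1.0.
    return c.successful / c.total if c.total else 1.0


def latency_sli(c: RequestCounts) -> float:
    return c.under_threshold / c.total if c.total else 1.0


def error_rate(c: RequestCounts) -> float:
    # Assumption: every non-successful request is an error.
    return (c.total - c.successful) / c.total if c.total else 0.0


if __name__ == "__main__":
    counts = RequestCounts(total=1000, successful=999, under_threshold=950)
    print(availability(counts), latency_sli(counts), error_rate(counts))
```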

Burn Rate Alerting

Alert on how fast you're consuming budget, not raw error rate:

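| Alert Level | Burn Rate | Time to Exhaust (30-day budget) |
|-------------|-----------|---------------------------------|
| Page (critical) | 14.4x | ~2 days |
| Page (warning) | 6x | 5 days |
| Ticket (medium) | 3x | 10 days |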
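This skill does not say where the 14.4x, 6x, and 3x thresholds come from. One common derivation, offered here as an assumption rather than as this skill's stated method, picks a fraction of the 30-day budget that may be spent within an alert window and converts it to a burn rate. The budget fractions and window lengths below (2% in 1 hour, 5% in 6 hours, 10% in 1 day) are illustrative choices that happen to reproduce the table.

```python
def burn_rate_threshold(
    budget_fraction: float,
    alert_window_hours: float,
    slo_window_days: int = 30,
) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


if __name__ == "__main__":
    # (budget fraction, alert window in hours): assumed values, not from the skill
    for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 24)]:
        print(round(burn_rate_threshold(fraction, hours), 1))
    # prints 14.4, 6.0, 3.0
```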

Multi-Window Strategy

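Pair a long window with a short one to balance speed and noise. The long window confirms the burn is significant; the short window confirms it is still happening, so the alert resets soon after the problem stops: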

yaml
# rate_<window> is the error ratio over that window; budget = 1 - SLO target (0.001 for 99.9%).
# Critical: fast burn (14.4x over both 1h and 5m)
- alert: HighBurnRate_Critical
  expr: (rate_1h / budget > 14.4) and (rate_5m / budget > 14.4)
  labels:
    severity: critical

# Warning: slower burn (6x over both 6h and 30m)
- alert: HighBurnRate_Warning
  expr: (rate_6h / budget > 6) and (rate_30m / budget > 6)
  labels:
    severity: warning
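The same two-window condition can be sketched outside the rule file to see how it behaves. This is a minimal illustration, assuming the error ratios for each window are already available; `burn_rate` and `should_alert` are hypothetical helper names.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_alert(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float,
    threshold: float,
) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return (
        burn_rate(long_window_ratio, slo_target) > threshold
        and burn_rate(short_window_ratio, slo_target) > threshold
    )


if __name__ == "__main__":
    slo = 0.999
    # Ongoing incident: 2% errors over both the last 1h and the last 5m.
    print(should_alert(0.02, 0.02, slo, 14.4))    # True
    # Recovered incident: 1h still looks bad, but the last 5m is healthy.
    print(should_alert(0.02, 0.0005, slo, 14.4))  # False
```

The second case shows why the short window matters: without it, the alert would keep firing for up to an hour after recovery.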

Dashboard Essentials

  • Current burn rate
  • Error budget remaining (%)
  • Time until exhaustion at current rate
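The last two dashboard numbers can be derived from event counts and the current burn rate. The sketch below assumes a rolling 30-day window and that `total_events` and `bad_events` are counted over that same window; all names are illustrative.

```python
def budget_remaining(bad_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        return 0.0  # no traffic or a 100% target leaves nothing to spend
    return 1.0 - bad_events / allowed_bad


def days_until_exhaustion(
    remaining_fraction: float,
    current_burn_rate: float,
    window_days: int = 30,
) -> float:
    """Days the remaining budget lasts if the current burn rate continues."""
    if remaining_fraction <= 0:
        return 0.0
    if current_burn_rate <= 0:
        return float("inf")
    # A burn rate of 1x spends the whole budget in exactly window_days.
    return remaining_fraction * window_days / current_burn_rate


if __name__ == "__main__":
    remaining = budget_remaining(bad_events=400, total_events=1_000_000, slo_target=0.999)
    print(f"{remaining:.0%} of budget remaining")
    print(f"{days_until_exhaustion(remaining, current_burn_rate=6):.1f} days left at 6x")
```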

Anti-Patterns

  • Too many SLOs → SLO per user journey, not per endpoint
  • Alerting on raw error rate → Noisy, doesn't account for budget
  • No budget visualization → Teams don't understand burn rate

References

  • references/methodology/sli-slo-framework.md
