Agent skill
observability
Use when implementing metrics, tracing, SLOs, alerting, or dashboards. Covers Prometheus/Grafana/OTel stack design, SLI/SLO frameworks, error budgets, burn-rate alerting, and distributed tracing strategy.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/other/other/observability-jlaws-dotfiles
SKILL.md
Observability
Decision Framework: What to Instrument
Golden Signals (prefer for services): Latency, Traffic, Errors, Saturation
- RED (request-scoped): Rate, Errors, Duration
- USE (resource-scoped): Utilization, Saturation, Errors
Pick RED for microservices, USE for infrastructure. Don't mix.
Metric Design Opinions
- Always use histograms over summaries for latency -- histograms are aggregatable, summaries are not
- Bucket defaults `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]` cover most HTTP services
- Exclude `/health` and `/metrics` endpoints from SLI calculations
- Use recording rules for any query used in alerts or dashboards -- never put raw PromQL in alerts
- Label cardinality kills Prometheus: never use user IDs, request IDs, or unbounded values as labels
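To illustrate why histograms aggregate where summaries do not, here is a minimal Python sketch: cumulative bucket counts from several instances can be summed bucket by bucket, and a quantile can then be estimated from the merged counts by linear interpolation, in the spirit of what `histogram_quantile` does. The function names and the sample counts are hypothetical, and the bucket bounds are assumed to be in seconds.

```python
# Upper bounds from the bucket defaults above (assumed to be seconds), plus +Inf.
BOUNDS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, float("inf")]


def merge(*histograms):
    """Sum cumulative bucket counts from several instances, bucket by bucket."""
    return [sum(counts) for counts in zip(*histograms)]


def estimate_quantile(q, cumulative, bounds=BOUNDS):
    """Estimate quantile q (0 < q <= 1) by interpolating inside the target bucket."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if bounds[i] == float("inf"):
                # The +Inf bucket has no upper bound; fall back to the
                # highest finite bound.
                return bounds[i - 1]
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            if count == prev:
                return lower
            return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)
    return float("nan")


# Made-up cumulative counts for two instances of the same service.
instance_a = [0, 10, 40, 80, 95, 99, 100, 100, 100, 100]
instance_b = [0, 0, 10, 50, 80, 95, 99, 100, 100, 100]

merged = merge(instance_a, instance_b)
print("fleet-wide p99 estimate (s):", estimate_quantile(0.99, merged))
```

The estimate can only be as precise as the bucket that contains the target rank, which is why bucket boundaries should sit close to the latency thresholds you care about.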
Service Tier Classification
| Tier | Availability | Latency P99 | Examples |
|---|---|---|---|
| Critical | 99.95% | 100ms | Payment, auth |
| Essential | 99.9% | 500ms | Search, catalog |
| Standard | 99.5% | 1s | Recommendations |
| Best Effort | 99.0% | 2s | Batch, reporting |
Assign tiers before writing SLOs. Tier drives alert routing and error budget policy.
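As a rough illustration of what each tier's availability target means in practice, the following Python sketch converts the targets from the table into allowed downtime. The 30-day window and the helper name are assumptions, not part of the tier definitions.

```python
from datetime import timedelta

# Availability targets from the tier table above, in percent.
TIERS = {
    "Critical": 99.95,
    "Essential": 99.9,
    "Standard": 99.5,
    "Best Effort": 99.0,
}


def allowed_downtime(availability_pct, window=timedelta(days=30)):
    """Error budget expressed as wall-clock downtime over the window."""
    return window * (1 - availability_pct / 100)


for tier, pct in TIERS.items():
    minutes = allowed_downtime(pct).total_seconds() / 60
    print(f"{tier}: about {minutes:.1f} minutes per 30 days")
```

Under these assumptions the Critical tier leaves roughly 21.6 minutes of downtime per 30 days, while Best Effort leaves about 7.2 hours.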
SLO Framework
Error Budget Policy (non-negotiable escalation)
| Budget Remaining | Action |
|---|---|
| >50% | Normal velocity |
| 10-50% | Postpone risky changes |
| 1-10% | Freeze non-critical changes |
| 0% | Feature freeze, reliability only |
Release Decision Matrix
| Budget Status | Low Risk | Medium Risk | High Risk |
|---|---|---|---|
| Healthy | Approve | Approve | Review |
| Warning | Review | Defer | Block |
| Critical | Defer | Block | Block |
| Exhausted | Block | Block | Block |
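The two tables can be combined into a small lookup. The sketch below is in Python and rests on one assumption that the tables do not state: that Healthy, Warning, Critical, and Exhausted correspond to the >50%, 10-50%, 1-10%, and 0% bands of the error budget policy, with anything between 0% and 10% treated as Critical. All function names are hypothetical.

```python
# Release decision matrix from the table above.
DECISIONS = {
    "Healthy":   {"low": "Approve", "medium": "Approve", "high": "Review"},
    "Warning":   {"low": "Review",  "medium": "Defer",   "high": "Block"},
    "Critical":  {"low": "Defer",   "medium": "Block",   "high": "Block"},
    "Exhausted": {"low": "Block",   "medium": "Block",   "high": "Block"},
}


def budget_remaining(slo, good, total):
    """Fraction of error budget left, from good/total event counts in the SLO window."""
    if total == 0:
        return 1.0
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def budget_status(remaining):
    """Map remaining budget to a status (band boundaries are an assumption)."""
    if remaining <= 0:
        return "Exhausted"
    if remaining < 0.10:
        return "Critical"
    if remaining <= 0.50:
        return "Warning"
    return "Healthy"


def release_decision(remaining, risk):
    """Look up the decision for a change of the given risk: low, medium, or high."""
    return DECISIONS[budget_status(remaining)][risk]


# Example: 99.9% SLO, 600 failed requests out of 1,000,000 (1,000 allowed).
remaining = budget_remaining(0.999, good=999_400, total=1_000_000)
print(budget_status(remaining), release_decision(remaining, "medium"))
```

With 600 of 1,000 allowed failures spent, about 40% of the budget remains, so a medium-risk change would be deferred under this mapping.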
Burn Rate Alert Thresholds
| Alert | Burn Rate | Windows (long + short) | Action |
|---|---|---|---|
| Fast burn | 14.4x | 1h + 5m | Page on-call |
| Slow burn | 3x | 6h + 30m | Create ticket |
Multi-window burn rate is the only correct SLO alerting pattern. Single-window alerts produce false positives or miss slow degradation.
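A minimal Python sketch of the arithmetic behind these thresholds, assuming a 30-day SLO period. Burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO), and the alert fires only when both the long and the short window exceed the threshold. Function names are hypothetical.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1 - slo)


def should_alert(long_window_errors, short_window_errors, slo, threshold):
    """Fire only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets soon after recovery.
    """
    return (
        burn_rate(long_window_errors, slo) > threshold
        and burn_rate(short_window_errors, slo) > threshold
    )


def budget_consumed(burn, window_hours, period_hours=30 * 24):
    """Fraction of the period's budget spent if `burn` holds for the whole window."""
    return burn * window_hours / period_hours


slo = 0.999
# 2% errors over both the last 1h and the last 5m: burn rate is about 20x.
print(should_alert(0.02, 0.02, slo, threshold=14.4))
# Same 1h error ratio, but the last 5m has recovered: no page.
print(should_alert(0.02, 0.0005, slo, threshold=14.4))
print(budget_consumed(14.4, window_hours=1))  # fast burn
print(budget_consumed(3, window_hours=6))     # slow burn
```

Under the 30-day assumption, a 14.4x burn sustained for 1 hour spends about 2% of the budget, and a 3x burn sustained for 6 hours spends about 2.5%.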
Progressive SLO Rollout
Start at 99.0% for 1 month baseline, then tighten: 99.5% (2 months) -> 99.9% (3 months) -> 99.95% (ongoing). Never set SLO tighter than current measured reliability.
SLO Templates
- API service: availability (99.9% over 30d) + latency (95% of requests < 500ms over 30d)
- Data pipeline: freshness (99% batches within 30 min over 7d) + completeness (99.95% records processed over 7d)
Distributed Tracing Strategy
Sampling Decisions
- Dev/staging: 100% sampling
- Production low-traffic (<1k rps): 10-50% probabilistic
- Production high-traffic (>10k rps): 1% probabilistic or rate-limit to ~100 traces/sec
- Always use `ParentBased` sampler so child spans follow parent's decision
- Force-sample all errors and high-latency requests regardless of probabilistic rate
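The parent-based rule can be sketched in plain Python without any SDK. This is an illustration of the idea, not the OpenTelemetry implementation: a root span gets a deterministic probabilistic decision derived from its trace ID, and any span with a parent simply inherits the parent's decision. Function names are hypothetical, and force-sampling errors or slow requests is not covered here, since that decision typically has to be made after the request finishes.

```python
import secrets
from typing import Optional

_LOW_64_BITS = (1 << 64) - 1


def ratio_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    bound = int(ratio * (1 << 64))
    return (trace_id & _LOW_64_BITS) < bound


def parent_based(trace_id: int, ratio: float, parent_sampled: Optional[bool]) -> bool:
    """Follow the parent's decision if there is one; otherwise sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id, ratio)


# Root span in a high-traffic service: 1% probabilistic sampling.
trace_id = secrets.randbits(128)
root_decision = parent_based(trace_id, 0.01, parent_sampled=None)

# Downstream service with a different local ratio still follows the parent,
# so the trace is either complete or absent, never partial.
child_decision = parent_based(trace_id, 0.5, parent_sampled=root_decision)
print(root_decision == child_decision)
```

Because the decision depends only on the trace ID, services that share the same ratio reach the same answer independently, and the parent check keeps traces whole when ratios differ.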
Context Propagation
- Use W3C `traceparent` header (not B3 or Jaeger-native) for new systems
- Always inject `trace_id` into structured logs for correlation
- Propagate context through async boundaries (queues, event buses) explicitly
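A minimal Python sketch of the three points above, assuming version `00` of the W3C `traceparent` format (`version-traceid-parentid-flags`, lowercase hex). In a real service the OpenTelemetry propagators would handle this; the helper names here are hypothetical and exist only to show what travels across a queue and into a log line.

```python
import json
import re
import secrets

# Version 00 only: 32 hex chars of trace ID, 16 of parent span ID, 2 of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(ctx):
    """Build the header to attach to an outgoing request or queue message."""
    span_id = secrets.token_hex(8)  # new 16-hex-char span ID for this hop
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


def log_line(message, ctx):
    """Structured log entry carrying trace_id for correlation."""
    return json.dumps({"message": message, "trace_id": ctx["trace_id"]})


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)

# Explicit propagation across an async boundary: put the header in the
# message metadata so the consumer can continue the same trace.
queue_message = {"headers": {"traceparent": child_traceparent(ctx)}, "body": "order created"}
print(log_line("enqueued order event", ctx))
```

The consumer would call `parse_traceparent` on the message header and continue with the same `trace_id`, which is what ties its logs and spans back to the original request.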
Backend Choice
- Tempo (Grafana stack): prefer when already using Grafana; object-storage backed, cheap at scale
- Jaeger: prefer when you need standalone deployment or Elasticsearch integration
- Both support OTLP -- always send via OpenTelemetry Collector, never direct from app to backend
Alerting Opinions
- Alert on symptoms, not causes -- alert on error rate, not "pod restarted"
- Severity levels: `critical` (pages), `warning` (tickets), `info` (dashboard only)
- Every alert must have a runbook link in annotations
- `for:` duration: critical >= 2m, warning >= 5m, info >= 15m -- prevents flapping
- Route critical to PagerDuty, warning to Slack channel, info to dashboard only
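These conventions are easy to check mechanically. The Python sketch below lints an alerting rule represented as a dict with the usual Prometheus rule fields (`alert`, `expr`, `for`, `labels`, `annotations`). The annotation key `runbook_url`, the recording-rule name in the example, and the function names are assumptions made for illustration.

```python
import re

# Minimum `for:` durations per severity, in seconds, from the list above.
MIN_FOR_SECONDS = {"critical": 120, "warning": 300, "info": 900}
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse a simple duration such as '30s', '5m', or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def lint_rule(rule):
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    severity = rule.get("labels", {}).get("severity")
    if severity not in MIN_FOR_SECONDS:
        problems.append(f"unknown or missing severity: {severity!r}")
    elif parse_duration(rule.get("for", "0s")) < MIN_FOR_SECONDS[severity]:
        problems.append(f"'for' is shorter than the {severity} minimum")
    # 'runbook_url' is an assumed annotation key, not mandated by Prometheus.
    if not rule.get("annotations", {}).get("runbook_url"):
        problems.append("missing runbook link in annotations")
    return problems


rule = {
    "alert": "HighErrorRate",
    "expr": "service:http_errors:ratio_rate5m > 0.05",  # hypothetical recording rule
    "for": "30s",
    "labels": {"severity": "critical"},
    "annotations": {},
}
for problem in lint_rule(rule):
    print(f"{rule['alert']}: {problem}")
```

A check like this could run in CI against parsed rule files so that rules without a runbook or with a too-short `for:` never reach production.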
Stack Preferences
| Concern | Preferred Tool | Rationale |
|---|---|---|
| Metrics | Prometheus + Thanos/Mimir | De facto standard, PromQL ecosystem |
| Visualization | Grafana | Dashboard-as-code, multi-datasource |
| Tracing | Tempo or Jaeger via OTel | OTLP-native, cost-effective |
| Logs | Loki or OpenSearch | Loki for Grafana stack, OpenSearch for complex queries |
| Collector | OpenTelemetry Collector | Vendor-neutral pipeline, single agent |
| Alerting | Alertmanager | Native Prometheus integration |
Gotchas
- Prometheus `rate()` requires at least 2 data points in the window -- use `[5m]` minimum with 15s scrape interval
- `histogram_quantile` is an estimate; accuracy degrades with poor bucket choices
- OTel Collector `batch` processor default timeout is 200ms -- increase to 5-10s for production to reduce export overhead
- Grafana dashboards without variables become unmaintainable past 3 services
- Never set scrape intervals below 10s in production -- it causes storage and CPU issues
- Alertmanager grouping: group by `alertname, namespace, service` -- too broad silences everything, too narrow floods on-call
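The grouping trade-off in the last point can be seen with a few made-up alerts. The Python sketch below only mimics the idea of grouping by a set of label names; it is not how Alertmanager is configured, and the alert labels and function name are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the chosen labels; one notification per bucket."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Made-up firing alerts.
alerts = [
    {"alertname": "HighErrorRate", "namespace": "prod", "service": "checkout", "pod": "checkout-a"},
    {"alertname": "HighErrorRate", "namespace": "prod", "service": "checkout", "pod": "checkout-b"},
    {"alertname": "HighErrorRate", "namespace": "prod", "service": "search", "pod": "search-c"},
    {"alertname": "HighLatency", "namespace": "prod", "service": "checkout", "pod": "checkout-a"},
]

for group_by in (
    ["namespace"],                                  # too broad
    ["alertname", "namespace", "service"],          # recommended
    ["alertname", "namespace", "service", "pod"],   # too narrow
):
    print(group_by, "->", len(group_alerts(alerts, group_by)), "notification(s)")
```

Grouping by namespace alone folds unrelated problems into one notification, while adding `pod` produces one notification per replica of the same failure.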