Observability

Decision Framework: What to Instrument

Golden Signals (prefer for services): Latency, Traffic, Errors, Saturation

RED (request-scoped): Rate, Errors, Duration
USE (resource-scoped): Utilization, Saturation, Errors

Pick RED for microservices, USE for infrastructure. Don't mix.

Metric Design Opinions

Always use histograms over summaries for latency -- histograms are aggregatable, summaries are not
Bucket defaults [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5] cover most HTTP services
Exclude /health and /metrics endpoints from SLI calculations
Use recording rules for any query used in alerts or dashboards -- never put raw PromQL in alerts
Label cardinality kills Prometheus: never use user IDs, request IDs, or unbounded values as labels

Service Tier Classification

Tier	Availability	Latency P99	Examples
Critical	99.95%	100ms	Payment, auth
Essential	99.9%	500ms	Search, catalog
Standard	99.5%	1s	Recommendations
Best Effort	99.0%	2s	Batch, reporting

Assign tiers before writing SLOs. Tier drives alert routing and error budget policy.

SLO Framework

Error Budget Policy (non-negotiable escalation)

Budget Remaining	Action
>50%	Normal velocity
10-50%	Postpone risky changes
1-10%	Freeze non-critical changes
0%	Feature freeze, reliability only

Release Decision Matrix

Budget Status	Low Risk	Medium Risk	High Risk
Healthy	Approve	Approve	Review
Warning	Review	Defer	Block
Critical	Defer	Block	Block
Exhausted	Block	Block	Block

Burn Rate Alert Thresholds

Alert	Burn Rate	Short Window	Action
Fast burn	14.4x	1h + 5m	Page on-call
Slow burn	3x	6h + 30m	Create ticket

Multi-window burn rate is the only correct SLO alerting pattern. Single-window alerts produce false positives or miss slow degradation.

Progressive SLO Rollout

Start at 99.0% for 1 month baseline, then tighten: 99.5% (2 months) -> 99.9% (3 months) -> 99.95% (ongoing). Never set SLO tighter than current measured reliability.

SLO Templates

API service: availability (99.9% over 30d) + latency (95% of requests < 500ms over 30d) Data pipeline: freshness (99% batches within 30 min over 7d) + completeness (99.95% records processed over 7d)

Distributed Tracing Strategy

Sampling Decisions

Dev/staging: 100% sampling
Production low-traffic (<1k rps): 10-50% probabilistic
Production high-traffic (>10k rps): 1% probabilistic or rate-limit to ~100 traces/sec
Always use ParentBased sampler so child spans follow parent's decision
Force-sample all errors and high-latency requests regardless of probabilistic rate

Context Propagation

Use W3C traceparent header (not B3 or Jaeger-native) for new systems
Always inject trace_id into structured logs for correlation
Propagate context through async boundaries (queues, event buses) explicitly

Backend Choice

Tempo (Grafana stack): prefer when already using Grafana; object-storage backed, cheap at scale
Jaeger: prefer when you need standalone deployment or Elasticsearch integration
Both support OTLP -- always send via OpenTelemetry Collector, never direct from app to backend

Alerting Opinions

Alert on symptoms, not causes -- alert on error rate, not "pod restarted"
Severity levels: critical (pages), warning (tickets), info (dashboard only)
Every alert must have a runbook link in annotations
for: duration: critical >= 2m, warning >= 5m, info >= 15m -- prevents flapping
Route critical to PagerDuty, warning to Slack channel, info to dashboard only

Stack Preferences

Concern	Preferred Tool	Rationale
Metrics	Prometheus + Thanos/Mimir	De facto standard, PromQL ecosystem
Visualization	Grafana	Dashboard-as-code, multi-datasource
Tracing	Tempo or Jaeger via OTel	OTLP-native, cost-effective
Logs	Loki or OpenSearch	Loki for Grafana stack, OpenSearch for complex queries
Collector	OpenTelemetry Collector	Vendor-neutral pipeline, single agent
Alerting	Alertmanager	Native Prometheus integration

Gotchas

Prometheus rate() requires at least 2 data points in the window -- use [5m] minimum with 15s scrape interval
histogram_quantile is an estimate; accuracy degrades with poor bucket choices
OTel Collector batch processor default timeout is 200ms -- increase to 5-10s for production to reduce export overhead
Grafana dashboards without variables become unmaintainable past 3 services
Never scrape intervals faster than 10s in production -- it causes storage and CPU issues
Alertmanager grouping: group by alertname, namespace, service -- too broad silences everything, too narrow floods on-call

Search AI Tools

observability

Install this agent skill to your Project

SKILL.md