Agent skill

qa-observability

Production observability and performance engineering with OpenTelemetry, distributed tracing, metrics, logging, SLO/SLI design, capacity planning, performance profiling, APM integration, and observability maturity progression for modern cloud-native systems.

Stars 163
Forks 31

Install this agent skill in your project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/testing/qa-observability-vasilyu1983-ai-agents-public

SKILL.md

Observability & Performance Engineering — Production-Ready Systems

This skill provides execution-ready patterns for building observable, performant systems. Claude should apply these patterns when users need observability stack setup, performance optimization, capacity planning, or production monitoring.

Modern Best Practices (2025): OpenTelemetry standard, distributed tracing, unified telemetry (logs+metrics+traces), SLO-driven alerting, eBPF-based observability, AI-assisted anomaly detection, performance budgets, and observability maturity models.


When to Use This Skill

Claude should invoke this skill when a user requests:

  • OpenTelemetry instrumentation and setup
  • Distributed tracing implementation (Jaeger, Tempo, Zipkin)
  • Metrics collection and dashboarding (Prometheus, Grafana)
  • Structured logging setup (Pino, Winston, structlog)
  • SLO/SLI definition and error budgets
  • Performance profiling and optimization
  • Capacity planning and resource forecasting
  • APM integration (Datadog, New Relic, Dynatrace)
  • Observability maturity assessment
  • Alert design and on-call runbooks
  • Performance budgeting (frontend and backend)
  • Cost-performance optimization
  • Production performance debugging

Quick Reference

| Task | Tool/Framework | Command/Setup | When to Use |
| --- | --- | --- | --- |
| Distributed tracing | OpenTelemetry + Jaeger | Auto-instrumentation, manual spans | Microservices, debugging request flow |
| Metrics collection | Prometheus + Grafana | Expose /metrics endpoint, scrape config | Track latency, error rate, throughput |
| Structured logging | Pino (Node.js), structlog (Python) | JSON logs with trace ID | Production debugging, log aggregation |
| SLO/SLI definition | Prometheus queries, SLO YAML | Availability, latency, error rate SLIs | Reliability targets, error budgets |
| Performance profiling | Clinic.js, Chrome DevTools, cProfile | CPU/memory flamegraphs | Slow application, high resource usage |
| Load testing | k6, Artillery | Ramp-up, spike, soak tests | Capacity planning, performance validation |
| APM integration | Datadog, New Relic, Dynatrace | Agent installation, instrumentation | Full observability stack |
| Web performance | Lighthouse CI, Web Vitals | Performance budgets in CI/CD | Frontend optimization |
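
As a concrete starting point for the first row, here is a minimal OpenTelemetry bootstrap for a Node.js service, assuming the standard @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, and OTLP/HTTP exporter packages; the service name and endpoint are illustrative.

```typescript
// tracing.ts -- load before the application code (e.g. `node --require ./tracing.js app.js`)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'checkout-service', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    // Default OTLP/HTTP endpoint of a local Collector or Jaeger with OTLP enabled
    url: 'http://localhost:4318/v1/traces',
  }),
  // Auto-instruments HTTP, Express, pg, redis, etc. without touching app code
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown so the last traces are not lost
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```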

Decision Tree: Observability Strategy

User needs: [Observability Task Type]
    ├─ Starting New Service?
    │   ├─ Microservices? → OpenTelemetry auto-instrumentation + Jaeger
    │   ├─ Monolith? → Structured logging (Pino/structlog) + Prometheus
    │   ├─ Frontend? → Web Vitals + performance budgets
    │   └─ All services? → Full stack (logs + metrics + traces)
    │
    ├─ Debugging Issues?
    │   ├─ Distributed system? → Distributed tracing (search by trace ID)
    │   ├─ Single service? → Structured logs (search by request ID)
    │   ├─ Performance problem? → CPU/memory profiling
    │   └─ Database slow? → Query profiling (EXPLAIN ANALYZE)
    │
    ├─ Reliability Targets?
    │   ├─ Define SLOs? → Availability, latency, error rate SLIs
    │   ├─ Error budgets? → Calculate allowed downtime per SLO
    │   ├─ Alerting? → Burn rate alerts (fast: 1h, slow: 6h)
    │   └─ Dashboard? → Grafana SLO dashboard with error budget
    │
    ├─ Performance Optimization?
    │   ├─ Find bottlenecks? → CPU/memory profiling
    │   ├─ Database queries? → EXPLAIN ANALYZE, indexing
    │   ├─ Frontend slow? → Lighthouse, Web Vitals analysis
    │   └─ Load testing? → k6 scenarios (ramp, spike, soak)
    │
    └─ Capacity Planning?
        ├─ Baseline metrics? → Collect 30 days of traffic data
        ├─ Load testing? → Test at 2x expected peak
        ├─ Forecasting? → Time series analysis (Prophet, ARIMA)
        └─ Cost optimization? → Right-size instances, spot instances
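
As a worked example of the error-budget branch above, the arithmetic behind a 99.9% availability SLO over a 30-day window looks like this (all numbers are illustrative):

```typescript
// Error budget and burn rate for a 99.9% availability SLO over a 30-day window.

const slo = 0.999;                    // availability target
const windowMinutes = 30 * 24 * 60;   // 43,200 minutes in the window

// Error budget: the fraction of requests (or time) allowed to fail.
const errorBudget = 1 - slo;                                 // ~0.001
const allowedDowntimeMinutes = errorBudget * windowMinutes;  // ~43.2 minutes

// Burn rate: how fast the budget is being consumed relative to the
// "exactly on budget" pace. 1.0 means the budget lasts the full window.
function burnRate(observedErrorRate: number): number {
  return observedErrorRate / errorBudget;
}

// Example: 1% of requests failing burns the budget ~10x faster than
// sustainable. The common fast-burn alert threshold of ~14.4 corresponds to
// exhausting 2% of the 30-day budget within one hour (per the SRE Workbook).
console.log(allowedDowntimeMinutes); // ~43.2
console.log(burnRate(0.01));         // ~10
```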

Navigation: Core Implementation Patterns

See resources/core-observability-patterns.md for detailed implementation guides:

  • OpenTelemetry End-to-End Setup - Complete instrumentation with Node.js/Python examples

    • Three pillars of observability (logs, metrics, traces)
    • Auto-instrumentation and manual spans
    • OTLP exporters and collectors
    • Production checklist
  • Distributed Tracing Strategy - Service-to-service trace propagation

    • W3C Trace Context standard
    • Sampling strategies (always-on, probabilistic, parent-based, adaptive)
    • Cross-service correlation
    • Trace backend configuration
  • SLO/SLI Design & Error Budgets - Reliability targets and alerting

    • SLI definitions (availability, latency, error rate)
    • Prometheus queries for SLIs
    • Error budget calculation and policies
    • Burn rate alerts (fast: 1h, slow: 6h)
  • Structured Logging - Production-ready JSON logs (see the Pino sketch after this list)

    • Log format with trace correlation
    • Pino (Node.js) and structlog (Python) setup
    • Log levels and what NOT to log
    • Centralized aggregation (ELK, Loki, Datadog)
  • Performance Profiling - CPU, memory, database, frontend optimization

    • Node.js profiling (Chrome DevTools, Clinic.js)
    • Memory leak detection (heap snapshots)
    • Database query profiling (EXPLAIN ANALYZE)
    • Web Vitals and performance budgets
  • Capacity Planning - Scale planning and cost optimization

    • Capacity formula and calculations
    • Load testing with k6
    • Resource forecasting (Prophet, ARIMA)
    • Cost per request optimization
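
A minimal sketch of the structured-logging pattern above, assuming Pino and the @opentelemetry/api package; field names such as order_id are illustrative:

```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

// JSON logs with the active trace/span IDs attached, so a log line found in
// the aggregator can be pivoted to the corresponding distributed trace.
const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  // mixin() runs on every log call and merges its result into the log object
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

// Example usage inside a request handler:
logger.info({ order_id: 'ord_123', duration_ms: 87 }, 'order processed');
// => {"level":30,"time":...,"trace_id":"...","span_id":"...","order_id":"ord_123",...}
```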

Navigation: Observability Maturity

See resources/observability-maturity-model.md for maturity assessment:

  • Level 1: Reactive (Firefighting) - Manual log grepping, hours to resolve

    • Basic logging to files
    • No structured logging or distributed tracing
    • Progression: Centralize logs, add metrics
  • Level 2: Proactive (Monitoring) - Centralized logs, application alerts

    • Structured JSON logs with ELK/Splunk
    • Prometheus metrics and Grafana dashboards
    • Progression: Add distributed tracing, define SLOs
  • Level 3: Predictive (Observability) - Unified telemetry, SLO-driven

    • Distributed tracing (Jaeger, Tempo)
    • Automatic trace-log-metric correlation
    • SLO/SLI-based alerting with error budgets
    • Progression: AI anomaly detection, self-healing
  • Level 4: Autonomous (AIOps) - AI-powered, self-healing systems

    • AI anomaly detection and root cause analysis
    • Predictive capacity planning
    • Auto-remediation and continuous optimization
    • Chaos engineering for resilience

Maturity Assessment Tool - Rate your organization (0-5 per category):

  • Logging, metrics, tracing, alerting, incident response
  • MTTR/MTTD benchmarks by level
  • Recommended next steps

Navigation: Anti-Patterns & Best Practices

See resources/anti-patterns-best-practices.md for common mistakes:

Critical Anti-Patterns:

  1. Logging Everything - Log bloat and high costs
  2. No Sampling - 100% trace collection is expensive
  3. Alert Fatigue - Too many noisy alerts
  4. Ignoring Tail Latency - P99 matters more than average
  5. No Error Budgets - Teams move too slowly or too recklessly
  6. Metrics Without Context - Dashboard mysteries
  7. No Cost Tracking - Observability can cost 20% of infrastructure
  8. Point-in-Time Profiling - Missing intermittent issues

Best Practices Summary:

  • Sample traces intelligently (100% errors, 1-10% success)
  • Alert on SLO burn rate, not infrastructure metrics
  • Track tail latency (P99, P999), not just averages
  • Use error budgets to balance velocity vs reliability
  • Add context to dashboards (baselines, annotations, SLOs)
  • Track observability costs, optimize aggressively
  • Continuous profiling for intermittent issues
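
A sketch of the intelligent-sampling practice above using head-based sampling in the OpenTelemetry Node SDK, assuming the samplers from @opentelemetry/sdk-trace-base; keeping 100% of error traces generally requires tail-based sampling in the Collector, as noted in the comments:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Head sampling: keep ~10% of new root traces, and always follow the parent's
// decision so a trace is never half-recorded across services.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...exporter and instrumentations as in the setup sketch earlier
});

sdk.start();

// Note: "keep 100% of errors" cannot be decided at the head, because the error
// is not known when the trace starts; that part is typically handled with
// tail-based sampling in the OpenTelemetry Collector (tail_sampling processor).
```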

Templates

See templates/ for copy-paste ready examples organized by domain and tech stack:

OpenTelemetry Instrumentation:

Monitoring & SLO:

Load Testing:

Performance Optimization:


Resources

See resources/ for deep-dive operational guides:


External Resources

See data/sources.json for curated sources:

  • OpenTelemetry documentation and specifications
  • APM platforms (Datadog, New Relic, Dynatrace, Honeycomb)
  • Observability tools (Prometheus, Grafana, Jaeger, Tempo, Loki)
  • SRE books (Google SRE, Site Reliability Workbook)
  • Performance tooling (Lighthouse, k6, Clinic.js)
  • Web Vitals and Core Web Vitals

Related Skills

DevOps & Infrastructure:

Backend Development:

Quality & Reliability:

AI/ML Operations:


Quick Decision Matrix

| Scenario | Recommendation |
| --- | --- |
| Starting new service | OpenTelemetry auto-instrumentation + structured logging |
| Debugging microservices | Distributed tracing with Jaeger/Tempo |
| Setting reliability targets | Define SLOs for availability, latency, error rate |
| Application is slow | CPU/memory profiling + database query analysis |
| Planning for scale | Load testing + capacity forecasting |
| High infrastructure costs | Cost per request analysis + right-sizing |
| Improving observability | Assess maturity level + 6-month roadmap |
| Frontend performance issues | Web Vitals monitoring + performance budgets |
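
For the last row, a minimal Web Vitals reporting snippet, assuming the web-vitals npm package; the /api/vitals endpoint is hypothetical:

```typescript
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

// Send each Core Web Vital to a collection endpoint as its value finalizes;
// sendBeacon survives page unload.
function report(metric: Metric): void {
  const body = JSON.stringify({
    name: metric.name,     // 'CLS' | 'INP' | 'LCP'
    value: metric.value,
    rating: metric.rating, // 'good' | 'needs-improvement' | 'poor'
    id: metric.id,
    page: location.pathname,
  });
  navigator.sendBeacon('/api/vitals', body);
}

onCLS(report);
onINP(report);
onLCP(report);
```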

Usage Notes

When to Apply Patterns:

  • New service setup → Start with OpenTelemetry + structured logging + basic metrics
  • Microservices debugging → Use distributed tracing for full request visibility
  • Reliability requirements → Define SLOs first, then implement monitoring
  • Performance issues → Profile first, optimize second (measure before optimizing)
  • Capacity planning → Collect baseline metrics for 30 days before load testing
  • Observability maturity → Assess current level, plan 6-month progression
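
For the capacity-planning note above, a back-of-the-envelope version of the capacity formula; all numbers are illustrative:

```typescript
// requiredInstances = peakRps / (perInstanceRps * targetUtilization), plus N+1 headroom.

const peakRps = 1200;            // from 30 days of baseline traffic (e.g. p99 of daily peaks)
const perInstanceRps = 150;      // measured by load testing a single instance
const targetUtilization = 0.6;   // run at ~60% so spikes and GC pauses can be absorbed

const base = Math.ceil(peakRps / (perInstanceRps * targetUtilization)); // 14
const withRedundancy = base + 1; // N+1 so one instance can fail or be drained

console.log({ base, withRedundancy }); // { base: 14, withRedundancy: 15 }
```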

Common Workflows:

  1. New Service: OpenTelemetry → Prometheus → Grafana → Define SLOs
  2. Debugging: Search logs by trace ID → Click trace → See full request flow
  3. Performance: Profile CPU/memory → Identify bottleneck → Optimize → Validate
  4. Capacity Planning: Baseline metrics → Load test 2x peak → Forecast growth → Scale proactively
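
A sketch of workflow 4's load test as a k6 script that ramps to roughly 2x expected peak and holds; the target URL, stage sizes, and thresholds are illustrative:

```typescript
// k6 script: ramp up, push to 2x peak, soak, ramp down. Thresholds fail the
// run (and the CI job) if the SLO-aligned targets are missed.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },  // ramp up to expected peak VUs
    { duration: '10m', target: 200 }, // push to 2x peak
    { duration: '10m', target: 200 }, // soak at 2x peak
    { duration: '5m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'], // p99 latency under 500 ms
    http_req_failed: ['rate<0.01'],   // error rate under 1%
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/health'); // hypothetical target
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```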

Optimization Priorities:

  1. Correctness (logs, traces, metrics actually help debug)
  2. Cost (optimize sampling, retention, cardinality)
  3. Performance (keep observability overhead below 5% of request latency)

Success Criteria: Systems are fully observable with unified telemetry (logs+metrics+traces), SLOs drive alerting and feature velocity, performance is proactively optimized, and capacity is planned ahead of demand.
