QA Observability & Performance Engineering (Dec 2025)

This skill provides execution-ready patterns for building observable, performant systems and using telemetry as part of QA workflows.

Core references: OpenTelemetry (Docs) and W3C Trace Context (Spec) for correlation; SLO/error budget guidance from the Google SRE Book (Service Level Objectives).

When to Use This Skill

Claude should invoke this skill when a user requests:

OpenTelemetry instrumentation and setup
Distributed tracing implementation (Jaeger, Tempo, Zipkin)
Metrics collection and dashboarding (Prometheus, Grafana)
Structured logging setup (Pino, Winston, structlog)
SLO/SLI definition and error budgets
Performance profiling and optimization
Capacity planning and resource forecasting
APM integration (Datadog, New Relic, Dynatrace)
Observability maturity assessment
Alert design and on-call runbooks
Performance budgeting (frontend and backend)
Cost-performance optimization
Production performance debugging

Core QA (Default)

What “Observability for QA” Means

Treat telemetry as a first-class test oracle and debugging substrate, not an ops-only concern.
Make every failure diagnosable by default: logs + metrics + traces with correlation IDs (OpenTelemetry, W3C Trace Context).

QA Signals (Use in CI and in Prod)

Logs: structured, redacted, correlated with request/trace IDs (avoid PII).
Metrics: SLIs for latency, availability, error rate, saturation; track tail latency (p95/p99).
Traces: end-to-end request paths; use traces to localize failures across services.

Synthetic vs RUM (How QA Uses Both)

Synthetic monitoring: deterministic probes that validate availability and critical flows from the outside.
RUM (real user monitoring): validates real-world performance and regressions; use as a feedback loop for test coverage and performance budgets.

SLIs/SLOs as Quality Gates

Use SLOs and error budgets to set reliability expectations and release gates (Service Level Objectives).
Example release gates [Inference]:
- Block deploy when error budget burn is above policy (fast + slow windows).
- Block deploy on sustained p99 regression beyond a defined budget.

Trace-Based Debugging for Test Failures

Every integration/E2E test should emit a correlation ID and capture a trace link on failure.
Prefer “find the failing span” over grepping logs; use logs to enrich the span narrative.

CI Economics (Telemetry Cost Control)

Sampling is a quality lever: keep 100% for errors, sample successes as needed [Inference].
Keep dashboards high-signal; alert on symptoms (SLO burn) not raw resource metrics.

Do / Avoid

Do:

Define an “observability readiness” bar before E2E and chaos tests.
Make test tooling attach IDs (request/trace) to failures for fast triage.

Avoid:

Logging secrets/PII or writing dashboards with no owners.
Alerting on “everything”; alert fatigue is a quality failure mode.

Quick Reference

Task	Tool/Framework	Command/Setup	When to Use
Distributed tracing	OpenTelemetry + Jaeger	Auto-instrumentation, manual spans	Microservices, debugging request flow
Metrics collection	Prometheus + Grafana	Expose /metrics endpoint, scrape config	Track latency, error rate, throughput
Structured logging	Pino (Node.js), structlog (Python)	JSON logs with trace ID	Production debugging, log aggregation
SLO/SLI definition	Prometheus queries, SLO YAML	Availability, latency, error rate SLIs	Reliability targets, error budgets
Performance profiling	Clinic.js, Chrome DevTools, cProfile	CPU/memory flamegraphs	Slow application, high resource usage
Load testing	k6, Artillery	Ramp-up, spike, soak tests	Capacity planning, performance validation
APM integration	Datadog, New Relic, Dynatrace	Agent installation, instrumentation	Full observability stack
Web performance	Lighthouse CI, Web Vitals	Performance budgets in CI/CD	Frontend optimization

Decision Tree: Observability Strategy

text

User needs: [Observability Task Type]
    ├─ Starting New Service?
    │   ├─ Microservices? → OpenTelemetry auto-instrumentation + Jaeger
    │   ├─ Monolith? → Structured logging (Pino/structlog) + Prometheus
    │   ├─ Frontend? → Web Vitals + performance budgets
    │   └─ All services? → Full stack (logs + metrics + traces)
    │
    ├─ Debugging Issues?
    │   ├─ Distributed system? → Distributed tracing (search by trace ID)
    │   ├─ Single service? → Structured logs (search by request ID)
    │   ├─ Performance problem? → CPU/memory profiling
    │   └─ Database slow? → Query profiling (EXPLAIN ANALYZE)
    │
    ├─ Reliability Targets?
    │   ├─ Define SLOs? → Availability, latency, error rate SLIs
    │   ├─ Error budgets? → Calculate allowed downtime per SLO
    │   ├─ Alerting? → Burn rate alerts (fast: 1h, slow: 6h)
    │   └─ Dashboard? → Grafana SLO dashboard with error budget
    │
    ├─ Performance Optimization?
    │   ├─ Find bottlenecks? → CPU/memory profiling
    │   ├─ Database queries? → EXPLAIN ANALYZE, indexing
    │   ├─ Frontend slow? → Lighthouse, Web Vitals analysis
    │   └─ Load testing? → k6 scenarios (ramp, spike, soak)
    │
    └─ Capacity Planning?
        ├─ Baseline metrics? → Collect 30 days of traffic data
        ├─ Load testing? → Test at 2x expected peak
        ├─ Forecasting? → Time series analysis (Prophet, ARIMA)
        └─ Cost optimization? → Right-size instances, spot instances

Navigation: Core Implementation Patterns

See resources/core-observability-patterns.md for detailed implementation guides:

OpenTelemetry End-to-End Setup - Complete instrumentation with Node.js/Python examples
- Three pillars of observability (logs, metrics, traces)
- Auto-instrumentation and manual spans
- OTLP exporters and collectors
- Production checklist
Distributed Tracing Strategy - Service-to-service trace propagation
- W3C Trace Context standard
- Sampling strategies (always-on, probabilistic, parent-based, adaptive)
- Cross-service correlation
- Trace backend configuration
SLO/SLI Design & Error Budgets - Reliability targets and alerting
- SLI definitions (availability, latency, error rate)
- Prometheus queries for SLIs
- Error budget calculation and policies
- Burn rate alerts (fast: 1h, slow: 6h)
Structured Logging - Production-ready JSON logs
- Log format with trace correlation
- Pino (Node.js) and structlog (Python) setup
- Log levels and what NOT to log
- Centralized aggregation (ELK, Loki, Datadog)
Performance Profiling - CPU, memory, database, frontend optimization
- Node.js profiling (Chrome DevTools, Clinic.js)
- Memory leak detection (heap snapshots)
- Database query profiling (EXPLAIN ANALYZE)
- Web Vitals and performance budgets
Capacity Planning - Scale planning and cost optimization
- Capacity formula and calculations
- Load testing with k6
- Resource forecasting (Prophet, ARIMA)
- Cost per request optimization

Navigation: Observability Maturity

See resources/observability-maturity-model.md for maturity assessment:

Level 1: Reactive (Firefighting) - Manual log grepping, hours to resolve
- Basic logging to files
- No structured logging or distributed tracing
- Progression: Centralize logs, add metrics
Level 2: Proactive (Monitoring) - Centralized logs, application alerts
- Structured JSON logs with ELK/Splunk
- Prometheus metrics and Grafana dashboards
- Progression: Add distributed tracing, define SLOs
Level 3: Predictive (Observability) - Unified telemetry, SLO-driven
- Distributed tracing (Jaeger, Tempo)
- Automatic trace-log-metric correlation
- SLO/SLI-based alerting with error budgets
- Progression: automated anomaly detection, automated remediation
Level 4: Automated Operations - Automation-driven, self-healing systems
- Automated anomaly detection and assisted root cause analysis
- Automated remediation with safe guards (feature flags, circuit breakers, rollbacks)
- Continuous optimization (performance budgets, capacity tuning)
- Chaos engineering for resilience (right-sized blast radius)

Maturity Assessment Tool - Rate your organization (0-5 per category):

Logging, metrics, tracing, alerting, incident response
MTTR/MTTD benchmarks by level
Recommended next steps

Navigation: Anti-Patterns & Best Practices

See resources/anti-patterns-best-practices.md for common mistakes:

Critical Anti-Patterns:

Logging Everything - Log bloat and high costs
No Sampling - 100% trace collection is expensive
Alert Fatigue - Too many noisy alerts
Ignoring Tail Latency - P99 matters more than average
No Error Budgets - Teams move too slow or too fast
Metrics Without Context - Dashboard mysteries
No Cost Tracking - Observability can cost 20% of infrastructure
Point-in-Time Profiling - Missing intermittent issues

Best Practices Summary:

Sample traces intelligently (100% errors, 1-10% success)
Alert on SLO burn rate, not infrastructure metrics
Track tail latency (P99, P999), not just averages
Use error budgets to balance velocity vs reliability
Add context to dashboards (baselines, annotations, SLOs)
Track observability costs, optimize aggressively
Continuous profiling for intermittent issues

Templates

See templates/ for copy-paste ready examples organized by domain and tech stack:

QA Checklists:

Observability Readiness Checklist - Logs/metrics/traces readiness for QA and fast debugging

OpenTelemetry Instrumentation:

Node.js/Express Setup - Auto-instrumentation, manual spans, Docker, K8s
Python/Flask Setup - Flask instrumentation, SQLAlchemy, deployment

Monitoring & SLO:

SLO YAML Template - Complete SLO definitions with error budgets
Prometheus Alert Rules - Burn rate alerts, multi-window monitoring
SLO Dashboard - SLI tracking, error budget visualization
Unified Observability Dashboard - Logs, metrics, traces in one view

Load Testing:

k6 Load Test Template - Ramp-up, spike, soak test scenarios
Artillery Load Test - YAML configuration, multiple scenarios

Performance Optimization:

Lighthouse CI Configuration - Performance budgets, CI/CD integration
Node.js Profiling Config - CPU/memory profiling, leak detection

Resources

See resources/ for deep-dive operational guides:

Core Observability Patterns - 6 implementation patterns with code examples
Observability Maturity Model - 4-level maturity framework with assessment
Anti-Patterns & Best Practices - 8 critical anti-patterns with solutions
OpenTelemetry Best Practices - Setup, sampling, attributes, context propagation
Distributed Tracing Patterns - Trace propagation, span design, debugging workflows
SLO Design Guide - SLI/SLO/SLA, error budgets, burn rate alerts
Performance Profiling Guide - CPU/memory profiling, database optimization, frontend performance

Optional: AI / Automation

Use AI assistance to reduce toil, not to replace the fundamentals of instrumentation, SLOs, and debugging.

Do:

Cluster similar alerts/incidents and propose likely suspects; verify via traces/logs/metrics.
Summarize incident timelines from telemetry to speed up postmortems.

Avoid:

“Black box” anomaly detection without explainability and a rollback plan.
Auto-remediation without guardrails and human review for high-severity actions.

External Resources

See data/sources.json for curated sources:

OpenTelemetry documentation and specifications
APM platforms (Datadog, New Relic, Dynatrace, Honeycomb)
Observability tools (Prometheus, Grafana, Jaeger, Tempo, Loki)
SRE books (Google SRE, Site Reliability Workbook)
Performance tooling (Lighthouse, k6, Clinic.js)
Web Vitals and Core Web Vitals

Related Skills

DevOps & Infrastructure:

../ops-devops-platform/SKILL.md - Kubernetes, Docker, CI/CD pipelines
../data-sql-optimization/SKILL.md - Database performance and optimization

Backend Development:

../software-backend/SKILL.md - Backend architecture and API design
../software-architecture-design/SKILL.md - System design patterns

Quality & Reliability:

../qa-resilience/SKILL.md - Circuit breakers, retries, graceful degradation
../qa-debugging/SKILL.md - Production debugging techniques
../qa-testing-strategy/SKILL.md - Testing strategies and automation

Quick Decision Matrix

Scenario	Recommendation
Starting new service	OpenTelemetry auto-instrumentation + structured logging
Debugging microservices	Distributed tracing with Jaeger/Tempo
Setting reliability targets	Define SLOs for availability, latency, error rate
Application is slow	CPU/memory profiling + database query analysis
Planning for scale	Load testing + capacity forecasting
High infrastructure costs	Cost per request analysis + right-sizing
Improving observability	Assess maturity level + 6-month roadmap
Frontend performance issues	Web Vitals monitoring + performance budgets

Usage Notes

When to Apply Patterns:

New service setup → Start with OpenTelemetry + structured logging + basic metrics
Microservices debugging → Use distributed tracing for full request visibility
Reliability requirements → Define SLOs first, then implement monitoring
Performance issues → Profile first, optimize second (measure before optimizing)
Capacity planning → Collect baseline metrics for 30 days before load testing
Observability maturity → Assess current level, plan 6-month progression

Common Workflows:

New Service: OpenTelemetry → Prometheus → Grafana → Define SLOs
Debugging: Search logs by trace ID → Click trace → See full request flow
Performance: Profile CPU/memory → Identify bottleneck → Optimize → Validate
Capacity Planning: Baseline metrics → Load test 2x peak → Forecast growth → Scale proactively

Optimization Priorities:

Correctness (logs, traces, metrics actually help debug)
Cost (optimize sampling, retention, cardinality)
Performance (observability overhead <5% latency)

Success Criteria: Systems are fully observable with unified telemetry (logs+metrics+traces), SLOs drive alerting and feature velocity, performance is proactively optimized, and capacity is planned ahead of demand.

Search AI Tools

qa-observability

Install this agent skill to your Project

SKILL.md