Agent skill

observability-monitoring

Structured logging, metrics, distributed tracing, and alerting strategies

Install this agent skill to your Project

npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/ariegoldkin/observability-monitoring

SKILL.md

Observability & Monitoring Skill

Guidance and templates for implementing observability: structured logging, metrics, distributed tracing, alerting, and health checks.

When to Use

  • Setting up application monitoring
  • Implementing structured logging
  • Adding metrics and dashboards
  • Configuring distributed tracing
  • Creating alerting rules
  • Debugging production issues

Three Pillars of Observability

┌─────────────────┬─────────────────┬─────────────────┐
│     LOGS        │     METRICS     │     TRACES      │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened   │ How the system  │ How requests    │
│ at a specific   │ is performing   │ flow through    │
│ point in time   │ over time       │ services        │
└─────────────────┴─────────────────┴─────────────────┘

Structured Logging

Log Levels

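┌───────┬─────────────────────────────────────────┐
│ Level │ Use Case                                │
├───────┼─────────────────────────────────────────┤
│ ERROR │ Unhandled exceptions, failed operations │
│ WARN  │ Deprecated API, retry attempts          │
│ INFO  │ Business events, successful operations  │
│ DEBUG │ Development troubleshooting             │
└───────┴─────────────────────────────────────────┘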

Best Practice

typescript
// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});

// Bad: string interpolation buries the fields in free text, so they cannot be filtered or aggregated
logger.info(`User ${user.id} completed purchase`);

See templates/structured-logging.ts for Winston setup and request middleware
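
To make the level table and the structured-versus-interpolated contrast concrete, here is a minimal, dependency-free sketch of a JSON logger with level filtering and child loggers that carry a correlation ID. It is not the Winston setup from the template; `createLogger`, `Logger`, `Context`, and the `service`/`requestId` field names are hypothetical names chosen for illustration.

```typescript
type Level = 'error' | 'warn' | 'info' | 'debug';
type Context = Record<string, unknown>;

interface Logger {
  error(message: string, context?: Context): void;
  warn(message: string, context?: Context): void;
  info(message: string, context?: Context): void;
  debug(message: string, context?: Context): void;
  child(extra: Context): Logger;
}

// Lower number = more severe, following the order of the level table.
const SEVERITY: Record<Level, number> = { error: 0, warn: 1, info: 2, debug: 3 };

function createLogger(
  minLevel: Level,
  base: Context = {},
  sink: (line: string) => void = console.log,
): Logger {
  const log = (level: Level, message: string, context: Context = {}): void => {
    if (SEVERITY[level] > SEVERITY[minLevel]) return; // below the configured threshold
    // One JSON object per line, so log tooling can index each field.
    sink(JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      message,
      ...base,
      ...context,
    }));
  };
  return {
    error: (m, c) => log('error', m, c),
    warn: (m, c) => log('warn', m, c),
    info: (m, c) => log('info', m, c),
    debug: (m, c) => log('debug', m, c),
    // A child logger adds fixed context (for example a request correlation ID).
    child: (extra) => createLogger(minLevel, { ...base, ...extra }, sink),
  };
}

// Hypothetical usage: every line from this request carries the same requestId.
const requestLogger = createLogger('info', { service: 'checkout' }).child({ requestId: 'req-123' });
requestLogger.info('User action completed', { action: 'purchase', duration_ms: 150 });
requestLogger.debug('Cart contents loaded'); // dropped: below the 'info' threshold
```

Passing the sink as a parameter keeps the sketch testable; a real setup would more likely rely on the transport configuration of the logging library in use.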

Metrics Collection

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Request latency distribution
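
As a rough illustration of what the three RED numbers mean, the sketch below computes them from an in-memory list of request samples over a fixed window. The `RequestSample` shape, the `summarizeRed` name, and the nearest-rank percentile method are assumptions made for this example; a production service would normally export counters and histograms and let the metrics backend do this arithmetic.

```typescript
interface RequestSample {
  durationMs: number;
  failed: boolean; // for example, a 5xx response
}

interface RedSummary {
  ratePerSec: number;   // Rate
  errorsPerSec: number; // Errors
  p50Ms: number;        // Duration (median)
  p95Ms: number;        // Duration (tail)
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] ?? 0;
}

function summarizeRed(samples: RequestSample[], windowSec: number): RedSummary {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const failures = samples.filter((s) => s.failed).length;
  return {
    ratePerSec: samples.length / windowSec,
    errorsPerSec: failures / windowSec,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}
```

Duration is reported as percentiles rather than a mean because a small number of slow requests can hide behind a healthy-looking average.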

Prometheus Buckets

typescript
// HTTP request latency histogram, bucket upper bounds in seconds
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency histogram, bucket upper bounds in seconds
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]

See templates/prometheus-metrics.ts for full metrics configuration
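
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts all observations. The sketch below reproduces that counting rule in plain TypeScript so the effect of a bucket layout is easier to see. `cumulativeBuckets` and the sample latencies are hypothetical and not part of any Prometheus client API.

```typescript
// HTTP latency bounds from the list above, in seconds.
const HTTP_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

// Returns cumulative counts keyed by upper bound, ending with "+Inf".
function cumulativeBuckets(bounds: number[], observations: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const bound of bounds) {
    // Each bucket includes everything at or below its bound, not just its own slice.
    counts.set(String(bound), observations.filter((v) => v <= bound).length);
  }
  counts.set('+Inf', observations.length);
  return counts;
}

// Six made-up request latencies in seconds; 7 s exceeds the largest bound.
const bucketCounts = cumulativeBuckets(HTTP_BUCKETS, [0.004, 0.03, 0.2, 0.2, 1.5, 7]);
for (const [upperBound, count] of bucketCounts) {
  console.log(`le="${upperBound}" ${count}`);
}
```

An observation above the largest bound is visible only in `+Inf`, so quantile estimates lose resolution there. That is one reason to make sure the top bucket sits above the latency you alert on.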

Distributed Tracing

OpenTelemetry Setup

Auto-instrument common libraries:

  • Express/HTTP
  • PostgreSQL
  • Redis

Manual Spans

typescript
tracer.startActiveSpan('processOrder', async (span) => {
  try {
    span.setAttribute('order.id', orderId);
    // ... work
  } finally {
    span.end(); // end the span even if the work throws
  }
});

See templates/opentelemetry-tracing.ts for full setup
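
Spans from different services join into one trace only if the trace context travels with each request. A common carrier is the W3C Trace Context `traceparent` header, which has the form `00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`. The sketch below parses and continues such a header without any tracing library, purely to show what gets propagated. `parseTraceparent`, `formatTraceparent`, `childContext`, and `randomHex` are hypothetical helpers, and `Math.random` is used only to keep the example dependency-free.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters, shared by every span in the trace
  spanId: string;   // 16 lowercase hex characters, unique per span
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  if (!traceId || !spanId || !flags) return null;
  // All-zero IDs are defined as invalid by the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// Illustration only: not a cryptographically strong source of IDs.
function randomHex(length: number): string {
  let out = '';
  for (let i = 0; i < length; i++) out += Math.floor(Math.random() * 16).toString(16);
  return out;
}

// Continue an incoming trace with a new span ID, or start a fresh trace.
function childContext(incoming: string | undefined): TraceContext {
  const parent = incoming ? parseTraceparent(incoming) : null;
  return {
    traceId: parent?.traceId ?? randomHex(32),
    spanId: randomHex(16),
    sampled: parent?.sampled ?? true,
  };
}

// The outgoing header keeps the caller's trace ID and carries this service's span ID.
const outgoing = formatTraceparent(
  childContext('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'),
);
console.log(outgoing);
```

With auto-instrumentation enabled, this propagation is normally handled for you; the sketch is mainly useful when debugging why spans from two services fail to join.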

Alerting Strategy

Severity Levels

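┌───────────────┬───────────────┬─────────────────────────┐
│ Level         │ Response Time │ Examples                │
├───────────────┼───────────────┼─────────────────────────┤
│ Critical (P1) │ < 15 min      │ Service down, data loss │
│ High (P2)     │ < 1 hour      │ Major feature broken    │
│ Medium (P3)   │ < 4 hours     │ Increased error rate    │
│ Low (P4)      │ Next day      │ Warnings                │
└───────────────┴───────────────┴─────────────────────────┘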

Key Alerts

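┌─────────────────┬────────────────────────┬──────────┐
│ Alert           │ Condition              │ Severity │
├─────────────────┼────────────────────────┼──────────┤
│ ServiceDown     │ up == 0 for 1m         │ Critical │
│ HighErrorRate   │ 5xx rate > 5% for 5m   │ Critical │
│ HighLatency     │ p95 > 2s for 5m        │ High     │
│ LowCacheHitRate │ hit rate < 70% for 10m │ Medium   │
└─────────────────┴────────────────────────┴──────────┘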

See templates/alerting-rules.yml for Prometheus alerting rules
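
Each condition in the alert table pairs a threshold with a "for" duration: the alert should fire only when the threshold has been breached continuously for that long, which filters out short spikes. The sketch below models that rule over a list of timestamped samples. `isFiring` and `Sample` are hypothetical names, and the logic is a simplified stand-in for how a rule evaluator tracks pending alerts, not the Prometheus implementation.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// Samples are assumed to be sorted by time, oldest first.
function isFiring(
  samples: Sample[],
  breaches: (value: number) => boolean,
  forMs: number,
): boolean {
  let runStart: number | null = null; // start of the current unbroken breach
  let latest: number | null = null;
  for (const s of samples) {
    if (breaches(s.value)) {
      if (runStart === null) runStart = s.timestampMs;
    } else {
      runStart = null; // condition cleared, so the pending period resets
    }
    latest = s.timestampMs;
  }
  return runStart !== null && latest !== null && latest - runStart >= forMs;
}

// HighErrorRate from the table: 5xx rate > 5% for 5m, sampled once per minute.
const MINUTE_MS = 60000;
const errorRates = [0.02, 0.06, 0.07, 0.08, 0.09, 0.06, 0.07].map(
  (value, i) => ({ timestampMs: i * MINUTE_MS, value }),
);
const highErrorRate = isFiring(errorRates, (v) => v > 0.05, 5 * MINUTE_MS);
console.log(highErrorRate); // true: the breach runs from minute 1 through minute 6
```

A single healthy sample in the middle of the window resets the run, which is why longer "for" durations reduce noise at the cost of slower detection.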

Health Checks

Kubernetes Probes

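┌───────────┬────────────────────┬──────────┐
│ Probe     │ Purpose            │ Endpoint │
├───────────┼────────────────────┼──────────┤
│ Liveness  │ Is app running?    │ /health  │
│ Readiness │ Ready for traffic? │ /ready   │
│ Startup   │ Finished starting? │ /startup │
└───────────┴────────────────────┴──────────┘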

Readiness Response

json
{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}

See templates/health-checks.ts for implementation
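
One possible way to derive the top-level `status` in the readiness response from the individual checks is sketched below. The aggregation policy (any failed check means `unhealthy`, any warning means `degraded`), the `warn` check status, the 503 status code for an unhealthy result, and the reading of `uptime` as seconds are all assumptions for illustration; the template may make different choices.

```typescript
type CheckStatus = 'pass' | 'warn' | 'fail';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface CheckResult {
  status: CheckStatus;
  latency_ms: number;
}

interface ReadinessResponse {
  status: OverallStatus;
  checks: Record<string, CheckResult>;
  version: string;
  uptime: number; // assumed to be seconds since process start
}

// Assumed policy: the worst individual check decides the overall status.
function overallStatus(checks: Record<string, CheckResult>): OverallStatus {
  const results = Object.values(checks);
  if (results.some((c) => c.status === 'fail')) return 'unhealthy';
  if (results.some((c) => c.status === 'warn')) return 'degraded';
  return 'healthy';
}

function buildReadiness(
  checks: Record<string, CheckResult>,
  version: string,
  startedAtMs: number,
  nowMs: number = Date.now(),
): { httpStatus: number; body: ReadinessResponse } {
  const status = overallStatus(checks);
  return {
    // Kubernetes HTTP probes treat 200-399 as success, so 503 marks the pod as not ready.
    httpStatus: status === 'unhealthy' ? 503 : 200,
    body: {
      status,
      checks,
      version,
      uptime: Math.floor((nowMs - startedAtMs) / 1000),
    },
  };
}
```

Returning 200 for `degraded` keeps the pod in rotation while still surfacing the problem in the response body. Whether that is appropriate depends on how critical the degraded dependency is.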

Observability Checklist

Implementation

  • JSON structured logging
  • Request correlation IDs
  • RED metrics (Rate, Errors, Duration)
  • Business metrics
  • Distributed tracing
  • Health check endpoints

Alerting

  • Service outage alerts
  • Error rate thresholds
  • Latency thresholds
  • Resource utilization alerts

Dashboards

  • Service overview
  • Error analysis
  • Performance metrics

Extended Thinking Triggers

Use Opus 4.5 extended thinking for:

  • Incident investigation - Correlating logs, metrics, traces
  • Alert tuning - Reducing noise, catching real issues
  • Architecture decisions - Choosing monitoring solutions
  • Performance debugging - Cross-service latency analysis

Templates Reference

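┌──────────────────────────┬─────────────────────────────────────────┐
│ Template                 │ Purpose                                 │
├──────────────────────────┼─────────────────────────────────────────┤
│ structured-logging.ts    │ Winston logger with request middleware  │
│ prometheus-metrics.ts    │ HTTP, DB, cache metrics with middleware │
│ opentelemetry-tracing.ts │ Distributed tracing setup               │
│ alerting-rules.yml       │ Prometheus alerting rules               │
│ health-checks.ts         │ Liveness, readiness, startup probes     │
└──────────────────────────┴─────────────────────────────────────────┘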
