Agent skill

observability-monitoring

Structured logging, metrics, distributed tracing, and alerting strategies

Install this agent skill in your project

npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/ariegoldkin/observability-monitoring

SKILL.md

Observability & Monitoring Skill

Patterns and templates for implementing observability: structured logging, metrics, distributed tracing, alerting, and health checks.

When to Use

Setting up application monitoring
Implementing structured logging
Adding metrics and dashboards
Configuring distributed tracing
Creating alerting rules
Debugging production issues

Three Pillars of Observability

┌─────────────────┬─────────────────┬─────────────────┐
│     LOGS        │     METRICS     │     TRACES      │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened   │ How the system  │ How requests    │
│ at a specific   │ performs        │ flow through    │
│ point in time   │ over time       │ services        │
└─────────────────┴─────────────────┴─────────────────┘

Structured Logging

Log Levels

Level	Use Case
ERROR	Unhandled exceptions, failed operations
WARN	Deprecated API usage, retry attempts
INFO	Business events, successful operations
DEBUG	Development troubleshooting

Best Practice

typescript

// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});

// Bad: string interpolation buries the fields in free text, so they cannot be queried
logger.info(`User ${user.id} completed purchase`);

See templates/structured-logging.ts for Winston setup and request middleware
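
To make the contrast above concrete, here is a minimal sketch of what a structured logger emits: one JSON object per line, with the message and context as separate fields. It deliberately avoids Winston so it runs on its own; `formatLogLine`, `LogLevel`, and `LogContext` are hypothetical names for illustration and are not taken from templates/structured-logging.ts.

```typescript
// Hypothetical helper; not the Winston setup from the template.
type LogLevel = 'ERROR' | 'WARN' | 'INFO' | 'DEBUG';

type LogContext = Record<string, string | number | boolean | null>;

// Build one JSON log line. Context is spread first so the fixed fields
// (timestamp, level, message) cannot be overwritten by a context key.
function formatLogLine(
  level: LogLevel,
  message: string,
  context: LogContext = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...context,
    timestamp: now.toISOString(),
    level,
    message,
  });
}

// Example: the purchase event from the snippet above as a single JSON line.
// The IDs and the fixed timestamp are made-up sample values.
const purchaseLine = formatLogLine(
  'INFO',
  'User action completed',
  { action: 'purchase', userId: 'u-42', orderId: 'o-1001', duration_ms: 150 },
  new Date('2024-01-01T00:00:00.000Z'),
);
```

Because every field is a JSON key, a log backend can filter on `action` or aggregate `duration_ms` without parsing message text.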

Metrics Collection

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

Rate - Requests per second
Errors - Failed requests per second
Duration - Request latency distribution
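
As a rough illustration of the three RED numbers, the sketch below derives them from an in-memory list of request records for one time window. The `RequestRecord` shape, the choice of 5xx as "failed", and the p95 summary of duration are assumptions made for this example; a real service would normally let its metrics library and backend compute these.

```typescript
// Hypothetical record shape for one handled request.
interface RequestRecord {
  statusCode: number;
  durationSeconds: number;
}

interface RedSummary {
  ratePerSecond: number;   // Rate
  errorsPerSecond: number; // Errors (assumed here to mean 5xx responses)
  p95Seconds: number;      // Duration, summarized as the 95th percentile
}

// Nearest-rank percentile over a sorted copy of the samples; 0 when empty.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize the requests observed during a window of `windowSeconds`.
function summarizeRed(records: RequestRecord[], windowSeconds: number): RedSummary {
  const errorCount = records.filter((r) => r.statusCode >= 500).length;
  return {
    ratePerSecond: records.length / windowSeconds,
    errorsPerSecond: errorCount / windowSeconds,
    p95Seconds: percentile(records.map((r) => r.durationSeconds), 95),
  };
}
```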

Prometheus Buckets

typescript

// HTTP request latency (bucket upper bounds, in seconds)
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency (bucket upper bounds, in seconds)
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]

See templates/prometheus-metrics.ts for full metrics configuration
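
Bucket boundaries matter because a Prometheus histogram counts cumulatively: each bucket holds every observation less than or equal to its upper bound, and a final `+Inf` bucket holds all of them. The sketch below reproduces that counting in plain TypeScript for the HTTP buckets listed above, assuming the values are seconds; `cumulativeBucketCounts` is a hypothetical helper, not part of any Prometheus client.

```typescript
// Upper bounds copied from the HTTP latency buckets above (assumed seconds).
const HTTP_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

// Count observations per bucket the way a Prometheus histogram does:
// each bucket is cumulative (value <= upper bound), and the +Inf bucket
// counts every observation.
function cumulativeBucketCounts(
  observations: number[],
  upperBounds: number[],
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bound of upperBounds) {
    counts[String(bound)] = observations.filter((v) => v <= bound).length;
  }
  counts['+Inf'] = observations.length;
  return counts;
}

// Made-up sample latencies: one fast request, two mid-range, two slow.
const sampleCounts = cumulativeBucketCounts([0.02, 0.3, 0.3, 4, 7], HTTP_BUCKETS);
```

In this sample the 7-second request lands only in `+Inf`, which is a hint that the largest finite bucket may be too small for the traffic being measured.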

Distributed Tracing

OpenTelemetry Setup

Auto-instrument common libraries:

Express/HTTP
PostgreSQL
Redis

Manual Spans

typescript

tracer.startActiveSpan('processOrder', async (span) => {
  try {
    span.setAttribute('order.id', orderId);
    // ... work
  } finally {
    span.end(); // always end the span, even if the work throws
  }
});

See templates/opentelemetry-tracing.ts for full setup
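
Traces only connect across services if each request carries its trace context. The source does not say which propagation format its template uses; the sketch below assumes the W3C Trace Context `traceparent` header and shows how its four fields break down. `parseTraceparent` and `TraceContext` are hypothetical names, and the sketch accepts only version `00`; in practice the tracing SDK handles this for you.

```typescript
interface TraceContext {
  traceId: string;       // 32 lowercase hex characters
  parentSpanId: string;  // 16 lowercase hex characters
  sampled: boolean;      // lowest bit of the trace-flags byte
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse a traceparent header value; returns null for anything malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1];
  const parentSpanId = match[2];
  const flags = match[3];
  // All-zero trace or parent IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Sample header value for illustration.
const incoming = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
```

Logging the parsed `traceId` alongside each log line is one way to tie logs and traces for the same request together.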

Alerting Strategy

Severity Levels

Level	Response Time	Examples
Critical (P1)	< 15 min	Service down, data loss
High (P2)	< 1 hour	Major feature broken
Medium (P3)	< 4 hours	Increased error rate
Low (P4)	Next day	Warnings

Key Alerts

Alert	Condition	Severity
ServiceDown	`up == 0` for 1m	Critical
HighErrorRate	5xx responses > 5% of requests for 5m	Critical
HighLatency	p95 latency > 2s for 5m	High
LowCacheHitRate	Cache hit rate < 70% for 10m	Medium

See templates/alerting-rules.yml for Prometheus alerting rules
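
The "for 5m" part of each condition means the alert fires only after the condition has held continuously for that long, which filters out brief spikes. The sketch below models that rule in plain TypeScript over a series of evaluations; `Sample` and `isFiring` are hypothetical names, and the samples are assumed to be sorted by time. It illustrates the idea only and is not how Prometheus is implemented.

```typescript
// One evaluation of the alert expression at a point in time.
interface Sample {
  timestampSeconds: number;
  value: number;
}

// Returns true when the value has stayed above `threshold` continuously
// for at least `forSeconds`, up to and including the latest sample.
// Assumes `samples` is sorted by ascending timestamp.
function isFiring(samples: Sample[], threshold: number, forSeconds: number): boolean {
  let pendingSince: number | null = null;
  let lastTimestamp = 0;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (pendingSince === null) pendingSince = sample.timestampSeconds;
    } else {
      pendingSince = null; // condition cleared, so the clock restarts
    }
    lastTimestamp = sample.timestampSeconds;
  }
  return pendingSince !== null && lastTimestamp - pendingSince >= forSeconds;
}
```

For the HighErrorRate row, a call would look like `isFiring(samples, 0.05, 300)`, with each sample holding the 5xx ratio at one evaluation.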

Health Checks

Kubernetes Probes

Probe	Purpose	Endpoint
Liveness	Is app running?	`/health`
Readiness	Ready for traffic?	`/ready`
Startup	Finished starting?	`/startup`

Readiness Response

json

{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}

See templates/health-checks.ts for implementation
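
The readiness body above has to derive one overall status from the individual checks. The source does not state the policy, so the sketch below assumes one: a failed critical check makes the service unhealthy, while a failed non-critical check only degrades it. `buildReadiness`, `httpStatusFor`, and the `criticalChecks` parameter are hypothetical and not taken from templates/health-checks.ts.

```typescript
type CheckStatus = 'pass' | 'fail';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface CheckResult {
  status: CheckStatus;
  latency_ms: number;
}

interface ReadinessBody {
  status: OverallStatus;
  checks: Record<string, CheckResult>;
}

// Assumed policy (not stated in the source): any failed critical check makes
// the service unhealthy; a failed non-critical check only degrades it.
function buildReadiness(
  checks: Record<string, CheckResult>,
  criticalChecks: string[],
): ReadinessBody {
  let overall: OverallStatus = 'healthy';
  for (const checkName of Object.keys(checks)) {
    if (checks[checkName].status === 'pass') continue;
    if (criticalChecks.indexOf(checkName) !== -1) {
      overall = 'unhealthy';
      break;
    }
    overall = 'degraded';
  }
  return { status: overall, checks };
}

// HTTP status for the probe: Kubernetes treats 2xx and 3xx as success,
// so only the unhealthy state returns 503 and takes the pod out of rotation.
function httpStatusFor(overall: OverallStatus): number {
  return overall === 'unhealthy' ? 503 : 200;
}
```

With `criticalChecks` set to `['database']`, a Redis failure yields `degraded` with HTTP 200, while a database failure yields `unhealthy` with HTTP 503.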

Observability Checklist

Implementation

JSON structured logging
Request correlation IDs
RED metrics (Rate, Errors, Duration)
Business metrics
Distributed tracing
Health check endpoints

Alerting

Service outage alerts
Error rate thresholds
Latency thresholds
Resource utilization alerts

Dashboards

Service overview
Error analysis
Performance metrics

Extended Thinking Triggers

Use Opus 4.5 extended thinking for:

Incident investigation - Correlating logs, metrics, traces
Alert tuning - Reducing noise, catching real issues
Architecture decisions - Choosing monitoring solutions
Performance debugging - Cross-service latency analysis

Templates Reference

Template	Purpose
`structured-logging.ts`	Winston logger with request middleware
`prometheus-metrics.ts`	HTTP, DB, cache metrics with middleware
`opentelemetry-tracing.ts`	Distributed tracing setup
`alerting-rules.yml`	Prometheus alerting rules
`health-checks.ts`	Liveness, readiness, startup probes

Maintainer

aiskillstore Core maintainer

Source details

Full Name: aiskillstore/marketplace
Branch: main
Path in repo: skills/ariegoldkin/observability-monitoring
Topics: claude-code claude codex-skills skills codex claude-skills ai-skills
