monitoring-setup

Observability stack with Prometheus, Grafana, and alerting.

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/monitoring-setup

SKILL.md

Monitoring Setup

The Three Pillars

Pillar	Tool	Purpose
Metrics	Prometheus	Time-series data
Logs	Loki / ELK	Event records
Traces	Jaeger / Tempo	Request flow

Prometheus

yaml

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Grafana Dashboard

json

{
  "panels": [
    {
      "title": "Request Rate",
      "targets": [
        {
          "expr": "rate(http_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    }
  ]
}

Alert Rules

yaml

groups:
  - name: app
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning

Key Metrics

RED Method (Services)

Rate - Requests per second
Errors - Failed requests
Duration - Response time

USE Method (Resources)

Utilization - % busy
Saturation - Queue depth
Errors - Error count

SLIs/SLOs

SLI: 99th percentile latency < 200ms
SLO: 99.9% of requests meet SLI
Error Budget: 0.1% of requests can exceed SLI