Agent skill
monitoring-setup
Observability stack with Prometheus, Grafana, and alerting.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/monitoring-setup
SKILL.md
Monitoring Setup
The Three Pillars
| Pillar | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus | Time-series data |
| Logs | Loki / ELK | Event records |
| Traces | Jaeger / Tempo | Request flow |
Prometheus
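```yaml
# prometheus.yml
global:
  scrape_interval: 15s  # default scrape frequency for every job

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod  # discover pods through the Kubernetes API
    relabel_configs:
      # Keep only pods whose prometheus.io/scrape annotation is "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```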
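The `keep` rule above only retains pods that opt in through an annotation. As a minimal sketch of the other side of that contract, a pod manifest might look like the following. The pod name, image, and port are hypothetical placeholders; only the `prometheus.io/scrape` annotation is implied by the scrape config.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app              # hypothetical name
  annotations:
    # Annotation values are strings, so the quotes are required.
    # Prometheus exposes this as
    # __meta_kubernetes_pod_annotation_prometheus_io_scrape="true".
    prometheus.io/scrape: "true"
spec:
  containers:
    - name: app
      image: example/app:1.0     # hypothetical image
      ports:
        - containerPort: 8080    # hypothetical port serving /metrics
```

Pods without the annotation, or with any value other than `true`, are dropped before scraping.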
Grafana Dashboard
```json
{
  "panels": [
    {
      "title": "Request Rate",
      "targets": [
        {
          "expr": "rate(http_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    }
  ]
}
```
Alert Rules
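```yaml
groups:
  - name: app
    rules:
      # Fires when any single series sustains more than 0.1 5xx responses
      # per second. This is an absolute rate, not a percentage of traffic.
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      # Fires when a container has restarted recently.
      # The restart counter is exported by kube-state-metrics.
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
```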
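Alert rules only take effect once Prometheus loads the rule file and knows where to send firing alerts. A minimal sketch of the matching `prometheus.yml` entries, assuming the rules are saved as `alerts.yml` in the same directory and an Alertmanager is reachable at `alertmanager:9093` (both names are hypothetical):

```yaml
# prometheus.yml (additions)
rule_files:
  - alerts.yml                   # hypothetical file holding the groups above

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # hypothetical Alertmanager address
```

The `severity` labels set on each rule travel with the alert, so Alertmanager can route `critical` and `warning` alerts to different receivers.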
Key Metrics
RED Method (Services)
- Rate - Requests per second
- Errors - Failed requests
- Duration - Response time
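The three RED signals map fairly directly onto PromQL. The recording rules below are a sketch, not a definitive setup: they reuse `http_requests_total` from the alert rules, and assume a latency histogram named `http_request_duration_seconds` that the document does not mention. The rule names and the grouping by `job` are illustrative choices.

```yaml
groups:
  - name: red
    rules:
      # Rate: requests per second, per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Errors: fraction of requests that returned a 5xx status
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # Duration: 99th percentile latency from the assumed histogram
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```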
USE Method (Resources)
- Utilization - % busy
- Saturation - Queue depth
- Errors - Error count
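For host resources, the USE signals can be approximated from node_exporter metrics. This sketch assumes node_exporter is deployed and scraped, which the document does not state. It covers CPU for utilization and saturation and the network for errors; the rule names are illustrative.

```yaml
groups:
  - name: use
    rules:
      # Utilization: fraction of CPU time spent outside the idle mode
      - record: instance:node_cpu_utilization:ratio_rate5m
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Saturation: 1-minute load average divided by the number of CPUs
      - record: instance:node_load1_per_cpu:ratio
        expr: |
          avg by (instance) (node_load1)
            /
          count by (instance) (node_cpu_seconds_total{mode="idle"})
      # Errors: network receive errors per second
      - record: instance:node_network_receive_errs:rate5m
        expr: sum by (instance) (rate(node_network_receive_errs_total[5m]))
```

A per-CPU load ratio that stays above 1 suggests work is queuing for CPU time.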
SLIs/SLOs
SLI: proportion of requests served in under 200ms
SLO: 99.9% of requests meet the 200ms threshold
Error Budget: 0.1% of requests can exceed 200ms
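With a 99.9% objective, at most 1,000 out of every 1,000,000 requests may take longer than 200ms. One way to track this in Prometheus is sketched below. It assumes a latency histogram named `http_request_duration_seconds` with a bucket boundary at exactly 0.2 seconds; neither is given in the document, and the rule and alert names are hypothetical.

```yaml
groups:
  - name: slo
    rules:
      # SLI: fraction of requests completed within 200ms
      - record: job:http_request_latency_under_200ms:ratio_rate5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
            /
          sum by (job) (rate(http_request_duration_seconds_count[5m]))
      # Alert when the short-window SLI drops below the 99.9% objective
      - alert: LatencySLOBreach
        expr: job:http_request_latency_under_200ms:ratio_rate5m < 0.999
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fewer than 99.9% of requests finished within 200ms"
```

The 5-minute window keeps the sketch short. An SLO is normally evaluated over a much longer period, so a real setup would likely compare the budget spent over several window lengths.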