Agent skill

monitoring

Master Kubernetes observability, monitoring with Prometheus, logging, metrics, and distributed tracing. Learn to implement comprehensive monitoring strategies.

Install this agent skill to your Project

npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-kubernetes/tree/main/skills/monitoring

SKILL.md

Kubernetes Monitoring & Observability

Executive Summary

This skill covers production Kubernetes observability across the full stack, from infrastructure metrics to application tracing. It shows how to collect metrics with Prometheus, logs with Loki, and traces with OpenTelemetry, and how to build SLO-based alerting routed through Alertmanager.

Core Competencies

1. Metrics with Prometheus

Prometheus Stack Installation

bash

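helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# The release name (here "prometheus") becomes a prefix for the chart's resource names.
# Replace the placeholder password; a values file or existing Secret is safer than --set.
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --set grafana.adminPassword=secure-password \
  --set prometheus.prometheusSpec.retention=30d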

Essential PromQL Queries

promql

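# Pod CPU usage in cores (container!="" skips the pod-level cgroup series, which would double count)
sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])) by (pod)

# Memory utilization as % of limit (containers without a limit report 0 and are excluded)
sum(container_memory_working_set_bytes{namespace="production", container!=""}) by (pod)
  / sum(container_spec_memory_limit_bytes{namespace="production", container!=""} > 0) by (pod) * 100

# Request rate per service (http_requests_total and its labels depend on your instrumentation)
sum(rate(http_requests_total[5m])) by (service)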

# Error rate (5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
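The P99 query above returns an estimate, not an exact percentile. The sketch below is a simplified Python model of the linear interpolation that `histogram_quantile` is documented to perform over cumulative `le` buckets. The bucket boundaries and counts are made-up illustration values, and the function is not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model: find the bucket containing the target rank and
    interpolate linearly between its lower and upper bound.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds, cumulative counts).
example = [(0.1, 900), (0.2, 980), (0.5, 995), (math.inf, 1000)]
print(histogram_quantile(0.99, example))  # about 0.4 seconds
```

Because the result is interpolated inside one bucket, its accuracy depends on the bucket layout: a 200ms target is only measurable with confidence if a bucket boundary sits at or near 0.2.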

ServiceMonitor

yaml

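apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
  namespace: monitoring
  labels:
    # kube-prometheus-stack selects ServiceMonitors by its release label by default
    release: prometheus
spec:
  selector:
    matchLabels:
      app: api-server        # must match the Service's labels, not the Pod's
  namespaceSelector:
    matchNames:
    - production
  endpoints:
  - port: metrics            # name of the Service port, not a number
    interval: 15s
    path: /metrics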
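A ServiceMonitor that silently matches nothing is a common cause of missing metrics. The following Python sketch models the three conditions the manifest above relies on: namespace, label selector, and named port. The dictionaries are a hypothetical simplification for illustration, not the Kubernetes or Prometheus Operator API.

```python
def selects(match_labels, labels):
    """True when every matchLabels pair is present on the object; extra labels are ignored."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def scrape_candidates(monitor, services):
    """Return names of Services that a ServiceMonitor-like dict would pick up."""
    return [
        svc["name"]
        for svc in services
        if svc["namespace"] in monitor["namespaces"]
        and selects(monitor["match_labels"], svc["labels"])
        and monitor["port"] in svc["port_names"]
    ]


# Hypothetical objects mirroring the manifest above.
monitor = {"namespaces": ["production"], "match_labels": {"app": "api-server"}, "port": "metrics"}
services = [
    {"name": "api-server", "namespace": "production",
     "labels": {"app": "api-server", "tier": "backend"}, "port_names": ["http", "metrics"]},
    {"name": "api-server-staging", "namespace": "staging",
     "labels": {"app": "api-server"}, "port_names": ["metrics"]},
    {"name": "api-unnamed-port", "namespace": "production",
     "labels": {"app": "api-server"}, "port_names": ["http"]},
]
print(scrape_candidates(monitor, services))  # ['api-server']
```

The second Service is dropped by the namespace selector and the third because it has no port named `metrics`, which are the same checks to walk through when a target is missing.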

2. Logging with Loki

Loki Stack

yaml

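apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101
    positions:
      # /tmp is lost on restart; mount a persistent path here to avoid re-reading logs
      filename: /tmp/positions.yaml
    clients:
    - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Promtail only tails files named by __path__; without it no logs are read
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log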

LogQL Queries

logql

# Errors in production (|= is a case-sensitive substring match)
{namespace="production"} |= "error"

# JSON log parsing
{app="api-server"} | json | status >= 500

# Rate of errors
rate({namespace="production"} |= "error" [5m])

3. Tracing with OpenTelemetry

OpenTelemetry Collector

yaml

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 10s
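    exporters:
      # The dedicated jaeger exporter was removed from recent Collector releases.
      # Jaeger accepts OTLP natively, so export OTLP to its gRPC port instead.
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/jaeger]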

4. SLO-Based Alerting

SLO Definition

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
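metadata:
  name: api-server-slo
  namespace: monitoring
  labels:
    # kube-prometheus-stack selects PrometheusRules by its release label by default
    release: prometheus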
spec:
  groups:
  - name: slo.rules
    rules:
    # Availability SLO: 99.9%
    - record: slo:availability:ratio
      expr: |
        sum(rate(http_requests_total{status!~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))

    # Latency SLO: P99 < 200ms
    - record: slo:latency:p99
      expr: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

  - name: slo.alerts
    rules:
    - alert: HighErrorRate
      expr: (1 - slo:availability:ratio) > 0.001
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Error rate exceeds SLO (>0.1%)"

    - alert: HighLatency
      expr: slo:latency:p99 > 0.2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "P99 latency exceeds 200ms"
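To see what the 99.9% target and the `0.001` threshold mean in practice, the arithmetic can be worked through in a few lines of Python. This is a sketch under stated assumptions: the 30-day window is an assumption (the rules above do not name one), and "burn rate" is added framing that does not appear in the original rules.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1 - slo_target)


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(burn_rate(0.001, 0.999), 2))       # 1.0
print(round(burn_rate(0.0144, 0.999), 1))      # 14.4
```

An error ratio of 0.001 corresponds to a burn rate of 1, so the `HighErrorRate` alert fires as soon as errors are spent faster than the budget allows for five minutes. Teams that find this too sensitive often alert on higher burn rates over longer windows instead.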

5. Alertmanager Configuration

yaml

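apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
stringData:
  alertmanager.yaml: |
    # Alertmanager does not expand environment variables: replace the
    # ${...} placeholders with real values before applying this Secret.
    global:
      resolve_timeout: 5m
    route:
      receiver: 'default'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      routes:
      # matchers replaces the deprecated match field
      - matchers:
        - severity="critical"
        receiver: 'pagerduty'
      - matchers:
        - severity="warning"
        receiver: 'slack'
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        api_url: '${SLACK_WEBHOOK}'
    - name: 'pagerduty'
      pagerduty_configs:
      # service_key is for PagerDuty's Prometheus integration; use routing_key for Events API v2
      - service_key: '${PD_SERVICE_KEY}'
    - name: 'slack'
      slack_configs:
      - channel: '#alerts'
        # Each Slack receiver needs a webhook unless global.slack_api_url is set
        api_url: '${SLACK_WEBHOOK}'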
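Alertmanager reads its configuration literally and does not expand environment variables, so `${SLACK_WEBHOOK}` and `${PD_SERVICE_KEY}` have to be filled in before the Secret is applied. One way to do that, sketched here with Python's standard `string.Template`, is to render the file in a deploy step. The example values and the `render_config` helper are hypothetical; tools such as `envsubst` or Helm templating serve the same purpose.

```python
from string import Template

ALERTMANAGER_TEMPLATE = """\
receivers:
- name: 'default'
  slack_configs:
  - channel: '#alerts'
    api_url: '${SLACK_WEBHOOK}'
- name: 'pagerduty'
  pagerduty_configs:
  - service_key: '${PD_SERVICE_KEY}'
"""


def render_config(template_text, values):
    """Fill ${NAME} placeholders; raises KeyError if any value is missing."""
    return Template(template_text).substitute(values)


rendered = render_config(
    ALERTMANAGER_TEMPLATE,
    {
        # Hypothetical example values; in practice read these from a secret store.
        "SLACK_WEBHOOK": "https://hooks.slack.example/T000/B000/XXXX",
        "PD_SERVICE_KEY": "example-service-key",
    },
)
print(rendered)
```

Using `substitute` rather than `safe_substitute` makes a missing value fail the deploy step instead of shipping a config with an unresolved placeholder. Note that `string.Template` treats every `$` as special, so a config containing literal dollar signs would need them escaped as `$$`.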

Integration Patterns

Uses skill: cluster-admin

Control plane metrics
Node resource monitoring

Coordinates with skill: deployments

Rollout monitoring
Autoscaling metrics

Works with skill: security

Security event alerting
Audit log analysis

Troubleshooting Guide

Decision Tree: Observability Issues

Monitoring Problem?
│
├── No metrics
│   ├── Check ServiceMonitor selector
│   ├── Verify /metrics endpoint
│   └── Check Prometheus targets
│
├── Missing logs
│   ├── Check Promtail/Fluent Bit pods
│   ├── Verify log format
│   └── Check Loki ingestion
│
└── Alert not firing
    ├── Check PromQL expression
    ├── Verify thresholds
    └── Check Alertmanager routes

Debug Commands

bash

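# Prometheus targets (prometheus-operated is created by the Prometheus Operator)
kubectl port-forward -n monitoring svc/prometheus-operated 9090
# Visit http://localhost:9090/targets

# Grafana access (the chart names the Service <release>-grafana and exposes port 80)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Check ServiceMonitors
kubectl get servicemonitors -A

# Alertmanager status (alertmanager-operated is created by the Prometheus Operator)
kubectl port-forward -n monitoring svc/alertmanager-operated 9093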

Common Challenges & Solutions

Challenge	Solution
High cardinality	Drop unbounded labels (user IDs, raw request paths); pre-aggregate with recording rules
Retention costs	Tiered storage, downsampling
Alert fatigue	SLO-based alerting
Missing traces	Auto-instrumentation
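The high-cardinality row is easiest to appreciate with numbers. The worst-case series count for one metric is the product of the distinct values of each label, as this small Python sketch shows. The label names and counts are hypothetical illustration values.

```python
import math


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical label value counts for an HTTP latency histogram.
labels = {"pod": 50, "status": 5, "path": 200, "le": 12}
print(max_series(labels))        # 600000

# Dropping the high-cardinality "path" label shrinks the bound 200-fold.
without_path = {k: v for k, v in labels.items() if k != "path"}
print(max_series(without_path))  # 3000
```

Real series counts are usually lower than this bound because not every label combination occurs, but the multiplication explains why a single unbounded label can dominate storage cost.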

Success Criteria

Metric	Target
Metric collection	100% of services
Log retention	30 days
Alert response	<5 minutes
Dashboard coverage	All critical services

Resources

Maintainer

pluginagentmarketplace Core maintainer

Source details

Full Name: pluginagentmarketplace/custom-plugin-kubernetes
Branch: main
Path in repo: skills/monitoring
License: Other
