Agent skill

grafana-dashboards

Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

Stars 5
Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/camoneart/claude-code/tree/main/skills/grafana-dashboards

SKILL.md

Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

When to Use

  • Visualize Prometheus metrics
  • Create custom dashboards
  • Implement SLO dashboards
  • Monitor infrastructure
  • Track business KPIs

Dashboard Design Principles

1. Hierarchy of Information

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

2. RED Method (Services)

  • Rate - Requests per second
  • Errors - Error rate
  • Duration - Latency/response time

3. USE Method (Resources)

  • Utilization - % time resource is busy
  • Saturation - Queue length/wait time
  • Errors - Error count

Dashboard Structure

API Monitoring Dashboard

json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ]
        },
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
      }
    ]
  }
}

Reference: See assets/api-dashboard.json

Panel Types

1. Stat Panel (Single Value)

json
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [{
    "expr": "sum(http_requests_total)"
  }],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      }
    }
  }
}

2. Time Series Graph

json
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }],
  "yaxes": [
    {"format": "percent", "max": 100, "min": 0},
    {"format": "short"}
  ]
}

3. Table Panel

json
{
  "type": "table",
  "title": "Service Status",
  "targets": [{
    "expr": "up",
    "format": "table",
    "instant": true
  }],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {"Time": true},
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}

4. Heatmap

json
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [{
    "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
    "format": "heatmap"
  }],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}

Variables

Query Variables

json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}

Use Variables in Queries

sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))

Alerts in Dashboards

json
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": {"type": "and"},
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {"type": "avg"},
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [
      {"uid": "slack-channel"}
    ]
  }
}

Dashboard Provisioning

dashboards.yml:

yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'General'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards

Common Dashboard Patterns

Infrastructure Dashboard

Key Panels:

  • CPU utilization per node
  • Memory usage per node
  • Disk I/O
  • Network traffic
  • Pod count by namespace
  • Node status

Reference: See assets/infrastructure-dashboard.json

Database Dashboard

Key Panels:

  • Queries per second
  • Connection pool usage
  • Query latency (P50, P95, P99)
  • Active connections
  • Database size
  • Replication lag
  • Slow queries

Reference: See assets/database-dashboard.json

Application Dashboard

Key Panels:

  • Request rate
  • Error rate
  • Response time (percentiles)
  • Active users/sessions
  • Cache hit rate
  • Queue length

Best Practices

  1. Start with templates (Grafana community dashboards)
  2. Use consistent naming for panels and variables
  3. Group related metrics in rows
  4. Set appropriate time ranges (default: Last 6 hours)
  5. Use variables for flexibility
  6. Add panel descriptions for context
  7. Configure units correctly
  8. Set meaningful thresholds for colors
  9. Use consistent colors across dashboards
  10. Test with different time ranges

Dashboard as Code

Terraform Provisioning

hcl
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}

Ansible Provisioning

yaml
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana

Reference Files

  • assets/api-dashboard.json - API monitoring dashboard
  • assets/infrastructure-dashboard.json - Infrastructure dashboard
  • assets/database-dashboard.json - Database monitoring dashboard
  • references/dashboard-design.md - Dashboard design guide

Related Skills

  • prometheus-configuration - For metric collection
  • slo-implementation - For SLO dashboards

Expand your agent's capabilities with these related and highly-rated skills.

camoneart/claude-code

translating-technical-articles

Translates English technical articles (engineering blogs, documentation) to Japanese while preserving layout and structure. Use when the user asks to translate an article, convert English content to Japanese, or mentions translating a URL or technical blog post.

5 0
Explore
camoneart/claude-code

guiding-tdd-development

Guide Test-Driven Development with task splitting, Red-Green-Refactor cycle, and framework auto-detection. Use when developing features with TDD approach, fixing bugs test-first, or when user mentions "TDD", "テスト駆動開発", "test-first", "/tdd".

5 0
Explore
camoneart/claude-code

distributed-tracing

Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.

5 0
Explore
camoneart/claude-code

dependency-upgrade

Manage major dependency version upgrades with compatibility analysis, staged rollout, and comprehensive testing. Use when upgrading framework versions, updating major dependencies, or managing breaking changes in libraries.

5 0
Explore
camoneart/claude-code

stripe-integration

Implement Stripe payment processing for robust, PCI-compliant payment flows including checkout, subscriptions, and webhooks. Use when integrating Stripe payments, building subscription systems, or implementing secure checkout flows.

5 0
Explore
camoneart/claude-code

typescript-advanced-types

Master TypeScript's advanced type system including generics, conditional types, mapped types, template literals, and utility types for building type-safe applications. Use when implementing complex type logic, creating reusable type utilities, or ensuring compile-time type safety in TypeScript projects.

5 0
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results