Agent skill

observability-alert-manager

Configure Grafana alerts for Claude Code anomalies and thresholds. Use when setting up monitoring alerts for sessions, errors, context usage, or subagents.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/observability-alert-manager

SKILL.md

Observability Alert Manager

Configure and manage Grafana alerts for Claude Code monitoring using enhanced telemetry.

Data Source

Primary: {job="claude_code_enhanced"} in Loki

Operations

create-alert

Define new alert rule. Parameters: name, query (LogQL), threshold, duration, severity, notification.

list-alerts

Show all configured alerts and their status.

test-alert

Simulate alert conditions.

delete-alert

Remove alert rule.

Pre-built Alert Templates

Session Alerts

  1. Long Session Duration: Session >1 hour

    logql
    {job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600
    
  2. High Turn Count: Session >50 turns

    logql
    {job="claude_code_enhanced", event_type="session_end"} | json | turn_count > 50
    
  3. Session Error Spike: >5 errors in session

    logql
    {job="claude_code_enhanced", event_type="session_end"} | json | error_count > 5
    

Error Alerts

  1. High Error Rate: >5 errors/hour

    logql
    count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5
    
  2. Specific Tool Failures: Bash errors

    logql
    count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error", tool="Bash"} [1h]) > 3
    

Context Alerts

  1. High Context Usage: >80% context window

    logql
    {job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80
    
  2. Auto Compaction Triggered: Context full

    logql
    {job="claude_code_enhanced", event_type="context_compact", trigger="auto"}
    

Subagent Alerts

  1. Excessive Subagent Spawning: >10 subagents/session
    logql
    {job="claude_code_enhanced", event_type="session_end"} | json | subagents_spawned > 10
    

Activity Alerts

  1. Telemetry Staleness: No data >10min

    logql
    absent_over_time({job="claude_code_enhanced"} [10m])
    
  2. Unusual Activity Spike: >100 tool calls/hour

    logql
    count_over_time({job="claude_code_enhanced", event_type="tool_call"} [1h]) > 100
    

Prompt Pattern Alerts

  1. Debugging Session Spike: Many debugging prompts
    logql
    count_over_time({job="claude_code_enhanced", event_type="user_prompt", pattern="debugging"} [1h]) > 10
    

Example Alert Configurations

Create High Error Rate Alert

bash
create-alert \
  --name "High Error Rate" \
  --query 'count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5' \
  --severity warning \
  --notification slack

Create Context Usage Alert

bash
create-alert \
  --name "High Context Usage" \
  --query '{job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80' \
  --severity info \
  --notification email

Create Session Duration Alert

bash
create-alert \
  --name "Long Session Warning" \
  --query '{job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600' \
  --severity info \
  --notification dashboard

Grafana Alert Setup

Via Grafana UI

  1. Navigate to Alerting → Alert rules
  2. Create new rule with Loki data source
  3. Enter LogQL query from templates above
  4. Configure conditions and notifications

Via API

bash
curl -X POST http://localhost:3000/api/ruler/grafana/api/v1/rules/claude-code \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "name": "claude-code-alerts",
    "rules": [
      {
        "alert": "HighErrorRate",
        "expr": "count_over_time({job=\"claude_code_enhanced\", status=\"error\"} [1h]) > 5",
        "for": "5m",
        "labels": {"severity": "warning"},
        "annotations": {"summary": "High error rate detected"}
      }
    ]
  }'

Notification Channels

  • Slack: Webhook integration
  • Email: SMTP configuration
  • PagerDuty: Incident management
  • Dashboard: On-screen annotations

Alert Severity Levels

Level Use Case
critical Immediate action required
warning Needs attention soon
info Informational, no action needed

Scripts

  • scripts/create-alert.sh - Create new alert
  • scripts/list-alerts.sh - List all alerts
  • scripts/test-alerts.sh - Test alert conditions
  • scripts/import-alert-templates.sh - Import all pre-built templates

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results