investigate-incident
Investigate platform incidents, perform RCA, create incident documentation, and follow alert runbooks in the Kagenti platform
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/investigate-incident
Investigate Incident Skill
This skill helps you investigate incidents, perform root cause analysis (RCA), and create comprehensive incident documentation.
When to Use
- After alerts fire (check runbooks first)
- When tests fail unexpectedly
- When a user reports that a service is unavailable
- When pods are in CrashLoopBackOff or ImagePullBackOff
- When services show degraded performance
- After deployment issues
What This Skill Does
- Follow Runbooks: Execute alert-specific investigation steps
- Gather Evidence: Collect logs, metrics, events, pod status
- Root Cause Analysis: Identify underlying issues
- Document Incidents: Create structured RCA in TODO_INCIDENTS.md
- Track Fixes: Plan and validate remediation
Investigation Workflow
1. Check If Alert Has Runbook
# List all available runbooks
ls -1 docs/runbooks/alerts/*.md
# For a specific alert, check its runbook_url annotation
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    # Rules without annotations should not raise a KeyError
    print('Runbook URL:', (rule.get('annotations') or {}).get('runbook_url', 'N/A'))
else:
    print('No alert rule found with uid:', alert_uid)
"
2. Follow Runbook Steps
Runbooks are located at: docs/runbooks/alerts/<alert-uid>.md
Standard runbook sections:
- Meaning - What this alert indicates
- Impact - Business/platform impact
- Diagnosis - Investigation commands
- Mitigation - How to fix
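A minimal sketch for checking that a runbook contains the four standard sections listed above. The actual runbook layout is not shown in this guide, so the sketch assumes each section title starts a line, optionally after Markdown heading markers. `missing_sections` is a hypothetical helper name, not part of the platform.

```python
import re

# Section names taken from the list above.
STANDARD_SECTIONS = ("Meaning", "Impact", "Diagnosis", "Mitigation")


def missing_sections(runbook_text: str) -> list:
    """Return the standard sections that never appear as a line-leading title.

    Assumption: a title starts a line, optionally after '#' heading markers,
    for example '## Diagnosis' or 'Diagnosis'.
    """
    missing = []
    for name in STANDARD_SECTIONS:
        pattern = rf"^[ \t]*#*[ \t]*{re.escape(name)}\b"
        if not re.search(pattern, runbook_text, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Illustrative runbook text, not a real file from the repository.
    sample = "# Prometheus Down\n## Meaning\n...\n## Impact\n...\n## Diagnosis\n..."
    print(missing_sections(sample))
```

This is a loose check: a prose line that happens to begin with a section name also counts as a match, so treat an empty result as a hint rather than proof that the runbook is complete.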
Example: Follow Prometheus Down runbook:
# From docs/runbooks/alerts/prometheus-down.md
# 1. Check pod status
kubectl get pods -n observability -l app=prometheus
# 2. Check pod logs
kubectl logs -n observability deployment/prometheus --tail=100
# 3. Check events
kubectl get events -n observability --field-selector involvedObject.name=prometheus --sort-by='.lastTimestamp'
# 4. Test Prometheus endpoint
kubectl exec -n observability deployment/grafana -- \
curl -s http://prometheus.observability.svc:9090/-/ready
3. Gather Comprehensive Evidence
Pod Status & Events:
# Get pod status in namespace
kubectl get pods -n <namespace>
# Detailed pod description
kubectl describe pod <pod-name> -n <namespace>
# Recent events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# All failing pods across platform
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull|Pending"
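The `grep` above matches anywhere on the line, including pod names. As a hedged alternative, the sketch below parses the text output of `kubectl get pods -A` and matches only the STATUS column. It assumes the default column order (NAMESPACE, NAME, READY, STATUS, ...). `failing_pods` is a hypothetical helper and the sample rows are made up for illustration.

```python
import re

# Same pattern as the grep command above.
FAILING = re.compile(r"Error|CrashLoop|ImagePull|Pending")


def failing_pods(kubectl_output: str) -> list:
    """Return (namespace, pod, status) for rows whose STATUS looks unhealthy."""
    rows = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 4:
            continue  # not a pod row
        namespace, name, _ready, status = cols[:4]
        if FAILING.search(status):
            rows.append((namespace, name, status))
    return rows


if __name__ == "__main__":
    sample = (
        "NAMESPACE      NAME           READY   STATUS             RESTARTS      AGE\n"
        "observability  prometheus-0   0/2     CrashLoopBackOff   5 (2m ago)    10m\n"
        "keycloak       keycloak-0     1/1     Running            0             3h\n"
    )
    for namespace, name, status in failing_pods(sample):
        print(f"{namespace}/{name}: {status}")
```

One way to feed it real data is to capture the command output with `subprocess.run(["kubectl", "get", "pods", "-A"], capture_output=True, text=True).stdout`.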
Logs:
# Current container logs
kubectl logs -n <namespace> <pod-name> --tail=100
# Previous container (if crashed)
kubectl logs -n <namespace> <pod-name> --previous
# Specific container in pod
kubectl logs -n <namespace> <pod-name> -c <container-name>
# All containers in pod
kubectl logs -n <namespace> <pod-name> --all-containers=true
# Query Loki for errors (last 5 minutes)
# Note: `date -v-5M` is BSD/macOS syntax; on GNU/Linux use `date -u -d '5 minutes ago' +%s`
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="<namespace>"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode "start=$(date -u -v-5M +%s)000000000" \
  --data-urlencode "end=$(date -u +%s)000000000" | python3 -m json.tool
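The Loki query above builds its `start` and `end` values with `date -v-5M`, which only works with BSD/macOS `date`. A portable sketch of the same arithmetic is shown here: Unix-epoch seconds with nine zeros appended to get the nanosecond timestamps that `query_range` accepts. `loki_time_range` is a hypothetical helper name.

```python
import time
from typing import Optional, Tuple


def loki_time_range(minutes: int, now: Optional[float] = None) -> Tuple[str, str]:
    """Return (start, end) as Unix-epoch nanosecond strings covering the last N minutes."""
    end_s = int(time.time() if now is None else now)
    start_s = end_s - minutes * 60
    # Append nine zeros to convert whole seconds to nanoseconds.
    return f"{start_s}000000000", f"{end_s}000000000"


if __name__ == "__main__":
    start, end = loki_time_range(5)
    print("start =", start)
    print("end   =", end)
```

The two strings can be passed as the `start` and `end` parameters of the `curl` call in place of the `$(date ...)` substitutions.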
Metrics:
# Check if service is up
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
--data-urlencode 'query=up{job="<job-name>"}' | python3 -m json.tool
# Check replica availability
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
--data-urlencode 'query=kube_deployment_status_replicas_available{deployment="<name>"}' \
| python3 -m json.tool
# Check pod restarts
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
--data-urlencode 'query=kube_pod_container_status_restarts_total{pod=~"<pod-pattern>"}' \
| python3 -m json.tool
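The queries above print the raw Prometheus response. A minimal sketch of reading that response in code follows. It relies on the standard instant-query shape (`status`, then `data.result[]` entries carrying `metric` and a `[timestamp, "value"]` pair) and lists the series whose `up` value is 0. `down_targets` is a hypothetical helper and the sample payload is illustrative.

```python
import json


def down_targets(response_json: str) -> list:
    """Return the label sets of series whose sample value is 0."""
    payload = json.loads(response_json)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0.0:
            down.append(series["metric"])
    return down


if __name__ == "__main__":
    # Illustrative response for: up{job="<job-name>"}
    sample = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"job": "demo", "instance": "10.0.0.1:9090"}, "value": [1700000000, "0"]},
                {"metric": {"job": "demo", "instance": "10.0.0.2:9090"}, "value": [1700000000, "1"]},
            ],
        },
    })
    print(down_targets(sample))
```

An empty `result` list means the query matched no series at all, which for `up` usually deserves its own investigation (wrong job name, or the target was never scraped).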
ArgoCD Application Status:
# Check application health
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web
# Check sync status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -E "Degraded|OutOfSync"
# View recent sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
4. Identify Root Cause
Common Root Causes:
- Configuration Error
  - Check recent Git commits: git log --oneline -10
  - Check ArgoCD diff: argocd app diff <app-name> --port-forward ...
  - Validate YAML: kustomize build components/...
- Image Issues
  - Check image availability: docker exec kagenti-demo-control-plane crictl images | grep <image>
  - Check ImagePullBackOff: kubectl describe pod <pod-name> | grep -A10 "Events"
- Resource Constraints
  - Check pod resources: kubectl top pods -n <namespace>
  - Check node resources: kubectl top nodes
  - Check OOM kills: kubectl get events -A | grep OOM
- Dependency Failure
  - Check if dependent service is healthy
  - Check service endpoints: kubectl get endpoints -n <namespace>
  - Test connectivity: kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- curl http://service-name
- mTLS/Network Issues
  - Check Istio sidecar: kubectl get pods -n <namespace> (should show 2/2)
  - Check PeerAuthentication: kubectl get peerauthentication -A
  - Check sidecar logs: kubectl logs -n <namespace> <pod-name> -c istio-proxy
- Certificate Issues
  - Check certificate status: kubectl get certificate -A
  - Check cert-manager logs: kubectl logs -n cert-manager deployment/cert-manager
5. Create Incident Documentation
Create entry in TODO_INCIDENTS.md:
## Incident #X: [Alert Name] - [Brief Description]
**Status**: 🔴 Active / 🟡 Investigating / 🟢 Resolved
**Detected**: 2025-11-17 08:31:40 UTC
**Severity**: Critical / Warning / Info
**Components Affected**:
- Component 1
- Component 2
### Summary
Brief description of what happened.
### Investigation
**Timeline**:
- 08:31 - Alert fired
- 08:35 - Checked pod status, found CrashLoopBackOff
- 08:40 - Reviewed logs, identified error message
- 08:45 - Identified root cause
**Evidence Collected**:
1. **Pod Status**:
   NAME            READY   STATUS             RESTARTS
   component-xxx   0/2     CrashLoopBackOff   5
2. **Error Logs**:
   ERROR: Failed to connect to database: connection refused
3. **Events**:
   Back-off restarting failed container
### Root Cause Analysis
**Root Cause**: [Specific technical reason]
**Why it happened**:
- Contributing factor 1
- Contributing factor 2
**Why alert fired**:
- PromQL query: `query_here`
- Query returned: `value`
- Threshold: `> threshold`
### Resolution
**Fix Applied**:
```bash
# Commands to fix the issue
kubectl apply -f ...
```
**Verification**:
```bash
# Commands to verify fix
kubectl get pods -n <namespace>
# Output showing healthy state
```
**Time to Resolution**: XX minutes
### Lessons Learned
- What went well: Early detection via alerting
- What could improve: Need better validation before deploy
- Action items:
  - Add pre-deployment validation
  - Update runbook with this scenario
  - Add integration test for this case
### Related
- Alert: alert-uid
- Runbook: docs/runbooks/alerts/alert-uid.md
- Git commits: abc1234, def5678
6. Validate Fix
After applying fix:
# 1. Verify pods are healthy
kubectl get pods -n <namespace>
# 2. Check alert stopped firing
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
  -u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
for a in firing:
    print(f\"{a['labels']['alertname']}: {a['labels']['severity']}\")
"
# 3. Run integration tests
pytest tests/integration/test_<component>.py -v
# 4. Check platform status
./scripts/platform-status.sh
# 5. Verify in Grafana UI
open https://grafana.localtest.me:9443/alerting/list
Common Investigation Patterns
Pattern 1: Pod Won't Start
# Investigation flow
kubectl get pods -n <namespace> # Get pod status
kubectl describe pod <pod-name> -n <namespace> # Check events
kubectl logs <pod-name> -n <namespace> # Check logs (if available)
# Common causes:
# - ImagePullBackOff: Image not found or not loaded
# - CrashLoopBackOff: Application exits on startup
# - Pending: Resource constraints or scheduling issues
# - Init:Error: Init container failed
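The status-to-cause list above can be kept as a small lookup table for triage scripts. This is a sketch only: the causes come from the comments above, but pairing each status with a suggested next command is an assumption drawn from commands used elsewhere in this guide. `TRIAGE_HINTS` and `triage_hint` are hypothetical names.

```python
# Status -> (common cause, suggested next command).
# Causes are taken from the list above; the command pairing is an assumption.
TRIAGE_HINTS = {
    "ImagePullBackOff": (
        "Image not found or not loaded",
        "kubectl describe pod <pod-name> -n <namespace>",
    ),
    "CrashLoopBackOff": (
        "Application exits on startup",
        "kubectl logs <pod-name> -n <namespace> --previous",
    ),
    "Pending": (
        "Resource constraints or scheduling issues",
        "kubectl describe pod <pod-name> -n <namespace>",
    ),
    "Init:Error": (
        "Init container failed",
        "kubectl logs <pod-name> -n <namespace> -c <container-name>",
    ),
}

# Used for any status the table does not cover.
FALLBACK = ("Not covered by this table", "kubectl describe pod <pod-name> -n <namespace>")


def triage_hint(status: str) -> tuple:
    """Return (cause, next_command) for a pod STATUS value."""
    return TRIAGE_HINTS.get(status, FALLBACK)


if __name__ == "__main__":
    cause, command = triage_hint("CrashLoopBackOff")
    print(f"Likely cause: {cause}")
    print(f"Try next:     {command}")
```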
Pattern 2: Service Unavailable
# Investigation flow
kubectl get pods -n <namespace> -l app=<service> # Check pods
kubectl get svc -n <namespace> <service-name> # Check service
kubectl get endpoints -n <namespace> <service-name> # Check endpoints
kubectl get httproute -n <namespace> # Check routes (if using Gateway API)
# Test connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
-- curl -v http://<service-name>.<namespace>.svc:PORT
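When a Service exists but traffic fails, the endpoints check above is often the deciding step. The sketch below reads the JSON form of an Endpoints object and lists the ready backend IPs. The `-o json` flag is an addition to the command shown above, and `ready_addresses` is a hypothetical helper.

```python
import json


def ready_addresses(endpoints_json: str) -> list:
    """Return ready backend IPs from `kubectl get endpoints <name> -o json` output.

    An empty list means the Service currently has no ready pods behind it.
    """
    endpoints = json.loads(endpoints_json)
    ips = []
    for subset in endpoints.get("subsets") or []:
        # Pods that fail readiness are listed under 'notReadyAddresses' instead.
        for address in subset.get("addresses") or []:
            ips.append(address["ip"])
    return ips


if __name__ == "__main__":
    # Illustrative object: one ready backend, one not ready.
    sample = json.dumps({
        "kind": "Endpoints",
        "metadata": {"name": "demo"},
        "subsets": [{
            "addresses": [{"ip": "10.244.0.12"}],
            "notReadyAddresses": [{"ip": "10.244.0.13"}],
            "ports": [{"port": 8080}],
        }],
    })
    print(ready_addresses(sample))
```

If the list is empty while pods are Running, the usual suspects are a selector that does not match the pod labels or pods failing their readiness probe.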
Pattern 3: High Resource Usage
# Investigation flow
kubectl top pods -n <namespace> --sort-by=memory # Memory usage
kubectl top pods -n <namespace> --sort-by=cpu # CPU usage
# Check limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
# Check for OOM kills
kubectl get events -n <namespace> | grep OOM
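The jsonpath query above prints every container's `resources` block, which gets hard to scan in a busy namespace. As a sketch, the same data from `kubectl get pods -n <namespace> -o json` can be filtered for containers that set no memory limit. `containers_without_memory_limit` is a hypothetical helper and the sample pod list is illustrative.

```python
import json


def containers_without_memory_limit(pods_json: str) -> list:
    """Return (pod, container) pairs that define no resources.limits.memory."""
    pods = json.loads(pods_json)
    unbounded = []
    for pod in pods.get("items", []):
        for container in pod["spec"]["containers"]:
            limits = (container.get("resources") or {}).get("limits") or {}
            if "memory" not in limits:
                unbounded.append((pod["metadata"]["name"], container["name"]))
    return unbounded


if __name__ == "__main__":
    sample = json.dumps({
        "items": [{
            "metadata": {"name": "demo-pod"},
            "spec": {"containers": [
                {"name": "app", "resources": {"limits": {"memory": "256Mi"}}},
                {"name": "sidecar", "resources": {}},
            ]},
        }],
    })
    print(containers_without_memory_limit(sample))
```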
Pattern 4: Frequent Restarts
# Investigation flow
kubectl get pods -n <namespace> # Check RESTARTS column
kubectl describe pod <pod-name> -n <namespace> # Check restart reason
kubectl logs <pod-name> -n <namespace> --previous # Logs before crash
# Check restart metrics
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
--data-urlencode 'query=increase(kube_pod_container_status_restarts_total{pod="<pod-name>"}[1h])'
Pattern 5: ArgoCD App OutOfSync
# Investigation flow
argocd app get <app-name> --port-forward ... # Get status
argocd app diff <app-name> --port-forward ... # See differences
# Check last sync
argocd app history <app-name> --port-forward ...
# Force sync if needed (caution: --prune deletes resources that are no longer defined in Git)
argocd app sync <app-name> --force --prune --port-forward ...
Integration with Other Skills
Use check-alerts skill to find which alerts are firing:
# This automatically invokes check-alerts skill
"What alerts are currently firing?"
Use check-logs skill to query specific error patterns:
# This automatically invokes check-logs skill
"Show me error logs from keycloak namespace in the last 10 minutes"
Use check-metrics skill to verify service health:
# This automatically invokes check-metrics skill
"What's the CPU usage of pods in observability namespace?"
Incident Priority Guidelines
Critical (P0):
- Platform-wide outage
- All users affected
- Data loss risk
- Security breach
High (P1):
- Major feature unavailable
- Multiple users affected
- Performance severely degraded
Medium (P2):
- Minor feature unavailable
- Some users affected
- Workaround available
Low (P3):
- Cosmetic issue
- Minimal impact
- Can be fixed in next release
Related Documentation
- TODO_INCIDENTS.md - Active incident tracking
- docs/runbooks/alerts/ - Alert-specific runbooks
- CLAUDE.md Alert Monitoring - Alert workflow
- docs/04-observability/ALERT_TESTING_GUIDE.md - Alert testing
Pro Tips
- Always check runbook first: Save time by following proven steps
- Capture evidence early: Logs/events may be lost if pod restarts
- Use --previous logs: For crashed pods, previous container logs are critical
- Check Git history: Recent commits often reveal configuration changes
- Document as you go: Don't wait until end to write RCA
- Verify fix completely: Check pods, alerts, tests, and platform status
- Update runbook: Add new scenarios discovered during investigation
🤖 Generated with Claude Code