Agent skill

k8s-incident

Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.

Stars 865
Forks 168

Install this agent skill to your Project

npx add-skill https://github.com/rohitg00/kubectl-mcp-server/tree/main/kubernetes-skills/claude/k8s-incident

Metadata

Additional technical details for this skill

tools
15
author
rohitg00
version
1.0.0
category
observability

SKILL.md

Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

When to Apply

Use this skill when:

  • User mentions: "incident", "outage", "emergency", "down", "not working"
  • Operations: emergency response, production issues, service degradation
  • Keywords: "urgent", "broken", "fix", "restore", "recover"

Priority Rules

Priority Rule Impact Tools
1 Check control plane first CRITICAL get_pods(namespace="kube-system")
2 Assess node health CRITICAL get_nodes
3 Gather events before changes HIGH get_events
4 Document timeline HIGH Manual notes
5 Rollback if safe MEDIUM rollback_deployment

Quick Reference

Incident First Tool Next Steps
Pod failure get_pod_logs(previous=True) describe_pod, get_events
Node down describe_node Check kubelet logs
Service unreachable get_endpoints get_network_policies
Control plane get_pods(namespace="kube-system") Check API server logs

Incident Triage

Quick Health Check

python
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)

Severity Assessment

Indicator Severity Action
Multiple nodes NotReady Critical Escalate immediately
kube-system pods failing Critical Control plane issue
Single pod CrashLoop Medium Debug pod
High latency Medium Check resources

Runbook: Pod Failures

CrashLoopBackOff

python
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)

Common Causes:

  • OOMKilled → Increase memory limits
  • Exit code 1 → Application error in logs
  • Exit code 137 → Killed by OOM or SIGKILL
  • Exit code 143 → Graceful SIGTERM

ImagePullBackOff

python
describe_pod(name, namespace)
get_secrets(namespace)

Pending Pod

python
describe_pod(name, namespace)
get_nodes()
get_events(namespace)

Runbook: Node Issues

Node NotReady

python
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")

Node DiskPressure

python
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")

Runbook: Network Issues

Service Not Accessible

python
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)

DNS Resolution Failures

python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")

With Cilium

python
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)

With Istio

python
istio_analyze_tool(namespace)
istio_proxy_status_tool()

Runbook: Storage Issues

PVC Pending

python
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)

Pod Stuck in ContainerCreating

python
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)

Runbook: Control Plane Issues

API Server Unavailable

python
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")

etcd Issues

python
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")

Emergency Actions

Force Delete Pod

python
delete_pod(name, namespace, grace_period=0, force=True)

Rollback Deployment

python
rollback_deployment(name, namespace, revision=0)

Helm Rollback

python
rollback_helm_release(name, namespace, revision=1)

Diagnostic Collection Script

For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.

Multi-Cluster Incident Response

Check all clusters:

python
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)

Post-Incident

Document Timeline

  1. When did the incident start?
  2. What was the impact?
  3. What was the root cause?
  4. What fixed it?

Prevent Recurrence

  • Add monitoring/alerting
  • Improve resource limits
  • Add readiness probes
  • Document runbook

Related Skills

  • k8s-troubleshoot - Detailed debugging
  • k8s-security - Security incidents

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results