Agent skill

k8s-incident

Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.

Install this agent skill in your project

npx add-skill https://github.com/rohitg00/kubectl-mcp-server/tree/main/kubernetes-skills/claude/k8s-incident

Metadata

Additional technical details for this skill

tools: 15
author: rohitg00
version: 1.0.0
category: observability

SKILL.md

Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

When to Apply

Use this skill when:

User mentions: "incident", "outage", "emergency", "down", "not working"
Operations: emergency response, production issues, service degradation
Keywords: "urgent", "broken", "fix", "restore", "recover"

Priority Rules

Priority	Rule	Impact	Tools
1	Check control plane first	CRITICAL	`get_pods(namespace="kube-system")`
2	Assess node health	CRITICAL	`get_nodes`
3	Gather events before changes	HIGH	`get_events`
4	Document timeline	HIGH	Manual notes
5	Rollback if safe	MEDIUM	`rollback_deployment`

Quick Reference

Incident	First Tool	Next Steps
Pod failure	`get_pod_logs(previous=True)`	`describe_pod`, `get_events`
Node down	`describe_node`	Check kubelet logs
Service unreachable	`get_endpoints`	`get_network_policies`
Control plane	`get_pods(namespace="kube-system")`	Check API server logs

Incident Triage

Quick Health Check

python

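get_nodes()                        # node readiness and conditions
get_pods(namespace="kube-system")  # control plane and system add-ons
get_events(namespace)              # recent events in the affected namespace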

Severity Assessment

Indicator	Severity	Action
Multiple nodes NotReady	Critical	Escalate immediately
kube-system pods failing	Critical	Control plane issue
Single pod CrashLoop	Medium	Debug pod
High latency	Medium	Check resources
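The table above can be read as a small decision procedure. The sketch below is one way to encode it. The function name `assess_severity` and its parameters are illustrative assumptions, not part of the skill's tool set; the counts would have to be derived from whatever `get_nodes` and `get_pods` return in your environment.

```python
from typing import Tuple


def assess_severity(not_ready_nodes: int,
                    failing_kube_system_pods: int,
                    crashlooping_pods: int = 0,
                    high_latency: bool = False) -> Tuple[str, str]:
    """Map triage indicators to (severity, action), checked in table order."""
    if not_ready_nodes > 1:
        return ("Critical", "Escalate immediately")
    if failing_kube_system_pods > 0:
        return ("Critical", "Control plane issue")
    if crashlooping_pods == 1:
        return ("Medium", "Debug pod")
    if high_latency:
        return ("Medium", "Check resources")
    # Combinations the table does not cover are left for manual triage.
    return ("Unclassified", "Continue triage")


severity, action = assess_severity(not_ready_nodes=3, failing_kube_system_pods=0)
```

Critical indicators are checked first, so a cluster with both failing nodes and a crash-looping pod is reported as Critical.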

Runbook: Pod Failures

CrashLoopBackOff

python

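get_pod_logs(name, namespace, previous=True)  # logs from the previous (crashed) container instance
describe_pod(name, namespace)                 # last state, exit code, restart count
get_events(namespace, field_selector="involvedObject.name=<pod>")  # events for this pod only
get_pod_metrics(name, namespace)              # current CPU and memory usage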

Common Causes:

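OOMKilled → Container exceeded its memory limit; raise the limit or reduce memory usage
Exit code 1 → Application error; check the logs
Exit code 137 → Killed by SIGKILL (128 + 9), typically by the OOM killer
Exit code 143 → Terminated by SIGTERM (128 + 15), usually a requested shutdown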
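Exit codes above 128 conventionally encode 128 plus the number of the signal that ended the process, which is where 137 and 143 in the list above come from. A minimal sketch of that decoding follows; `explain_exit_code` and the `SIGNAL_NAMES` table are hypothetical helpers written for this example, and the table covers only a few common Linux signals.

```python
# Linux signal numbers, hard-coded so the sketch behaves the same on any OS.
SIGNAL_NAMES = {1: "SIGHUP", 2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL",
                11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Give a first-guess reading of a container exit code."""
    if code == 0:
        return "completed successfully"
    if code > 128:
        # By convention: exit code = 128 + signal number.
        signal_number = code - 128
        name = SIGNAL_NAMES.get(signal_number, f"signal {signal_number}")
        return f"terminated by {name}"
    return "application error, check the logs"


reading_137 = explain_exit_code(137)
reading_143 = explain_exit_code(143)
```

A SIGKILL reading alone does not prove an out-of-memory kill; the container's last terminated state reporting the reason `OOMKilled` in the `describe_pod` output is the confirming signal.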

ImagePullBackOff

python

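describe_pod(name, namespace)  # the events show the pull error (bad tag, missing auth, unreachable registry)
get_secrets(namespace)         # confirm the image pull secret the pod references exists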

Pending Pod

python

describe_pod(name, namespace)
get_nodes()
get_events(namespace)

Runbook: Node Issues

Node NotReady

python

describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")

Node DiskPressure

python

describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
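Both node runbooks come down to reading the node's status conditions: `Ready` should be `True`, while pressure conditions such as `DiskPressure` should be `False`. The sketch below assumes the node is available as a dictionary shaped like the Kubernetes Node object (`status.conditions` with `type` and `status` fields). Whether `describe_node` returns that shape is an assumption, and `unhealthy_conditions` is a hypothetical helper.

```python
from typing import List


def unhealthy_conditions(node: dict) -> List[str]:
    """List node conditions that are not in their healthy state."""
    problems = []
    for condition in node.get("status", {}).get("conditions", []):
        ctype = condition.get("type")
        status = condition.get("status")
        if ctype == "Ready":
            # Ready is healthy only when "True"; "False" and "Unknown" are both problems.
            if status != "True":
                problems.append(f"Ready={status}")
        elif status != "False":
            # Pressure and NetworkUnavailable conditions are healthy when "False".
            problems.append(f"{ctype}={status}")
    return problems


# Hand-written sample node used only for illustration.
sample_node = {"status": {"conditions": [
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure", "status": "True"},
    {"type": "Ready", "status": "Unknown"},
]}}
problems = unhealthy_conditions(sample_node)
```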

Runbook: Network Issues

Service Not Accessible

python

get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
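A common reason for empty endpoints is a Service selector that no pod's labels satisfy. The check can be sketched as below, assuming the selector and pod labels have already been extracted into plain dictionaries. `selector_matches` and `pods_backing_service` are hypothetical helpers, and the pod data is made up for the example.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True when every selector key/value pair appears in the pod's labels."""
    # A Service with no selector does not select any pods automatically.
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def pods_backing_service(selector: dict, pods: list) -> list:
    """Return names of pods whose labels satisfy the Service selector."""
    return [pod["name"] for pod in pods
            if selector_matches(selector, pod.get("labels", {}))]


selector = {"app": "web", "tier": "frontend"}
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web"}},  # missing the tier label
    {"name": "api-1", "labels": {"app": "api", "tier": "frontend"}},
]
matching = pods_backing_service(selector, pods)
```

If the result is empty while pods are running, compare the selector against the pod labels before looking at network policies.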

DNS Resolution Failures

python

get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")  # CoreDNS pods carry the k8s-app=kube-dns label
get_pod_logs("coredns-xxx", "kube-system")  # substitute a pod name returned by the call above

With Cilium

python

cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)

With Istio

python

istio_analyze_tool(namespace)
istio_proxy_status_tool()

Runbook: Storage Issues

PVC Pending

python

describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)

Pod Stuck in ContainerCreating

python

describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)

Runbook: Control Plane Issues

API Server Unavailable

python

get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")

etcd Issues

python

get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")  # substitute the etcd pod name returned by the call above

Emergency Actions

Force Delete Pod

python

delete_pod(name, namespace, grace_period=0, force=True)  # skips graceful termination; the container may keep running on the node

Rollback Deployment

python

rollback_deployment(name, namespace, revision=0)

Helm Rollback

python

rollback_helm_release(name, namespace, revision=1)

Diagnostic Collection Script

For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.

Multi-Cluster Incident Response

Check all clusters:

python

for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
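During an incident one of the clusters may itself be unreachable, in which case a plain loop like the one above stops at the first failing call. The sketch below shows one way to keep sweeping and record the error instead. `sweep_contexts` is a hypothetical wrapper, and `fake_get_nodes` is a stand-in for a real tool call, assumed to accept a `context` keyword as in the loop above.

```python
from typing import Any, Callable, Dict, Iterable


def sweep_contexts(contexts: Iterable[str],
                   checks: Dict[str, Callable[..., Any]]) -> Dict[str, Dict[str, Any]]:
    """Run every check against every context, recording failures instead of raising."""
    results: Dict[str, Dict[str, Any]] = {}
    for context in contexts:
        results[context] = {}
        for label, check in checks.items():
            try:
                results[context][label] = check(context=context)
            except Exception as exc:  # an unreachable cluster should not abort the sweep
                results[context][label] = f"ERROR: {exc}"
    return results


# Stand-in for a real tool call, used only to demonstrate the behaviour.
def fake_get_nodes(context: str) -> list:
    if context == "prod-2":
        raise ConnectionError("cluster unreachable")
    return [f"{context}-node-1"]


report = sweep_contexts(["prod-1", "prod-2", "staging"], {"nodes": fake_get_nodes})
```

In real use the `checks` dictionary would hold the actual tool functions, for example `{"nodes": get_nodes}`.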

Post-Incident

Document Timeline

When did the incident start?
What was the impact?
What was the root cause?
What fixed it?
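Answering these questions is easier when notes are timestamped as the incident unfolds. A minimal sketch of such a record follows; the `IncidentTimeline` class is a hypothetical helper, not something the skill provides, and the entries are made-up examples.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class IncidentTimeline:
    title: str
    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def note(self, text: str, at: Optional[datetime] = None) -> None:
        """Record an observation or action; defaults to the current UTC time."""
        self.entries.append((at or datetime.now(timezone.utc), text))

    def duration_seconds(self) -> float:
        """Seconds between the earliest and latest entry (0.0 if fewer than two)."""
        if len(self.entries) < 2:
            return 0.0
        times = [when for when, _ in self.entries]
        return (max(times) - min(times)).total_seconds()

    def render(self) -> str:
        """Title followed by entries in chronological order, one per line."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        lines = [f"{when.isoformat()}  {text}" for when, text in ordered]
        return "\n".join([self.title] + lines)


# Illustrative entries only.
timeline = IncidentTimeline("checkout outage")
timeline.note("Alert fired: 5xx rate above threshold",
              at=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc))
timeline.note("Rolled back deployment; error rate recovered",
              at=datetime(2024, 1, 1, 10, 45, tzinfo=timezone.utc))
```

Using timezone-aware UTC timestamps throughout keeps entries from different responders comparable.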

Prevent Recurrence

Add monitoring/alerting
Improve resource limits
Add readiness probes
Document runbook

Related Skills

k8s-troubleshoot - Detailed debugging
k8s-security - Security incidents

Maintainer

rohitg00 Core maintainer

Source details

Full Name: rohitg00/kubectl-mcp-server
Branch: main
Path in repo: kubernetes-skills/claude/k8s-incident
License: MIT License
Topics: ai mcp mcp-server llms genai devops npm deployment kubernetes kubernetes-cluster kubernetes-tools pypi

