kubernetes-troubleshooting

Comprehensive Kubernetes and OpenShift cluster health analysis and troubleshooting. Use this skill for: (1) proactive cluster health assessment and security analysis; (2) analyzing pod/container logs for errors or issues; (3) interpreting cluster events (kubectl get events); (4) debugging pod failures: CrashLoopBackOff, ImagePullBackOff, OOMKilled; (5) diagnosing networking issues: DNS, Service connectivity, Ingress/Route problems; (6) investigating storage issues: PVC pending, mount failures; (7) analyzing node problems: NotReady, resource pressure, taints; (8) troubleshooting OCP-specific issues: SCCs, Routes, Operators, Builds; (9) performance analysis and resource optimization; (10) security vulnerability assessment and RBAC validation.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/kubernetes-troubleshooting

Metadata

Additional technical details for this skill

author: cluster-skills
version: 1.0.0

SKILL.md

Kubernetes / OpenShift Troubleshooting Guide

Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and Popeye-style health scoring.

Current Versions & Tools (January 2026)

Platform	Version	Key Changes
Kubernetes	1.31.x	Sidecar containers (beta, enabled by default; GA in 1.33), Pod lifecycle improvements
OpenShift	4.17.x	OVN-Kubernetes default, enhanced web terminal
EKS	1.31	Pod Identity, Auto Mode, Karpenter 1.x
AKS	1.31	Cilium CNI, Workload Identity GA
GKE	1.31	Autopilot improvements, Gateway API GA

Troubleshooting Tools

Tool	Install	Purpose
k9s	`brew install k9s`	Terminal UI
stern	`brew install stern`	Multi-pod log tailing
kubectx/kubens	`brew install kubectx`	Context switching
kubectl-node-shell	`kubectl krew install node-shell`	Node access

Command Usage Convention

IMPORTANT: This skill uses kubectl as the primary command. When working with:

OpenShift/ARO clusters: Replace kubectl with oc
Standard Kubernetes (AKS, EKS, GKE): Use kubectl as shown

Cluster Health Scoring (Popeye-Style)

Health scores range from 0 to 100. Each issue reduces the score according to its severity:

BOOM (Critical): -50 points - Security vulnerabilities, resource exhaustion, failed services
WARN (Warning): -20 points - Configuration inefficiencies, best practice violations
INFO (Informational): -5 points - Non-critical issues, optimization opportunities
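
The scoring rule above can be sketched as a small shell function. This is a minimal sketch, assuming the penalties simply add up and the result is floored at 0; the function name `health_score` and the flooring behaviour are illustrative assumptions, not something taken from Popeye or from this skill's tooling.

```shell
# health_score BOOM_COUNT WARN_COUNT INFO_COUNT
# Hypothetical helper: starts at 100 and subtracts 50 / 20 / 5 points per issue.
health_score() {
  local boom=${1:-0} warn=${2:-0} info=${3:-0}
  local score=$(( 100 - boom * 50 - warn * 20 - info * 5 ))
  # Assumption: the score never drops below 0.
  if [ "$score" -lt 0 ]; then
    score=0
  fi
  echo "$score"
}
```

For example, one BOOM, one WARN and two INFO findings give `health_score 1 1 2`, which prints 20.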

Quick Cluster Health Assessment

bash

#!/bin/bash
# cluster-health-check.sh
echo "=== CLUSTER HEALTH ASSESSMENT ==="

# 1. Node Health (Critical)
echo "### NODE HEALTH ###"
kubectl get nodes -o wide | grep -E "NotReady|Unknown" && \
  echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"

# 2. Pod Issues (Critical)
echo -e "\n### POD HEALTH ###"
POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
if [ $POD_ISSUES -gt 0 ]; then
    echo "WARN: $POD_ISSUES pods not running"
    kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
else
    echo "✓ All pods running"
fi

# 3. Security (Critical)
echo -e "\n### SECURITY ASSESSMENT ###"
PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ $PRIVILEGED -gt 0 ] && echo "BOOM: $PRIVILEGED privileged containers!" || echo "✓ No privileged containers"

# 4. Resource Configuration (Warning)
echo -e "\n### RESOURCE CONFIGURATION ###"
NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ $NO_LIMITS -gt 0 ] && echo "WARN: $NO_LIMITS containers without limits" || echo "✓ All have limits"

# 5. Storage (Warning)
echo -e "\n### STORAGE HEALTH ###"
# PVCs do not support a status.phase field selector, so filter the STATUS column instead
PENDING_PVC=$(kubectl get pvc -A --no-headers 2>/dev/null | awk '$3 != "Bound"' | wc -l)
[ $PENDING_PVC -gt 0 ] && echo "WARN: $PENDING_PVC PVCs not bound" || echo "✓ All PVCs bound"

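# OpenShift: Cluster Operators
# Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
if command -v oc &> /dev/null && oc get clusteroperators &> /dev/null; then
    echo -e "\n### OPENSHIFT OPERATORS ###"
    # Unhealthy = not Available, or Degraded. A healthy row reads "True False False",
    # so matching "False.*False" would count healthy operators as degraded.
    DEGRADED=$(oc get clusteroperators --no-headers | awk '$3 != "True" || $5 == "True"' | wc -l)
    [ $DEGRADED -gt 0 ] && echo "BOOM: $DEGRADED operators degraded!" || echo "✓ All operators healthy"
fi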

Quick Diagnostic Commands

bash

# Pod status overview
kubectl get pods -n ${NAMESPACE} -o wide

# Recent events (sorted by time)
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'

# Pod details and events
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}

# Container logs (current)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}

# Container logs (previous crashed instance)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous

# Multi-pod log streaming
stern -n ${NAMESPACE} ${POD_PREFIX}
stern -A -l app=${APP_NAME} --since 1h

# Node status
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}

# Resource usage
kubectl top pods -n ${NAMESPACE}
kubectl top nodes

Pod Status Interpretation

Pod Phase States

Phase	Meaning	Action
`Pending`	Not scheduled or pulling images	Check events, node resources, PVC status
`Running`	At least one container running	Check container statuses if issues
`Succeeded`	All containers completed successfully	Normal for Jobs
`Failed`	All containers terminated, at least one failed	Check logs, exit codes
`Unknown`	Cannot determine state	Node communication issue

Container Waiting States

Reason	Cause	Resolution
`ContainerCreating`	Setting up container	Check events, volume mounts
`ImagePullBackOff`	Cannot pull image	Verify image name, registry access, credentials
`ErrImagePull`	Image pull failed	Check image exists, network, ImagePullSecrets
`CreateContainerConfigError`	Config error	Check ConfigMaps, Secrets exist
`CrashLoopBackOff`	Container repeatedly crashing	Check `logs --previous`, fix application

Container Exit Codes

Exit Code	Signal	Cause	Resolution
0	-	Normal exit	Expected for Jobs
1	-	Application error	Check logs for stack trace
126	-	Command not executable	Fix permissions
127	-	Command not found	Fix command path
137	SIGKILL	OOM or forced termination	Increase memory limit
143	SIGTERM	Graceful shutdown	Normal during updates
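
Exit codes above 128 in the table follow the shell convention of 128 plus the signal number, which is why 137 corresponds to SIGKILL (9) and 143 to SIGTERM (15). A minimal sketch of that arithmetic, with the hypothetical helper name `exit_code_signal`:

```shell
# exit_code_signal CODE
# Prints the signal number encoded in an exit code above 128, or "-" when the
# code is an ordinary application exit status.
exit_code_signal() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    echo "-"
  fi
}
```

The exit code of the last terminated container can be read with `kubectl get pod ${POD_NAME} -n ${NAMESPACE} -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'` and passed to the function.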

Event Analysis

Critical Events to Monitor

Scheduling Events

Event	Meaning	Resolution
`FailedScheduling`	Cannot place pod	Check node resources, taints, affinity
`Unschedulable`	No suitable node	Add nodes, adjust requirements

FailedScheduling Messages:

"Insufficient cpu"           → Reduce requests or add capacity
"Insufficient memory"        → Reduce requests or add capacity
"node(s) had taint"          → Add toleration or remove taint
"node(s) didn't match selector" → Fix nodeSelector/affinity
"persistentvolumeclaim not found" → Create PVC or fix name

Image Events

Event	Meaning	Resolution
`BackOff`	Repeated pull failures	Check image name, registry, auth
`ErrImageNeverPull`	Image not local	Change imagePullPolicy or pre-pull

ImagePullBackOff Diagnosis:

bash

# Check image name
kubectl get pod ${POD} -o jsonpath='{.spec.containers[*].image}'

# Verify ImagePullSecrets
kubectl get pod ${POD} -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret ${SECRET} -n ${NAMESPACE}

Volume Events

Event	Meaning	Resolution
`FailedMount`	Cannot mount volume	Check PVC, storage class
`FailedAttachVolume`	Cannot attach	Check cloud provider, volume exists

PVC Pending Diagnosis:

bash

kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE}
kubectl get storageclass
kubectl get pv

Log Analysis Patterns

Common Error Patterns

bash

# Search for errors
kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)"

# Patterns to look for in the output (not commands):
# Java OOM: "java.lang.OutOfMemoryError" → increase memory, tune JVM heap
# Connection refused: "ECONNREFUSED", "Connection refused" → dependency not available
# DNS failure: "ENOTFOUND", "getaddrinfo" → DNS resolution failed, check service name
# Permission denied: "Permission denied" → check securityContext, runAsUser, fsGroup
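
To see at a glance which of these patterns dominates a log, the matches can be counted per category. The following is a minimal sketch; the function name `log_error_summary` and the category labels are illustrative assumptions, and the patterns are only the ones listed above.

```shell
# log_error_summary: reads log text on stdin and prints "category count"
# for each known error pattern (count = number of matching lines).
log_error_summary() {
  local logs name pattern
  logs=$(cat)
  # Each line of the table below is "category:extended-regex".
  while IFS=: read -r name pattern; do
    printf '%s %s\n' "$name" "$(printf '%s\n' "$logs" | grep -cE "$pattern")"
  done <<'EOF'
oom:java\.lang\.OutOfMemoryError
connection:ECONNREFUSED|Connection refused
dns:ENOTFOUND|getaddrinfo
permission:Permission denied
EOF
}
```

Typical usage would be `kubectl logs ${POD} -n ${NS} | log_error_summary`.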

Memory Issues (OOMKilled)

Last State: Terminated
Reason: OOMKilled
Exit Code: 137

→ Solutions:
1. Increase memory limit
2. Profile application memory usage
3. For JVM: Set -Xmx < container limit (leave ~25% headroom)
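
The headroom rule for the JVM can be turned into a quick calculation. This is a sketch only: the helper name `jvm_max_heap_mib` is hypothetical, and the 25% default mirrors the rough figure above rather than a fixed recommendation.

```shell
# jvm_max_heap_mib LIMIT_MIB [HEADROOM_PCT]
# Prints a heap size in MiB that leaves HEADROOM_PCT (default 25) of the
# container memory limit free for non-heap memory. Integer arithmetic, rounds down.
jvm_max_heap_mib() {
  local limit_mib=$1 headroom_pct=${2:-25}
  echo $(( limit_mib * (100 - headroom_pct) / 100 ))
}
```

With a 1024 MiB limit, `-Xmx$(jvm_max_heap_mib 1024)m` expands to `-Xmx768m`. Container-aware JVMs also accept `-XX:MaxRAMPercentage=75.0`, which derives the heap size from the container limit without hard-coding a value.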

Node Troubleshooting

Node Conditions

Condition	Status	Meaning
`Ready`	True	Node healthy
`Ready`	False	Kubelet not healthy
`Ready`	Unknown	No heartbeat
`MemoryPressure`	True	Low memory
`DiskPressure`	True	Low disk space
`PIDPressure`	True	Too many processes

Node NotReady Diagnosis

bash

kubectl describe node ${NODE_NAME}

# On the node (SSH or debug)
systemctl status kubelet
journalctl -u kubelet -f

# Check resources
df -h
free -m
top

Networking Troubleshooting

DNS Issues

bash

# Test DNS resolution
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \
  nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Service Connectivity

bash

# Verify service and endpoints
kubectl get svc ${SERVICE} -n ${NS}
kubectl get endpoints ${SERVICE} -n ${NS}

# Test from debug pod
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT}

Ingress/Route Issues

bash

# Check Ingress
kubectl describe ingress ${INGRESS} -n ${NS}

# Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# OpenShift Route
oc describe route ${ROUTE} -n ${NS}
oc get pods -n openshift-ingress

OpenShift-Specific Troubleshooting

Cluster Operators

bash

# Check overall health
oc get clusteroperators

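# Investigate degraded operator
oc describe clusteroperator ${OPERATOR}
# Namespace and label differ per operator; adjust both as needed
# (list candidate namespaces with: oc get ns | grep ${OPERATOR})
oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator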

Security Context Constraints (SCC)

bash

# List SCCs
oc get scc

# Check which SCC a pod is using
oc get pod ${POD} -n ${NS} -o yaml | grep scc

# Common error fix
# "unable to validate against any security context constraint"
oc adm policy add-scc-to-user ${SCC} -z ${SERVICE_ACCOUNT} -n ${NS}

Build Failures

bash

# Check build status
oc get builds -n ${NS}
oc describe build ${BUILD} -n ${NS}
oc logs build/${BUILD} -n ${NS}

Cloud Provider Troubleshooting

EKS (AWS)

bash

aws eks describe-cluster --name ${CLUSTER} --query 'cluster.status'
aws eks describe-addon --cluster-name ${CLUSTER} --addon-name vpc-cni
eksctl get nodegroup --cluster ${CLUSTER}

AKS (Azure)

bash

az aks show --resource-group ${RG} --name ${CLUSTER} --query provisioningState
az aks check-network outbound --resource-group ${RG} --name ${CLUSTER}

GKE (Google Cloud)

bash

gcloud container clusters describe ${CLUSTER} --region ${REGION} --format='value(status)'
gcloud container operations list --filter="targetLink:${CLUSTER}" --limit=10

Diagnostic Decision Tree

Pod Not Starting

Pod Pending and not yet scheduled (no node assigned)?
├── Yes → Check Scheduling
│   ├── "Insufficient cpu/memory" → Add nodes or reduce requests
│   ├── "node(s) had taint" → Add toleration
│   ├── "PVC not found" → Create PVC
│   └── No events → Check API server
│
└── No → Check Container Status
    ├── ImagePullBackOff → Fix image name/auth
    ├── CrashLoopBackOff → Check logs --previous
    ├── CreateContainerConfigError → Fix ConfigMap/Secret
    └── Running but not ready → Check readiness probe
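
The tree can be sketched as a lookup function. Because a pod stuck in ImagePullBackOff or CreateContainerConfigError still reports phase Pending, this sketch branches on whether the pod has been scheduled rather than on the phase alone. The function name `pod_next_step` and its yes/no argument are illustrative assumptions.

```shell
# pod_next_step SCHEDULED [WAITING_REASON]
# SCHEDULED is "yes" when the pod has a node assigned, anything else otherwise.
# WAITING_REASON is the container's waiting reason, or empty if it is running.
pod_next_step() {
  local scheduled=$1 reason=${2:-}
  if [ "$scheduled" != "yes" ]; then
    echo "Check scheduling events: resources, taints, PVCs"
    return 0
  fi
  case "$reason" in
    ImagePullBackOff|ErrImagePull) echo "Fix image name or registry credentials" ;;
    CrashLoopBackOff)              echo "Check logs --previous" ;;
    CreateContainerConfigError)    echo "Fix ConfigMap or Secret reference" ;;
    "")                            echo "Check readiness probe" ;;
    *)                             echo "Inspect pod events for: $reason" ;;
  esac
}
```

The inputs can be read with `kubectl get pod ${POD} -n ${NS} -o jsonpath='{.spec.nodeName}'` (non-empty means scheduled) and `-o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'`.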

Application Not Responding

Can reach Service?
├── No → Check Service
│   ├── No endpoints → Fix selector labels
│   ├── Wrong port → Fix targetPort
│   └── NetworkPolicy blocking → Adjust policy
│
└── Yes → Check Pod
    ├── Probe failing → Fix probe or application
    ├── High latency → Check resources, dependencies
    └── Errors in logs → Fix application

Performance Analysis

Resource Optimization

bash

# Compare usage vs requests
kubectl top pods -n ${NS}

# Quote the column spec so the shell does not try to glob-expand [*]
kubectl get pods -n ${NS} -o custom-columns=\
'NAME:.metadata.name,'\
'CPU_REQ:.spec.containers[*].resources.requests.cpu,'\
'MEM_REQ:.spec.containers[*].resources.requests.memory'

# Find pods without limits
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.spec.containers[].resources.limits == null) |
   "\(.metadata.namespace)/\(.metadata.name)"'

Right-Sizing Recommendations

Symptom	Indication	Action
CPU throttling	CPU limit too low	Increase CPU limit
OOMKilled	Memory limit too low	Increase memory limit
Low utilization	Over-provisioned	Reduce requests
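
Whether a workload is over-provisioned comes down to comparing observed usage with the request. A minimal sketch of that comparison for CPU follows; the helper names `to_millicores` and `utilization_pct` are hypothetical, and fractional values such as `0.5` are deliberately not handled.

```shell
# to_millicores VALUE: converts "250m" to 250 and whole cores such as "2" to 2000.
# Assumption: decimal core values like "0.5" are out of scope for this sketch.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# utilization_pct USAGE_M REQUEST_M: integer percentage of the request in use,
# or "n/a" when no request is set.
utilization_pct() {
  local usage=$1 request=$2
  if [ "$request" -eq 0 ]; then
    echo "n/a"
  else
    echo $(( usage * 100 / request ))
  fi
}
```

A pod that `kubectl top` reports at 50m against a 500m request gives `utilization_pct 50 500`, which prints 10; sustained values that low point to the "Reduce requests" row above.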

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/kubernetes-troubleshooting
License: MIT License
