Agent skill

kubernetes-debugger

Kubernetes debugging and troubleshooting best practices using MCP kubernetes tools. Use when: (1) Pods are failing, pending, or in CrashLoopBackOff/ImagePullBackOff states, (2) Services are unreachable or DNS resolution fails, (3) Deployments aren't rolling out, (4) Nodes are unhealthy or unschedulable, (5) Resource issues (OOM, CPU throttling), (6) Any "why isn't my Kubernetes workload working?" questions. Provides systematic debugging workflows using kubectl_get, kubectl_describe, kubectl_logs, exec_in_pod, and other MCP kubernetes tools.

Install this agent skill to your Project

npx add-skill https://github.com/rodrigodelmonte/k8s-debugging-plugin/tree/main/skills/kubernetes-debugger

SKILL.md

Kubernetes Debugger

Systematic debugging workflows for Kubernetes issues using MCP kubernetes tools.

Prerequisites

Install Kubernetes MCP Server

claude mcp add kubernetes --scope user -- npx mcp-server-kubernetes

Requirements:

Access to a Kubernetes cluster configured for kubectl (minikube, Rancher Desktop, GKE, EKS, AKS, etc.)
kubeconfig at ~/.kube/config (default) or KUBECONFIG env var set
Helm v3 in PATH (optional, for Helm operations)
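
As a quick pre-flight check for the kubeconfig requirement above, the lookup order can be sketched in Python. This is a minimal sketch, not the MCP server's own logic: the function name `kubeconfig_paths` is hypothetical, and treating `KUBECONFIG` as a separator-delimited list follows kubectl's behaviour, which the MCP server is only assumed to share.

```python
import os
from pathlib import Path


def kubeconfig_paths(env=None, home=None):
    """Return the kubeconfig files that kubectl-style tooling would consult.

    env and home are injectable so the logic can be checked without
    touching the real environment (hypothetical helper, not an MCP API).
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    raw = env.get("KUBECONFIG", "")
    # kubectl accepts several files separated by the OS path separator.
    paths = [Path(p) for p in raw.split(os.pathsep) if p]
    # Fall back to the default location when KUBECONFIG is unset or empty.
    return paths or [home / ".kube" / "config"]


if __name__ == "__main__":
    for path in kubeconfig_paths():
        print(path, "exists" if path.exists() else "MISSING")
```

Running the script before registering the MCP server shows which file will be picked up and whether it is present.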

Alternative installation methods:

# Global install
npm install -g mcp-server-kubernetes

# Or run directly with npx (no install)
npx mcp-server-kubernetes

Verify installation:

claude mcp list  # Should show 'kubernetes' server

Quick Reference: MCP Tools

Tool	Use For
`kubectl_get`	List resources, check status, find resource names
`kubectl_describe`	Detailed info, events, conditions
`kubectl_logs`	Container stdout/stderr, application errors
`exec_in_pod`	Run commands inside containers
`kubectl_rollout`	Deployment rollout status/history
`node_management`	Cordon/drain/uncordon nodes
`kubectl_generic`	Other kubectl commands (for example, top)

Debugging Decision Tree

Issue reported
    │
    ├─ Pod not running? ──────────► See: Pod Debugging Workflow
    │
    ├─ Service unreachable? ──────► See: Service/Network Debugging
    │
    ├─ Deployment stuck? ─────────► See: Deployment Debugging
    │
    ├─ Node issues? ──────────────► See: Node Debugging
    │
    └─ Performance/Resources? ────► See: Resource Debugging

Pod Debugging Workflow

Step 1: Get Pod Status

kubectl_get(resourceType="pods", namespace="<ns>")

Common statuses and their meaning:

Pending: Scheduling issues (resources, node selector, affinity)
CrashLoopBackOff: Container crashing repeatedly
ImagePullBackOff/ErrImagePull: Cannot pull container image
Running but not ready: Readiness probe failing
Terminating: Stuck deletion (finalizers, unreachable node, long termination grace period)

Step 2: Check Events and Conditions

kubectl_describe(resourceType="pod", name="<pod>", namespace="<ns>")

Look for the following in the output:

Events section: Scheduling failures, image pull errors, probe failures
Conditions: PodScheduled, Initialized, ContainersReady, Ready
Container State: Waiting (reason), Running, Terminated (exit code)
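
The four conditions are usually read in order, since a failure early in the chain explains the later ones. A minimal sketch of that reading, assuming the conditions have been obtained as a list of dicts shaped like `.status.conditions` in the pod's JSON output; the helper name `first_failing_condition` is hypothetical:

```python
# Order in which the conditions are listed above.
CONDITION_ORDER = ["PodScheduled", "Initialized", "ContainersReady", "Ready"]


def first_failing_condition(conditions):
    """Return the earliest condition whose status is not "True", or None."""
    by_type = {c["type"]: c for c in conditions}
    for name in CONDITION_ORDER:
        cond = by_type.get(name)
        # A condition that is absent is skipped rather than treated as failing.
        if cond is not None and cond.get("status") != "True":
            return cond
    return None


if __name__ == "__main__":
    sample = [
        {"type": "PodScheduled", "status": "True"},
        {"type": "Initialized", "status": "True"},
        {"type": "ContainersReady", "status": "False", "reason": "ContainersNotReady"},
        {"type": "Ready", "status": "False", "reason": "ContainersNotReady"},
    ]
    print(first_failing_condition(sample))
```

In the sample, `ContainersReady` is reported rather than `Ready`, which points the investigation at the containers instead of the readiness gate.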

Step 3: Get Container Logs

kubectl_logs(resourceType="pod", name="<pod>", namespace="<ns>", container="<container>")

Options:

previous=true: Logs from crashed container
tail=100: Last N lines
since="1h": Logs from last hour

Step 4: Exec Into Container (if running)

exec_in_pod(name="<pod>", namespace="<ns>", command=["sh", "-c", "<cmd>"])

Useful commands:

["cat", "/etc/resolv.conf"] - Check DNS config
["env"] - Verify environment variables
["ls", "-la", "/app"] - Check mounted files
["nc", "-zv", "<host>", "<port>"] - Test connectivity

Common Pod Issues

CrashLoopBackOff

Get logs: kubectl_logs(previous=true) for crashed container
Check exit code in kubectl_describe output
Common causes:
- Exit code 1: Application error
- Exit code 137: SIGKILL (128 + 9), most often OOMKilled (check memory limits)
- Exit code 143: SIGTERM (graceful shutdown issue)
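
Exit codes above 128 follow the shell convention of 128 plus the signal number, which is where 137 and 143 come from. A small sketch of that decoding; `describe_exit_code` is a hypothetical helper, and the signal names come from the platform running the script (Linux assumed):

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {number}"
        return f"terminated by {name}"
    return "application exited with an error"


if __name__ == "__main__":
    for code in (1, 137, 143):
        print(code, "->", describe_exit_code(code))
```

A SIGKILL result does not by itself prove an out-of-memory kill; the container's termination reason in the describe output is what confirms it.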

ImagePullBackOff

Check image name/tag in describe output
Verify image exists in registry
Check imagePullSecrets if private registry
Look for "Failed to pull image" in events

Pending Pod

Check events for scheduling failure reason
Common causes:
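- Insufficient cpu/memory: Node capacity exhausted
- node(s) didn't match node selector (newer versions: "didn't match Pod's node affinity/selector"): Wrong labels
- PersistentVolumeClaim not bound: Storage issue
- node(s) had untolerated taint: Taints/tolerations mismatch (the "0/N nodes are available" prefix appears on every scheduling failure; the actual reason follows it)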
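
Scheduling events often pack several reasons into one message, so it can help to scan for each known fragment. A rough sketch follows; the substrings and the sample message are assumptions based on commonly seen scheduler wording, which differs between Kubernetes versions, and `scheduling_hints` is a hypothetical helper:

```python
# (substring, likely cause) pairs. The substrings are assumptions and
# may need adjusting for the Kubernetes version in use.
HINTS = [
    ("insufficient cpu", "node CPU capacity exhausted"),
    ("insufficient memory", "node memory capacity exhausted"),
    ("node selector", "pod nodeSelector/affinity does not match node labels"),
    ("affinity/selector", "pod nodeSelector/affinity does not match node labels"),
    ("taint", "taints/tolerations mismatch"),
    ("persistentvolumeclaim", "storage issue: PVC not bound"),
]


def scheduling_hints(message):
    """Return the likely causes mentioned in a FailedScheduling message."""
    lowered = message.lower()
    found = []
    for needle, hint in HINTS:
        if needle in lowered and hint not in found:
            found.append(hint)
    return found


if __name__ == "__main__":
    # Illustrative message, not captured from a real cluster.
    msg = ("0/3 nodes are available: 1 Insufficient cpu, "
           "2 node(s) had untolerated taint {dedicated: gpu}.")
    for hint in scheduling_hints(msg):
        print("-", hint)
```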

Service/Network Debugging

Step 1: Verify Service Exists

kubectl_get(resourceType="services", namespace="<ns>")
kubectl_describe(resourceType="service", name="<svc>", namespace="<ns>")

Step 2: Check Endpoints

kubectl_get(resourceType="endpoints", name="<svc>", namespace="<ns>")

No endpoints? Check:

Pod labels match service selector
Pods are Running and Ready
Target port matches container port
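
The first check, labels versus selector, is plain key-by-key equality, so it is easy to verify offline once both maps are copied out of the describe output. A minimal sketch with hypothetical helper names and illustrative label values:

```python
def selector_matches(selector, pod_labels):
    """True if every selector key/value pair is present on the pod.

    A Service without a selector selects no pods, so an empty selector
    is reported as a non-match.
    """
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


def mismatched_keys(selector, pod_labels):
    """Map each failing selector key to (expected, actual-or-None)."""
    return {
        key: (value, pod_labels.get(key))
        for key, value in selector.items()
        if pod_labels.get(key) != value
    }


if __name__ == "__main__":
    selector = {"app": "web", "tier": "frontend"}      # illustrative values
    labels = {"app": "web", "tier": "front-end", "version": "v2"}
    print(selector_matches(selector, labels))
    print(mismatched_keys(selector, labels))
```

Extra labels on the pod are fine; only keys named in the selector have to match exactly.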

Step 3: Test DNS Resolution

exec_in_pod(name="<debug-pod>", command=["nslookup", "<service>.<namespace>.svc.cluster.local"])

Step 4: Test Connectivity

exec_in_pod(name="<debug-pod>", command=["nc", "-zv", "<service>", "<port>"])
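
Some images ship without `nc`. Where a Python interpreter is available in the debug pod (an assumption about the image), the same TCP reachability test can be sketched with the standard library; `port_open` is a hypothetical helper:

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    import sys

    host, port = sys.argv[1], int(sys.argv[2])
    print(f"{host}:{port}", "reachable" if port_open(host, port) else "unreachable")
```

Because DNS failures are folded into the `False` result, Step 3 should still be run first to tell a name-resolution problem from a blocked or closed port.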

Deployment Debugging

Check Rollout Status

kubectl_rollout(subCommand="status", resourceType="deployment", name="<deploy>", namespace="<ns>")

View Rollout History

kubectl_rollout(subCommand="history", resourceType="deployment", name="<deploy>", namespace="<ns>")

Rollback if Needed

kubectl_rollout(subCommand="undo", resourceType="deployment", name="<deploy>", namespace="<ns>")

Common Issues

Progressing stuck: New pods failing (check ReplicaSet pods)
Available < desired: Pods not passing readiness probes
Surge/unavailable conflicts: Check deployment strategy
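
To reason about surge/unavailable conflicts, it helps to turn the strategy into concrete pod counts. The sketch below assumes the documented rounding for a RollingUpdate strategy (percentage `maxSurge` rounds up, percentage `maxUnavailable` rounds down) and mirrors the controller's understood fallback of allowing one unavailable pod when both resolve to zero; `rollout_bounds` is a hypothetical helper:

```python
def _resolve(value, replicas, round_up):
    """Resolve an int or "N%" string against the replica count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer ceiling or floor division avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Assumed controller behaviour: fall back to one unavailable pod
        # so the rollout can still make progress.
        unavailable = 1
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    print(rollout_bounds(10))        # default 25% / 25% strategy
    print(rollout_bounds(4, 1, 0))   # absolute values
```

If the minimum available count equals the replica count and the namespace quota leaves no room for surge pods, the rollout has nowhere to go, which is one way a deployment ends up stuck in Progressing.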

Node Debugging

Check Node Status

kubectl_get(resourceType="nodes")
kubectl_describe(resourceType="node", name="<node>")

Node Conditions to Check

Condition	Problem If
Ready	False or Unknown
MemoryPressure	True
DiskPressure	True
PIDPressure	True
NetworkUnavailable	True
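
The table can be applied mechanically to a node's conditions. A minimal sketch, assuming the conditions are available as a list of dicts shaped like `.status.conditions` in the node's JSON output; `unhealthy_conditions` is a hypothetical helper:

```python
# Conditions that indicate a problem when their status is "True".
PROBLEM_WHEN_TRUE = {
    "MemoryPressure",
    "DiskPressure",
    "PIDPressure",
    "NetworkUnavailable",
}


def unhealthy_conditions(conditions):
    """Return the condition types that indicate a problem, per the table."""
    problems = []
    for cond in conditions:
        ctype, status = cond["type"], cond["status"]
        if ctype == "Ready":
            # Ready is a problem when it is "False" or "Unknown".
            if status != "True":
                problems.append(ctype)
        elif ctype in PROBLEM_WHEN_TRUE and status == "True":
            problems.append(ctype)
    return problems


if __name__ == "__main__":
    sample = [
        {"type": "MemoryPressure", "status": "False"},
        {"type": "DiskPressure", "status": "True"},
        {"type": "PIDPressure", "status": "False"},
        {"type": "Ready", "status": "Unknown"},
    ]
    print(unhealthy_conditions(sample))
```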

Drain Node for Maintenance

node_management(operation="cordon", nodeName="<node>")  # Prevent new pods
node_management(operation="drain", nodeName="<node>", confirmDrain=true)  # Evict pods
# After maintenance:
node_management(operation="uncordon", nodeName="<node>")

Resource Debugging

Check Resource Usage

kubectl_generic(command="top", resourceType="pods", namespace="<ns>")
kubectl_generic(command="top", resourceType="nodes")

OOMKilled Detection

Run kubectl_describe on the pod and look for "OOMKilled" as the reason in the container's state or last state
Check memory limits vs actual usage
Solutions:
- Increase memory limits
- Fix memory leak in application
- Add memory requests for better scheduling
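
Comparing limits with actual usage means converting Kubernetes quantity strings to a common unit first. The sketch below handles only the common suffixes (Ki, Mi, Gi, Ti and k, M, G, T), not the full quantity grammar; `memory_bytes` and `usage_ratio` are hypothetical helpers and the sample values are illustrative:

```python
import re

_UNITS = {
    "": 1,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,   # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,      # decimal suffixes
}
_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?$")


def memory_bytes(quantity):
    """Convert a quantity such as "512Mi" or "1G" to bytes."""
    match = _PATTERN.match(quantity.strip())
    if not match:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(float(number) * _UNITS[suffix or ""])


def usage_ratio(usage, limit):
    """Fraction of the memory limit currently in use."""
    return memory_bytes(usage) / memory_bytes(limit)


if __name__ == "__main__":
    # Illustrative numbers: usage as reported by top, limit from the pod spec.
    print(f"{usage_ratio('480Mi', '512Mi'):.0%} of limit in use")
```

A container that sits close to its limit under normal load has little headroom for spikes, which makes raising the limit or finding the leak the next step.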

CPU Throttling

Check if CPU limits are too restrictive
Consider removing CPU limits (keep requests)
Use kubectl_generic(command="top", resourceType="pods", namespace="<ns>") to see actual usage
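
Usage figures alone do not show throttling. One way to measure it, offered here as an assumption and not as part of the skill, is to read the container's cgroup `cpu.stat` file (for example with `exec_in_pod` and `cat`; the path depends on the cgroup version and runtime) and compare throttled periods with total periods. `throttle_ratio` is a hypothetical helper:

```python
def throttle_ratio(cpu_stat_text):
    """Fraction of CFS enforcement periods in which the cgroup was throttled.

    Expects the text of a cgroup cpu.stat file containing the
    nr_periods and nr_throttled counters.
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No CPU limit set, or no periods elapsed yet.
        return 0.0
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    # Illustrative contents, not captured from a real container.
    sample = "nr_periods 1000\nnr_throttled 250\nthrottled_usec 4200000\n"
    print(f"throttled in {throttle_ratio(sample):.0%} of periods")
```

The counters are cumulative since the cgroup was created, so sampling twice and comparing the difference gives a better picture of current throttling than a single reading.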

Reference Files

references/pod-states.md: Complete pod state reference
references/common-errors.md: Error messages and solutions
references/network-debug.md: Network troubleshooting details

Maintainer

rodrigodelmonte Core maintainer

Source details

Full Name: rodrigodelmonte/k8s-debugging-plugin
Branch: main
Path in repo: skills/kubernetes-debugger
Topics: claude-code claude debbuging kubernetes troubleshooting

