Agent skills
kubernetes-operations

Agent skill

kubernetes-operations

Kubernetes and OpenShift cluster operations, maintenance, and lifecycle management. Use this skill when: (1) Performing cluster upgrades (K8s, OCP, EKS, GKE, AKS) (2) Backup and disaster recovery (etcd, Velero, cluster state) (3) Node management: drain, cordon, scaling, replacement (4) Capacity planning and cluster scaling (5) Certificate rotation and management (6) etcd maintenance and health checks (7) Resource quota and limit range management (8) Namespace lifecycle management (9) Cluster migration and workload portability (10) Monitoring and alerting configuration (11) Log aggregation setup (12) Cost optimization and resource rightsizing

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/kubernetes-operations-kcns008-cluster-skills

Metadata

Additional technical details for this skill

author: cluster-skills
version: 1.0.0

SKILL.md

Kubernetes / OpenShift Cluster Operations

Day-2 operations, maintenance, and lifecycle management for production clusters.

Current Versions & Documentation (January 2026)

Platform	Current Version	Upgrade Path	Documentation
Kubernetes	1.31.x	1.30 → 1.31	https://kubernetes.io/docs/tasks/administer-cluster/
OpenShift	4.17.x	4.16 → 4.17	https://docs.openshift.com/container-platform/4.17/
EKS	1.31	Rolling updates	https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
AKS	1.31	Blue-green or rolling	https://learn.microsoft.com/azure/aks/upgrade-cluster
GKE	1.31	Surge upgrades	https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster

Key Tools & Versions

Tool	Version	Install	Purpose
kubeadm	1.31.x	Package manager	Cluster bootstrap
Velero	1.15.x	Helm/CLI	Backup & restore
kube-prometheus-stack	v67.x	Helm	Monitoring
VPA	1.3.x	kubectl apply	Vertical scaling
Cluster Autoscaler	1.31.x	Helm	Node autoscaling
Karpenter	1.1.x	Helm	AWS node provisioning

Command Usage Convention

IMPORTANT: This skill uses kubectl as the primary command. When working with:

OpenShift/ARO clusters: Replace kubectl with oc
Standard Kubernetes (AKS, EKS, GKE): Use kubectl as shown

Node Operations

Node Lifecycle

bash

# View node status
kubectl get nodes -o wide

# Detailed node info
kubectl describe node ${NODE_NAME}

# Check node resources
kubectl top nodes

# Node labels and taints
kubectl get nodes --show-labels
kubectl describe node ${NODE} | grep -A 5 Taints

Drain and Cordon

bash

# Cordon: Mark node unschedulable (no new pods)
kubectl cordon ${NODE_NAME}

# Drain: Evict pods safely
kubectl drain ${NODE_NAME} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s

# Force drain (use with caution)
kubectl drain ${NODE_NAME} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=30

# Uncordon: Allow scheduling again
kubectl uncordon ${NODE_NAME}

Cluster Autoscaler Configuration

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=${CLOUD_PROVIDER}
            - --nodes=${MIN}:${MAX}:${NODE_GROUP}
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
            - --skip-nodes-with-local-storage=false
            - --skip-nodes-with-system-pods=true
            - --balance-similar-node-groups=true

Backup and Recovery

etcd Backup

bash

# Backup etcd (run on control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

Velero Backup (v1.15.x)

bash

# Install Velero CLI
brew install velero

# Install Velero server with AWS provider
velero install \
  --provider aws \
  --bucket ${BUCKET_NAME} \
  --secret-file ./credentials-velero \
  --backup-location-config region=${REGION} \
  --snapshot-location-config region=${REGION} \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --use-node-agent

# Create backup
velero backup create ${BACKUP_NAME} \
  --include-namespaces ${NAMESPACES} \
  --ttl 720h \
  --default-volumes-to-fs-backup

# Create scheduled backup
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces ${NAMESPACES} \
  --ttl 168h

# Restore from backup
velero restore create --from-backup ${BACKUP_NAME}

Velero Backup Manifest

yaml

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${BACKUP_NAME}
  namespace: velero
spec:
  includedNamespaces:
    - ${NAMESPACE_1}
    - ${NAMESPACE_2}
  excludedResources:
    - events
    - events.events.k8s.io
  storageLocation: default
  volumeSnapshotLocations:
    - default
  ttl: 720h0m0s
  snapshotVolumes: true
  hooks:
    resources:
      - name: backup-hook
        includedNamespaces:
          - ${NAMESPACE}
        labelSelector:
          matchLabels:
            app: database
        pre:
          - exec:
              container: database
              command:
                - /bin/sh
                - -c
                - "pg_dump -U postgres > /backup/pre-backup.sql"
              onError: Fail
              timeout: 120s

Cluster Upgrades

Pre-Upgrade Checklist

bash

#!/bin/bash
# pre-upgrade-check.sh

echo "=== Cluster Version ==="
kubectl version --short

echo -e "\n=== Node Status ==="
kubectl get nodes

echo -e "\n=== Pods Not Running ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

echo -e "\n=== PDBs That May Block Drain ==="
kubectl get pdb -A

echo -e "\n=== Pending PVCs ==="
kubectl get pvc -A --field-selector=status.phase=Pending

echo -e "\n=== Deprecated APIs in Use ==="
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

AKS Upgrade (Azure)

bash

# Check current version and available upgrades
az aks get-versions --location ${LOCATION} -o table
az aks get-upgrades --resource-group ${RG} --name ${CLUSTER} -o table

# Upgrade control plane and node pools
az aks upgrade --resource-group ${RG} --name ${CLUSTER} \
  --kubernetes-version 1.31.0

# Use blue-green upgrade with max surge
az aks nodepool upgrade --resource-group ${RG} --cluster-name ${CLUSTER} \
  --name ${NODEPOOL} --kubernetes-version 1.31.0 \
  --max-surge 33%

# Enable auto-upgrade channel
az aks update --resource-group ${RG} --name ${CLUSTER} \
  --auto-upgrade-channel stable

EKS Upgrade

bash

# Update control plane
aws eks update-cluster-version \
  --name ${CLUSTER_NAME} \
  --kubernetes-version 1.31

# Wait for completion
aws eks wait cluster-active --name ${CLUSTER_NAME}

# Update EKS add-ons
for addon in vpc-cni coredns kube-proxy eks-pod-identity-agent; do
  aws eks update-addon --cluster-name ${CLUSTER_NAME} \
    --addon-name $addon \
    --resolve-conflicts PRESERVE
done

# Update managed node groups
aws eks update-nodegroup-version \
  --cluster-name ${CLUSTER_NAME} \
  --nodegroup-name ${NODEGROUP_NAME}

GKE Upgrade

bash

# Check available versions
gcloud container get-server-config --region ${REGION}

# Upgrade control plane
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
  --master --cluster-version 1.31

# Upgrade node pools
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
  --node-pool ${POOL} \
  --cluster-version 1.31

# Enable release channel
gcloud container clusters update ${CLUSTER} --region ${REGION} \
  --release-channel regular

OpenShift Upgrade

bash

# Check available updates
oc adm upgrade

# View current version and channel
oc get clusterversion
oc get clusterversion version -o jsonpath='{.spec.channel}'

# Change channel
oc adm upgrade channel stable-4.17

# Start upgrade
oc adm upgrade --to-latest
# OR upgrade to specific version
oc adm upgrade --to=4.17.5

# Monitor upgrade progress
watch -n 10 'oc get clusterversion && oc get clusteroperators'

Resource Management

Resource Quotas

yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"
    requests.storage: 100Gi

Limit Ranges

yaml

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ${NAMESPACE}
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi

Check Resource Usage

bash

# Namespace resource usage vs quota
kubectl describe quota -n ${NAMESPACE}

# Pod resource usage
kubectl top pods -n ${NAMESPACE} --sort-by=memory
kubectl top pods -n ${NAMESPACE} --sort-by=cpu

# Node resource allocation
kubectl describe nodes | grep -A 5 "Allocated resources"

Certificate Management

Check Certificate Expiry

bash

# kubeadm certificates
kubeadm certs check-expiration

# Manual check
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Check all certs
for cert in /etc/kubernetes/pki/*.crt; do
  echo "=== $cert ==="
  openssl x509 -in $cert -noout -dates
done

Rotate Certificates

bash

# Renew all certificates (kubeadm)
kubeadm certs renew all

# Restart control plane components
crictl pods --name kube-apiserver -q | xargs crictl stopp
crictl pods --name kube-controller-manager -q | xargs crictl stopp
crictl pods --name kube-scheduler -q | xargs crictl stopp

Monitoring Setup

Prometheus Stack (kube-prometheus-stack v67.x)

bash

# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.replicas=2 \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set alertmanager.alertmanagerSpec.replicas=3 \
  --set grafana.persistence.enabled=true

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Custom ServiceMonitor

yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${APP_NAME}
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - ${NAMESPACE}
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Cost Optimization

VerticalPodAutoscaler

yaml

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ${APP_NAME}-vpa
  namespace: ${NAMESPACE}
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi

Namespace Lifecycle

Namespace Template

yaml

apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    app.kubernetes.io/managed-by: cluster-skills
    environment: ${ENVIRONMENT}
    team: ${TEAM}
  annotations:
    owner: ${OWNER_EMAIL}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Disaster Recovery

Full Cluster Recovery Checklist

Restore etcd - See etcd restore section

Verify Control Plane

bash

kubectl get nodes
kubectl get pods -n kube-system
kubectl cluster-info

Restore Workloads (Velero)

bash

velero restore create --from-backup ${BACKUP_NAME}

Verify Application Health
bash
```
kubectl get pods -A
kubectl get svc -A
```

Verify DNS and Networking

bash

kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/kubernetes-operations-kcns008-cluster-skills
License: MIT License

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

Metadata

SKILL.md

Kubernetes / OpenShift Cluster Operations

Current Versions & Documentation (January 2026)

Key Tools & Versions

Command Usage Convention

Node Operations

Node Lifecycle

Drain and Cordon

Cluster Autoscaler Configuration

Backup and Recovery

etcd Backup

Velero Backup (v1.15.x)

Velero Backup Manifest

Cluster Upgrades

Pre-Upgrade Checklist

AKS Upgrade (Azure)

EKS Upgrade

GKE Upgrade

OpenShift Upgrade

Resource Management

Resource Quotas

Limit Ranges

Check Resource Usage

Certificate Management

Check Certificate Expiry

Rotate Certificates

Monitoring Setup

Prometheus Stack (kube-prometheus-stack v67.x)

Custom ServiceMonitor

Cost Optimization

VerticalPodAutoscaler

Namespace Lifecycle

Namespace Template

Disaster Recovery

Full Cluster Recovery Checklist

Recommended Agent Skills

agent-ops-spec

agent-ops-state

agent-ops-spec

agent-ops-testing

agent-ops-testing

agent-ops-state