Agent skill

kubernetes-operations

Kubernetes and OpenShift cluster operations, maintenance, and lifecycle management. Use this skill when: (1) Performing cluster upgrades (K8s, OCP, EKS, GKE, AKS) (2) Backup and disaster recovery (etcd, Velero, cluster state) (3) Node management: drain, cordon, scaling, replacement (4) Capacity planning and cluster scaling (5) Certificate rotation and management (6) etcd maintenance and health checks (7) Resource quota and limit range management (8) Namespace lifecycle management (9) Cluster migration and workload portability (10) Monitoring and alerting configuration (11) Log aggregation setup (12) Cost optimization and resource rightsizing

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/kubernetes-operations-kcns008-cluster-skills

Metadata

Additional technical details for this skill

author
cluster-skills
version
1.0.0

SKILL.md

Kubernetes / OpenShift Cluster Operations

Day-2 operations, maintenance, and lifecycle management for production clusters.

Current Versions & Documentation (January 2026)

Platform Current Version Upgrade Path Documentation
Kubernetes 1.31.x 1.30 → 1.31 https://kubernetes.io/docs/tasks/administer-cluster/
OpenShift 4.17.x 4.16 → 4.17 https://docs.openshift.com/container-platform/4.17/
EKS 1.31 Rolling updates https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
AKS 1.31 Blue-green or rolling https://learn.microsoft.com/azure/aks/upgrade-cluster
GKE 1.31 Surge upgrades https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster

Key Tools & Versions

Tool Version Install Purpose
kubeadm 1.31.x Package manager Cluster bootstrap
Velero 1.15.x Helm/CLI Backup & restore
kube-prometheus-stack v67.x Helm Monitoring
VPA 1.3.x kubectl apply Vertical scaling
Cluster Autoscaler 1.31.x Helm Node autoscaling
Karpenter 1.1.x Helm AWS node provisioning

Command Usage Convention

IMPORTANT: This skill uses kubectl as the primary command. When working with:

  • OpenShift/ARO clusters: Replace kubectl with oc
  • Standard Kubernetes (AKS, EKS, GKE): Use kubectl as shown

Node Operations

Node Lifecycle

bash
# View node status
kubectl get nodes -o wide

# Detailed node info
kubectl describe node ${NODE_NAME}

# Check node resources
kubectl top nodes

# Node labels and taints
kubectl get nodes --show-labels
kubectl describe node ${NODE} | grep -A 5 Taints

Drain and Cordon

bash
# Cordon: Mark node unschedulable (no new pods)
kubectl cordon ${NODE_NAME}

# Drain: Evict pods safely
kubectl drain ${NODE_NAME} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s

# Force drain (use with caution)
kubectl drain ${NODE_NAME} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=30

# Uncordon: Allow scheduling again
kubectl uncordon ${NODE_NAME}

Cluster Autoscaler Configuration

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=${CLOUD_PROVIDER}
            - --nodes=${MIN}:${MAX}:${NODE_GROUP}
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
            - --skip-nodes-with-local-storage=false
            - --skip-nodes-with-system-pods=true
            - --balance-similar-node-groups=true

Backup and Recovery

etcd Backup

bash
# Backup etcd (run on control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

Velero Backup (v1.15.x)

bash
# Install Velero CLI
brew install velero

# Install Velero server with AWS provider
velero install \
  --provider aws \
  --bucket ${BUCKET_NAME} \
  --secret-file ./credentials-velero \
  --backup-location-config region=${REGION} \
  --snapshot-location-config region=${REGION} \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --use-node-agent

# Create backup
velero backup create ${BACKUP_NAME} \
  --include-namespaces ${NAMESPACES} \
  --ttl 720h \
  --default-volumes-to-fs-backup

# Create scheduled backup
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces ${NAMESPACES} \
  --ttl 168h

# Restore from backup
velero restore create --from-backup ${BACKUP_NAME}

Velero Backup Manifest

yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${BACKUP_NAME}
  namespace: velero
spec:
  includedNamespaces:
    - ${NAMESPACE_1}
    - ${NAMESPACE_2}
  excludedResources:
    - events
    - events.events.k8s.io
  storageLocation: default
  volumeSnapshotLocations:
    - default
  ttl: 720h0m0s
  snapshotVolumes: true
  hooks:
    resources:
      - name: backup-hook
        includedNamespaces:
          - ${NAMESPACE}
        labelSelector:
          matchLabels:
            app: database
        pre:
          - exec:
              container: database
              command:
                - /bin/sh
                - -c
                - "pg_dump -U postgres > /backup/pre-backup.sql"
              onError: Fail
              timeout: 120s

Cluster Upgrades

Pre-Upgrade Checklist

bash
#!/bin/bash
# pre-upgrade-check.sh

echo "=== Cluster Version ==="
kubectl version --short

echo -e "\n=== Node Status ==="
kubectl get nodes

echo -e "\n=== Pods Not Running ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

echo -e "\n=== PDBs That May Block Drain ==="
kubectl get pdb -A

echo -e "\n=== Pending PVCs ==="
kubectl get pvc -A --field-selector=status.phase=Pending

echo -e "\n=== Deprecated APIs in Use ==="
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

AKS Upgrade (Azure)

bash
# Check current version and available upgrades
az aks get-versions --location ${LOCATION} -o table
az aks get-upgrades --resource-group ${RG} --name ${CLUSTER} -o table

# Upgrade control plane and node pools
az aks upgrade --resource-group ${RG} --name ${CLUSTER} \
  --kubernetes-version 1.31.0

# Use blue-green upgrade with max surge
az aks nodepool upgrade --resource-group ${RG} --cluster-name ${CLUSTER} \
  --name ${NODEPOOL} --kubernetes-version 1.31.0 \
  --max-surge 33%

# Enable auto-upgrade channel
az aks update --resource-group ${RG} --name ${CLUSTER} \
  --auto-upgrade-channel stable

EKS Upgrade

bash
# Update control plane
aws eks update-cluster-version \
  --name ${CLUSTER_NAME} \
  --kubernetes-version 1.31

# Wait for completion
aws eks wait cluster-active --name ${CLUSTER_NAME}

# Update EKS add-ons
for addon in vpc-cni coredns kube-proxy eks-pod-identity-agent; do
  aws eks update-addon --cluster-name ${CLUSTER_NAME} \
    --addon-name $addon \
    --resolve-conflicts PRESERVE
done

# Update managed node groups
aws eks update-nodegroup-version \
  --cluster-name ${CLUSTER_NAME} \
  --nodegroup-name ${NODEGROUP_NAME}

GKE Upgrade

bash
# Check available versions
gcloud container get-server-config --region ${REGION}

# Upgrade control plane
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
  --master --cluster-version 1.31

# Upgrade node pools
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
  --node-pool ${POOL} \
  --cluster-version 1.31

# Enable release channel
gcloud container clusters update ${CLUSTER} --region ${REGION} \
  --release-channel regular

OpenShift Upgrade

bash
# Check available updates
oc adm upgrade

# View current version and channel
oc get clusterversion
oc get clusterversion version -o jsonpath='{.spec.channel}'

# Change channel
oc adm upgrade channel stable-4.17

# Start upgrade
oc adm upgrade --to-latest
# OR upgrade to specific version
oc adm upgrade --to=4.17.5

# Monitor upgrade progress
watch -n 10 'oc get clusterversion && oc get clusteroperators'

Resource Management

Resource Quotas

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"
    requests.storage: 100Gi

Limit Ranges

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ${NAMESPACE}
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi

Check Resource Usage

bash
# Namespace resource usage vs quota
kubectl describe quota -n ${NAMESPACE}

# Pod resource usage
kubectl top pods -n ${NAMESPACE} --sort-by=memory
kubectl top pods -n ${NAMESPACE} --sort-by=cpu

# Node resource allocation
kubectl describe nodes | grep -A 5 "Allocated resources"

Certificate Management

Check Certificate Expiry

bash
# kubeadm certificates
kubeadm certs check-expiration

# Manual check
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Check all certs
for cert in /etc/kubernetes/pki/*.crt; do
  echo "=== $cert ==="
  openssl x509 -in $cert -noout -dates
done

Rotate Certificates

bash
# Renew all certificates (kubeadm)
kubeadm certs renew all

# Restart control plane components
crictl pods --name kube-apiserver -q | xargs crictl stopp
crictl pods --name kube-controller-manager -q | xargs crictl stopp
crictl pods --name kube-scheduler -q | xargs crictl stopp

Monitoring Setup

Prometheus Stack (kube-prometheus-stack v67.x)

bash
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.replicas=2 \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set alertmanager.alertmanagerSpec.replicas=3 \
  --set grafana.persistence.enabled=true

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Custom ServiceMonitor

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${APP_NAME}
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - ${NAMESPACE}
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Cost Optimization

VerticalPodAutoscaler

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ${APP_NAME}-vpa
  namespace: ${NAMESPACE}
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi

Namespace Lifecycle

Namespace Template

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    app.kubernetes.io/managed-by: cluster-skills
    environment: ${ENVIRONMENT}
    team: ${TEAM}
  annotations:
    owner: ${OWNER_EMAIL}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Disaster Recovery

Full Cluster Recovery Checklist

  1. Restore etcd - See etcd restore section
  2. Verify Control Plane
    bash
    kubectl get nodes
    kubectl get pods -n kube-system
    kubectl cluster-info
    
  3. Restore Workloads (Velero)
    bash
    velero restore create --from-backup ${BACKUP_NAME}
    
  4. Verify Application Health
    bash
    kubectl get pods -A
    kubectl get svc -A
    
  5. Verify DNS and Networking
    bash
    kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes
    

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results