Install this agent skill to your project:

```bash
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/kubernetes-ai-expert
```

SKILL.md

Kubernetes AI Expert

Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.

GPU Workload Scheduling

NVIDIA GPU Operator

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator
```

GPU Resource Requests

| Resource / Setting | Description |
|---|---|
| `nvidia.com/gpu: N` | Request N full GPUs |
| `nvidia.com/mig-3g.40gb: 1` | Request one MIG slice |
| Node selector on `nvidia.com/gpu.product` | Pin pods to a specific GPU model |
| Toleration for `nvidia.com/gpu` | Allow scheduling onto tainted GPU nodes |
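A minimal pod spec combining these settings might look like the sketch below. The product label value, image tag, and taint key are illustrative (they mirror common GPU Operator defaults); adjust them to your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # pin to a specific GPU model (illustrative value)
  tolerations:
    - key: nvidia.com/gpu                           # match the taint on GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # for extended resources, requests must equal limits
```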

Full manifests: resources/manifests.yaml

Model Serving Frameworks

Framework Comparison

| Framework | Best For | GPU Support | Scaling |
|---|---|---|---|
| vLLM | High-throughput LLMs | Excellent | HPA/KEDA |
| Triton | Multi-model serving | Excellent | HPA |
| TGI | HuggingFace models | Good | HPA |

vLLM Deployment

Key configurations:

  • --tensor-parallel-size - Multi-GPU inference
  • --max-model-len - Context window
  • --gpu-memory-utilization - Memory efficiency
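A container fragment wiring these flags into a Deployment might look like this sketch (the image tag and model name are placeholders):

```yaml
# Container spec fragment for a vLLM Deployment (model and image are illustrative)
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct   # placeholder model
      - --tensor-parallel-size=2                   # shard inference across 2 GPUs
      - --max-model-len=8192                       # context window in tokens
      - --gpu-memory-utilization=0.90              # fraction of GPU memory vLLM may use
    ports:
      - containerPort: 8000                        # OpenAI-compatible API
    resources:
      limits:
        nvidia.com/gpu: 2                          # should match tensor-parallel-size
```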

Triton Inference Server

  • Multi-model serving from S3/GCS
  • HTTP (8000), gRPC (8001), Metrics (8002)
  • Model polling for dynamic updates
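These features map onto the container spec roughly as follows; the bucket path is hypothetical and the image tag is illustrative:

```yaml
# Triton container fragment: serve from S3 and poll for model updates
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.08-py3   # illustrative tag
    command: ["tritonserver"]
    args:
      - --model-repository=s3://my-models/triton   # hypothetical bucket path
      - --model-control-mode=poll                  # pick up new/updated models
      - --repository-poll-secs=60
    ports:
      - containerPort: 8000   # HTTP
      - containerPort: 8001   # gRPC
      - containerPort: 8002   # Prometheus metrics
```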

Text Generation Inference (TGI)

  • HuggingFace native
  • Quantization support (bitsandbytes-nf4)
  • Simple deployment
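A minimal TGI container fragment with nf4 quantization could look like this (model name and image tag are placeholders):

```yaml
# TGI container fragment with 4-bit quantization (illustrative model)
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - --model-id=mistralai/Mistral-7B-Instruct-v0.3   # placeholder model
      - --quantize=bitsandbytes-nf4                     # 4-bit quantization
    ports:
      - containerPort: 80   # TGI's default serving port
    resources:
      limits:
        nvidia.com/gpu: 1
```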

Deployment manifests: resources/manifests.yaml

Helm Chart Pattern

```yaml
# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10

vectorDB:
  enabled: true
  type: "qdrant"

monitoring:
  enabled: true
```

Auto-Scaling

Horizontal Pod Autoscaler (HPA)

Scale on:

  • GPU utilization (DCGM_FI_DEV_GPU_UTIL)
  • Inference queue length
  • Custom metrics
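As a sketch, an `autoscaling/v2` HPA can target GPU utilization, assuming the Prometheus Adapter is configured to expose `DCGM_FI_DEV_GPU_UTIL` as a per-pod custom metric (that adapter mapping, and the Deployment name, are assumptions):

```yaml
# HPA scaling on per-pod GPU utilization; assumes the Prometheus Adapter
# exposes DCGM_FI_DEV_GPU_UTIL as a custom "Pods" metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm            # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "80"   # target ~80% average GPU utilization
```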

KEDA Event-Driven Scaling

Scale on:

  • Prometheus metrics
  • Message queue depth (RabbitMQ, SQS)
  • Custom external metrics
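A KEDA `ScaledObject` driven by a Prometheus query might look like this sketch (the server address, query, and Deployment name are illustrative):

```yaml
# KEDA ScaledObject: scale the inference Deployment on a Prometheus metric
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm                  # hypothetical Deployment name
  minReplicaCount: 0            # KEDA can scale to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(inference_queue_length)   # hypothetical app metric
        threshold: "10"
```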

HPA/KEDA configs: resources/manifests.yaml

Networking

Ingress Configuration

  • Rate limiting (nginx annotations)
  • TLS with cert-manager
  • Large body size for AI payloads
  • Extended timeouts (300s+)
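Put together, an NGINX Ingress tuned for AI payloads could be sketched as below (hostnames, service name, and the cert-manager issuer are placeholders):

```yaml
# NGINX Ingress with rate limiting, large bodies, long timeouts, and TLS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"            # per-client rate limit
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"     # large prompts/uploads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"  # long generations
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    cert-manager.io/cluster-issuer: letsencrypt-prod       # assumes cert-manager
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm          # hypothetical Service name
                port:
                  number: 8000
```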

Network Policies

  • Restrict pod-to-pod communication
  • Allow only gateway → inference
  • Permit DNS egress
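A NetworkPolicy implementing these three rules might look like this sketch (the `app` label values are illustrative):

```yaml
# Only the gateway may reach inference pods; DNS egress stays open
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-policy
spec:
  podSelector:
    matchLabels:
      app: inference        # illustrative label
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway  # only gateway pods may connect
      ports:
        - port: 8000
  egress:
    - ports:                # DNS to any resolver
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```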

Monitoring

Key Metrics

| Metric | Source | Purpose |
|---|---|---|
| GPU utilization | DCGM Exporter | Scaling |
| Inference latency | Prometheus | SLO tracking |
| Tokens/second | Custom app metric | Throughput |
| Queue length | App metrics | Scaling |

Setup

```bash
# Install DCGM Exporter
helm install dcgm-exporter nvidia/dcgm-exporter

# ServiceMonitor for Prometheus: see resources/manifests.yaml
```
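For Prometheus Operator setups, a ServiceMonitor pointing at the exporter could be sketched as below (the selector labels and port name depend on your chart release):

```yaml
# ServiceMonitor so Prometheus Operator scrapes the DCGM Exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # illustrative chart label
  endpoints:
    - port: metrics        # port name on the exporter's Service
      interval: 15s
```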

Managed Kubernetes

AWS EKS

  • Instance types: g5.2xlarge, p4d.24xlarge
  • AMI: AL2_x86_64_GPU
  • GPU taints for isolation

Azure AKS

  • VM sizes: Standard_NC*, Standard_ND*
  • A100 support via NC24ads_A100_v4

OCI OKE

  • Shapes: BM.GPU.A100-v2.8, VM.GPU.A10
  • GPU node pools with taints

Terraform examples: ../terraform-iac/resources/modules.tf

Best Practices

Resource Management

  • Set GPU limits equal to requests (Kubernetes requires this for extended resources such as nvidia.com/gpu)
  • Use node selectors to target specific GPU types
  • Add tolerations for GPU node taints
  • Cache model weights on a PVC to avoid re-downloading on every restart

High Availability

  • Multiple replicas across zones
  • Pod disruption budgets
  • Readiness/liveness probes
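A PodDisruptionBudget keeps at least one replica serving during voluntary disruptions such as node drains (the selector label is illustrative):

```yaml
# PDB: never evict below one running inference replica
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: inference     # illustrative label
```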

Cost Optimization

  • Use spot/preemptible instances for dev and test workloads
  • Scale to zero when idle (KEDA supports this; plain HPA does not)
  • Right-size GPU instance types to the model's memory footprint

Resources

  • resources/manifests.yaml – GPU, serving, autoscaling, networking, and monitoring manifests referenced throughout this skill
  • ../terraform-iac/resources/modules.tf – Terraform examples for managed Kubernetes

Deploy AI workloads at scale with GPU-optimized Kubernetes.
