# Kubernetes AI Expert

Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.

## Installation

```bash
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/kubernetes-ai-expert
```
## GPU Workload Scheduling

### NVIDIA GPU Operator

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace
```
### GPU Resource Requests

| Resource / setting | Description |
|---|---|
| `nvidia.com/gpu: N` | Request N full GPUs |
| `nvidia.com/mig-3g.40gb: 1` | Request one MIG slice |
| Node selector on `nvidia.com/gpu.product` | Pin pods to a specific GPU model |
| Toleration for `nvidia.com/gpu` | Schedule onto tainted GPU nodes |
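A minimal pod spec combining these pieces (the image and product-label values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB  # illustrative product label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: my-inference:latest  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1  # extended resources: limits = requests
```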
Full manifests: resources/manifests.yaml
## Model Serving Frameworks

### Framework Comparison
| Framework | Best For | GPU Support | Scaling |
|---|---|---|---|
| vLLM | High-throughput LLMs | Excellent | HPA/KEDA |
| Triton | Multi-model serving | Excellent | HPA |
| TGI | HuggingFace models | Good | HPA |
### vLLM Deployment

Key configurations:

- `--tensor-parallel-size`: shard the model across multiple GPUs
- `--max-model-len`: maximum context window
- `--gpu-memory-utilization`: fraction of GPU memory vLLM may use
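A sketch of the container portion of a vLLM Deployment, assuming the upstream `vllm/vllm-openai` image; the model name is illustrative:

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct  # illustrative model
      - --tensor-parallel-size=1
      - --max-model-len=8192
      - --gpu-memory-utilization=0.90
    ports:
      - containerPort: 8000  # OpenAI-compatible API
    resources:
      limits:
        nvidia.com/gpu: 1
```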
### Triton Inference Server
- Multi-model serving from S3/GCS
- HTTP (8000), gRPC (8001), Metrics (8002)
- Model polling for dynamic updates
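A sketch of the Triton container, assuming the NGC `tritonserver` image; the bucket path is illustrative:

```yaml
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3  # pick a current tag
    command: ["tritonserver"]
    args:
      - --model-repository=s3://my-bucket/models  # illustrative bucket
      - --model-control-mode=poll  # pick up model updates dynamically
    ports:
      - containerPort: 8000  # HTTP
      - containerPort: 8001  # gRPC
      - containerPort: 8002  # metrics
```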
### Text Generation Inference (TGI)

- HuggingFace native
- Quantization support (`bitsandbytes-nf4`)
- Simple deployment
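A sketch of the TGI container, assuming the upstream `ghcr.io/huggingface/text-generation-inference` image; the model ID is illustrative:

```yaml
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - --model-id=mistralai/Mistral-7B-Instruct-v0.3  # illustrative model
      - --quantize=bitsandbytes-nf4
    ports:
      - containerPort: 80  # TGI default port
    resources:
      limits:
        nvidia.com/gpu: 1
```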
Deployment manifests: resources/manifests.yaml
### Helm Chart Pattern

```yaml
# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10

vectorDB:
  enabled: true
  type: "qdrant"

monitoring:
  enabled: true
```
## Auto-Scaling
### Horizontal Pod Autoscaler (HPA)

Scale on:

- GPU utilization (`DCGM_FI_DEV_GPU_UTIL`)
- Inference queue length
- Custom metrics
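A sketch of an `autoscaling/v2` HPA on GPU utilization, assuming a Prometheus Adapter exposes `DCGM_FI_DEV_GPU_UTIL` as a pods metric; names and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm  # illustrative deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "80"  # illustrative utilization target
```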
### KEDA Event-Driven Scaling

Scale on:
- Prometheus metrics
- Message queue depth (RabbitMQ, SQS)
- Custom external metrics
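A sketch of a KEDA `ScaledObject` driven by a Prometheus query; the server address, query, and threshold are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm  # illustrative deployment name
  minReplicaCount: 0   # allows scale to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(inference_requests_total[2m]))  # illustrative metric
        threshold: "10"
```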
HPA/KEDA configs: resources/manifests.yaml
## Networking

### Ingress Configuration
- Rate limiting (nginx annotations)
- TLS with cert-manager
- Large body size for AI payloads
- Extended timeouts (300s+)
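A sketch of an Ingress with these annotations; the host, issuer, and backend names are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # illustrative issuer
    nginx.ingress.kubernetes.io/limit-rps: "10"
    nginx.ingress.kubernetes.io/proxy-body-size: 50m   # large AI payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm  # illustrative service
                port:
                  number: 8000
```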
### Network Policies
- Restrict pod-to-pod communication
- Allow only gateway → inference
- Permit DNS egress
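A sketch of a NetworkPolicy implementing these rules; the pod labels and port are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-policy
spec:
  podSelector:
    matchLabels:
      app: inference  # illustrative label
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway  # only gateway -> inference
      ports:
        - port: 8000
  egress:
    - ports:  # allow DNS to any destination
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```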
## Monitoring

### Key Metrics
| Metric | Source | Purpose |
|---|---|---|
| GPU Utilization | DCGM Exporter | Scaling |
| Inference Latency | Prometheus | SLO |
| Tokens/Second | Custom | Throughput |
| Queue Length | App metrics | Scaling |
### Setup

```bash
# Install DCGM Exporter (note: the GPU Operator deploys it by default)
helm install dcgm-exporter nvidia/dcgm-exporter

# ServiceMonitor for Prometheus: see resources/manifests.yaml
```
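A sketch of the ServiceMonitor, assuming the Prometheus Operator CRDs are installed and the exporter's Service exposes a port named `metrics`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter  # assumed service label
  endpoints:
    - port: metrics
      interval: 15s
```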
## Managed Kubernetes

### AWS EKS

- Instance types: `g5.2xlarge`, `p4d.24xlarge`
- AMI: `AL2_x86_64_GPU`
- GPU taints for isolation

### Azure AKS

- VM sizes: `Standard_NC*`, `Standard_ND*`
- A100 support via `Standard_NC24ads_A100_v4`

### OCI OKE

- Shapes: `BM.GPU.A100-v2.8`, `VM.GPU.A10`
- GPU node pools with taints
Terraform examples: ../terraform-iac/resources/modules.tf
## Best Practices

### Resource Management
- Always set GPU limits = requests
- Use node selectors for GPU types
- Implement tolerations for GPU taints
- PVC for model caching
### High Availability
- Multiple replicas across zones
- Pod disruption budgets
- Readiness/liveness probes
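A sketch of a PodDisruptionBudget for the inference workload; the label is illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 1  # keep at least one replica during voluntary disruptions
  selector:
    matchLabels:
      app: inference  # illustrative label
```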
### Cost Optimization
- Spot instances for dev/test
- Auto-scaling to zero when idle
- Right-size GPU instances
## Resources

- `resources/manifests.yaml`: deployment, autoscaling, networking, and monitoring manifests
- `../terraform-iac/resources/modules.tf`: Terraform examples for managed Kubernetes
Deploy AI workloads at scale with GPU-optimized Kubernetes.