# Kubernetes AI Expert

Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.

## Installation

```bash
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/kubernetes-ai-expert
```
## GPU Workload Scheduling

### NVIDIA GPU Operator

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace
```
### GPU Resource Requests

| Resource / setting | Description |
|---|---|
| `nvidia.com/gpu: N` | Request N full GPUs |
| `nvidia.com/mig-3g.40gb: 1` | Request one MIG slice |
| Node selector on `nvidia.com/gpu.product` | Pin pods to a specific GPU model |
| Toleration for `nvidia.com/gpu` | Schedule onto tainted GPU nodes |
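A minimal pod spec combining these pieces (the image and product-label values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB  # illustrative product label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: my-inference:latest  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1  # extended resources: limits = requests
```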
Full manifests: resources/manifests.yaml
## Model Serving Frameworks

### Framework Comparison
| Framework | Best For | GPU Support | Scaling |
|---|---|---|---|
| vLLM | High-throughput LLMs | Excellent | HPA/KEDA |
| Triton | Multi-model serving | Excellent | HPA |
| TGI | HuggingFace models | Good | HPA |
### vLLM Deployment

Key configurations:

- `--tensor-parallel-size`: shard the model across multiple GPUs
- `--max-model-len`: maximum context window
- `--gpu-memory-utilization`: fraction of GPU memory vLLM may use
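A sketch of the container portion of a vLLM Deployment, assuming the upstream `vllm/vllm-openai` image; the model name is illustrative:

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct  # illustrative model
      - --tensor-parallel-size=1
      - --max-model-len=8192
      - --gpu-memory-utilization=0.90
    ports:
      - containerPort: 8000  # OpenAI-compatible API
    resources:
      limits:
        nvidia.com/gpu: 1
```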
### Triton Inference Server
- Multi-model serving from S3/GCS
- HTTP (8000), gRPC (8001), Metrics (8002)
- Model polling for dynamic updates
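A sketch of the Triton container, assuming the NGC `tritonserver` image; the bucket path is illustrative:

```yaml
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3  # pick a current tag
    command: ["tritonserver"]
    args:
      - --model-repository=s3://my-bucket/models  # illustrative bucket
      - --model-control-mode=poll  # pick up model updates dynamically
    ports:
      - containerPort: 8000  # HTTP
      - containerPort: 8001  # gRPC
      - containerPort: 8002  # metrics
```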
### Text Generation Inference (TGI)

- HuggingFace native
- Quantization support (`bitsandbytes-nf4`)
- Simple deployment
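A sketch of the TGI container, assuming the upstream `ghcr.io/huggingface/text-generation-inference` image; the model ID is illustrative:

```yaml
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - --model-id=mistralai/Mistral-7B-Instruct-v0.3  # illustrative model
      - --quantize=bitsandbytes-nf4
    ports:
      - containerPort: 80  # TGI default port
    resources:
      limits:
        nvidia.com/gpu: 1
```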
Deployment manifests: resources/manifests.yaml
### Helm Chart Pattern

```yaml
# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10

vectorDB:
  enabled: true
  type: "qdrant"

monitoring:
  enabled: true
```
## Auto-Scaling
### Horizontal Pod Autoscaler (HPA)

Scale on:

- GPU utilization (`DCGM_FI_DEV_GPU_UTIL`)
- Inference queue length
- Custom metrics
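A sketch of an `autoscaling/v2` HPA on GPU utilization, assuming a Prometheus Adapter exposes `DCGM_FI_DEV_GPU_UTIL` as a pods metric; names and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm  # illustrative deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "80"  # illustrative utilization target
```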
### KEDA Event-Driven Scaling

Scale on:
- Prometheus metrics
- Message queue depth (RabbitMQ, SQS)
- Custom external metrics
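A sketch of a KEDA `ScaledObject` driven by a Prometheus query; the server address, query, and threshold are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm  # illustrative deployment name
  minReplicaCount: 0   # allows scale to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(inference_requests_total[2m]))  # illustrative metric
        threshold: "10"
```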
HPA/KEDA configs: resources/manifests.yaml
## Networking

### Ingress Configuration
- Rate limiting (nginx annotations)
- TLS with cert-manager
- Large body size for AI payloads
- Extended timeouts (300s+)
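A sketch of an Ingress with these annotations; the host, issuer, and backend names are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # illustrative issuer
    nginx.ingress.kubernetes.io/limit-rps: "10"
    nginx.ingress.kubernetes.io/proxy-body-size: 50m   # large AI payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm  # illustrative service
                port:
                  number: 8000
```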
### Network Policies
- Restrict pod-to-pod communication
- Allow only gateway → inference
- Permit DNS egress
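A sketch of a NetworkPolicy implementing these rules; the pod labels and port are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-policy
spec:
  podSelector:
    matchLabels:
      app: inference  # illustrative label
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway  # only gateway -> inference
      ports:
        - port: 8000
  egress:
    - ports:  # allow DNS to any destination
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```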
## Monitoring

### Key Metrics
| Metric | Source | Purpose |
|---|---|---|
| GPU Utilization | DCGM Exporter | Scaling |
| Inference Latency | Prometheus | SLO |
| Tokens/Second | Custom | Throughput |
| Queue Length | App metrics | Scaling |
### Setup

```bash
# Install DCGM Exporter (note: the GPU Operator deploys it by default)
helm install dcgm-exporter nvidia/dcgm-exporter

# ServiceMonitor for Prometheus: see resources/manifests.yaml
```
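A sketch of the ServiceMonitor, assuming the Prometheus Operator CRDs are installed and the exporter's Service exposes a port named `metrics`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter  # assumed service label
  endpoints:
    - port: metrics
      interval: 15s
```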
## Managed Kubernetes

### AWS EKS

- Instance types: `g5.2xlarge`, `p4d.24xlarge`
- AMI: `AL2_x86_64_GPU`
- GPU taints for isolation

### Azure AKS

- VM sizes: `Standard_NC*`, `Standard_ND*`
- A100 support via `Standard_NC24ads_A100_v4`

### OCI OKE

- Shapes: `BM.GPU.A100-v2.8`, `VM.GPU.A10`
- GPU node pools with taints
Terraform examples: ../terraform-iac/resources/modules.tf
## Best Practices

### Resource Management
- Always set GPU limits = requests
- Use node selectors for GPU types
- Implement tolerations for GPU taints
- PVC for model caching
### High Availability
- Multiple replicas across zones
- Pod disruption budgets
- Readiness/liveness probes
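A sketch of a PodDisruptionBudget for the inference workload; the label is illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 1  # keep at least one replica during voluntary disruptions
  selector:
    matchLabels:
      app: inference  # illustrative label
```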
### Cost Optimization
- Spot instances for dev/test
- Auto-scaling to zero when idle
- Right-size GPU instances
## Resources

- `resources/manifests.yaml`: deployment, autoscaling, networking, and monitoring manifests
- `../terraform-iac/resources/modules.tf`: Terraform examples for managed Kubernetes
Deploy AI workloads at scale with GPU-optimized Kubernetes.