Kubernetes AI Expert

Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns

Install this agent skill in your project:

npx add-skill https://github.com/frankxai/ai-architect/tree/main/skills/kubernetes-ai

Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.

GPU Workload Scheduling

NVIDIA GPU Operator

bash

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait

GPU Resource Requests

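Mechanism	Setting	Purpose
Resource limit	`nvidia.com/gpu: N`	Request N whole GPUs
Resource limit	`nvidia.com/mig-3g.40gb: 1`	Request one MIG slice (mixed MIG strategy)
Node selector	`nvidia.com/gpu.product`	Pin pods to a specific GPU model
Toleration	`nvidia.com/gpu`	Schedule onto tainted, dedicated GPU nodes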

Full manifests: resources/manifests.yaml
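A minimal single-GPU pod combining the four settings in the table above. This is a sketch, not the contents of resources/manifests.yaml: the pod name, the CUDA image tag, and the `nvidia.com/gpu.product` value are assumptions; check the labels your nodes actually carry with `kubectl get nodes --show-labels`.

```yaml
# Hypothetical smoke-test pod: runs nvidia-smi on one A100 and exits.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                     # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # assumed label value (set by GPU Feature Discovery)
  tolerations:
    - key: nvidia.com/gpu                  # matches a NoSchedule taint on GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04    # assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                # for extended resources, requests default to this limit
          # MIG alternative (mixed strategy): nvidia.com/mig-3g.40gb: 1
```

`kubectl logs gpu-smoke-test` should then show the GPU the device plugin assigned to the container.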

Model Serving Frameworks

Framework Comparison

Framework	Best For	GPU Support	Scaling
vLLM	High-throughput LLMs	Excellent	HPA/KEDA
Triton	Multi-model serving	Excellent	HPA
TGI	HuggingFace models	Good	HPA

vLLM Deployment

Key configurations:

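--tensor-parallel-size - number of GPUs to shard the model across (tensor parallelism)
--max-model-len - maximum context length in tokens (prompt plus generated output)
--gpu-memory-utilization - fraction of each GPU's memory vLLM may use for weights, activations, and KV cache (default 0.9)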
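A sketch of how those flags map onto a Deployment, assuming the public `vllm/vllm-openai` image (whose entrypoint is the OpenAI-compatible server listening on port 8000). The model ID, resource sizes, and the `model-cache-pvc` claim are hypothetical; the key constraint is that `--tensor-parallel-size` matches the GPU limit.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest          # pin a specific release in production
          args:
            - --model=Qwen/Qwen2.5-7B-Instruct    # hypothetical model choice
            - --tensor-parallel-size=2            # must equal the GPU limit below
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 2
          readinessProbe:
            httpGet:
              path: /health                       # vLLM server health endpoint
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface # Hugging Face download cache
            - name: dshm
              mountPath: /dev/shm                 # shared memory for multi-GPU workers
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc            # hypothetical PVC
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
```

Mounting the cache on a PVC means a restarted pod reuses downloaded weights instead of pulling them from the Hub again.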

Triton Inference Server

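Serves multiple models from one model repository (local path, S3, GCS, or Azure Blob Storage)
Ports: HTTP (8000), gRPC (8001), Prometheus metrics (8002)
Detects repository changes with `--model-control-mode=poll` (interval set by `--repository-poll-secs`)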
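A minimal Triton Deployment reading its model repository from S3 and polling it for updates. The image tag, bucket path, and `triton-s3-credentials` Secret are assumptions; Triton picks up S3 credentials from the standard `AWS_*` environment variables.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  labels:
    app: triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-py3     # assumed tag; use a current release
          command: ["tritonserver"]
          args:
            - --model-repository=s3://my-model-bucket/models  # hypothetical bucket
            - --model-control-mode=poll
            - --repository-poll-secs=30
          env:
            - name: AWS_DEFAULT_REGION
              value: us-east-1
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: triton-s3-credentials    # hypothetical Secret
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: triton-s3-credentials
                  key: AWS_SECRET_ACCESS_KEY
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          readinessProbe:
            httpGet:
              path: /v2/health/ready             # KServe v2 protocol readiness endpoint
              port: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
```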

Text Generation Inference (TGI)

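Built by Hugging Face; loads models directly from the Hub by model ID
Quantization support (e.g. `--quantize bitsandbytes-nf4`; GPTQ and AWQ checkpoints are also supported)
Simple single-container deployment configured through `text-generation-launcher` flags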
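A sketch of a TGI Deployment loading a Hub model with NF4 quantization. The image tag and model ID are assumptions; the container's entrypoint is the TGI launcher, so `args` are passed to it as flags.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-server
  labels:
    app: tgi-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-server
  template:
    metadata:
      labels:
        app: tgi-server
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:2.4.0   # assumed tag
          args:
            - --model-id=HuggingFaceH4/zephyr-7b-beta   # hypothetical model choice
            - --quantize=bitsandbytes-nf4               # 4-bit NF4 weights to cut GPU memory
            - --port=8080
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm                       # shared memory used by TGI shards
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
```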

Deployment manifests: resources/manifests.yaml

Helm Chart Pattern

yaml

# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10

vectorDB:
  enabled: true
  type: "qdrant"

monitoring:
  enabled: true

Auto-Scaling

Horizontal Pod Autoscaler (HPA)

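Scale on (out of the box HPA only sees CPU and memory; GPU and application metrics must be served through the custom metrics API, e.g. by Prometheus Adapter):

GPU utilization (DCGM_FI_DEV_GPU_UTIL)
Inference queue length
Other application metrics (e.g. request latency)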
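An `autoscaling/v2` HPA scaling on average GPU utilization per pod. This assumes Prometheus Adapter (rules not shown) exposes `DCGM_FI_DEV_GPU_UTIL` as a per-pod custom metric under that name, and that the target Deployment is called `vllm-server` (hypothetical).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server                 # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL  # assumed to be served by the custom metrics API
        target:
          type: AverageValue
          averageValue: "80"          # target: 80% GPU utilization per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # avoid thrashing pods that take minutes to load a model
```

The long scale-down window is a design choice: removing a replica is cheap, but adding it back means reloading model weights.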

KEDA Event-Driven Scaling

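Scale on (KEDA can also scale a workload to zero replicas, which the standard HPA does not do by default):

Prometheus metrics
Message queue depth (RabbitMQ, SQS)
Custom external metrics (via the External or Metrics API scalers)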
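A sketch of queue-driven scaling for an asynchronous batch-inference worker, assuming KEDA is installed. The worker Deployment, queue URL, and `aws-sqs-secret` Secret are hypothetical; with `minReplicaCount: 0`, the GPU workers disappear when the queue is empty.

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: sqs-credentials
spec:
  secretTargetRef:
    - parameter: awsAccessKeyID
      name: aws-sqs-secret              # hypothetical Secret
      key: AWS_ACCESS_KEY_ID
    - parameter: awsSecretAccessKey
      name: aws-sqs-secret
      key: AWS_SECRET_ACCESS_KEY
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-inference-worker
spec:
  scaleTargetRef:
    name: batch-inference-worker        # hypothetical Deployment consuming the queue
  minReplicaCount: 0                    # scale to zero when the queue is empty
  maxReplicaCount: 10
  cooldownPeriod: 300                   # seconds of inactivity before scaling to zero
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: sqs-credentials
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/inference-jobs  # hypothetical
        queueLength: "5"                # target messages per replica
        awsRegion: us-east-1
```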

HPA/KEDA configs: resources/manifests.yaml

Networking

Ingress Configuration

Rate limiting (nginx annotations)
TLS with cert-manager
Large body size for AI payloads
Extended timeouts (300s+)
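The four ingress settings above expressed as ingress-nginx and cert-manager annotations. The host, ClusterIssuer name, backend Service, and limit values are assumptions to adapt.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod           # assumed ClusterIssuer name
    nginx.ingress.kubernetes.io/limit-rps: "10"                # requests per second per client IP
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"         # large prompts and file uploads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"      # seconds; long generations
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"         # stream tokens without buffering
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm.example.com
      secretName: llm-example-tls      # cert-manager issues and renews this Secret
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-server      # hypothetical inference Service
                port:
                  number: 8000
```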

Network Policies

Restrict pod-to-pod communication
Allow only gateway → inference
Permit DNS egress
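A NetworkPolicy implementing those three rules for the inference pods. The `app: vllm-server` and `app: api-gateway` labels are hypothetical, and the policy only takes effect on a CNI that enforces NetworkPolicy (e.g. Calico or Cilium). Because egress is limited to DNS, model weights must already be on a volume (or an extra egress rule added for the model registry).

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-allow-gateway
spec:
  podSelector:
    matchLabels:
      app: vllm-server                # inference pods (hypothetical label)
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway        # only the gateway may call inference
      ports:
        - protocol: TCP
          port: 8000
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns       # cluster DNS pods
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

If the gateway runs in another namespace, add a `namespaceSelector` to the ingress `from` entry.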

Monitoring

Key Metrics

Metric	Source	Purpose
GPU Utilization	DCGM Exporter	Scaling
Inference Latency	Prometheus	SLO
Tokens/Second	Custom	Throughput
Queue Length	App metrics	Scaling

Setup

bash

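# DCGM Exporter is deployed by the GPU Operator by default.
# Standalone install (clusters without the GPU Operator):
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

# ServiceMonitor for Prometheus
# See resources/manifests.yaml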
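A sketch of scraping an inference server's own metrics (vLLM serves Prometheus metrics at `/metrics` on its HTTP port), assuming the Prometheus Operator is installed. The `release: prometheus` label is an assumption: it must match your Prometheus `serviceMonitorSelector`.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  selector:
    app: vllm-server               # hypothetical pod label
  ports:
    - name: http
      port: 8000
      targetPort: 8000
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-server
  labels:
    release: prometheus            # assumed Prometheus selector label
spec:
  selector:
    matchLabels:
      app: vllm-server             # selects the Service above
  endpoints:
    - port: http                   # Service port name, not number
      path: /metrics
      interval: 15s
```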

Managed Kubernetes

AWS EKS

Instance types: g5.2xlarge, p4d.24xlarge
AMI type: AL2_x86_64_GPU (AL2023_x86_64_NVIDIA on newer clusters)
GPU taints for isolation

Azure AKS

VM sizes: Standard_NC*, Standard_ND*
A100 support via Standard_NC24ads_A100_v4

OCI OKE

Shapes: BM.GPU.A100-v2.8, VM.GPU.A10.1, VM.GPU.A10.2
GPU node pools with taints

Terraform examples: ../terraform-iac/resources/modules.tf

Best Practices

Resource Management

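Always set GPU limits (for extended resources such as nvidia.com/gpu, requests default to the limit and must equal it if set explicitly)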
Use node selectors for GPU types
Implement tolerations for GPU taints
PVC for model caching
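A model-cache PVC sketch. The claim name, size, and storage class are hypothetical; `ReadWriteMany` lets several replicas share one cache but requires an RWX-capable class (e.g. a network file system), otherwise use `ReadWriteOnce` per replica.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc          # hypothetical name
spec:
  accessModes:
    - ReadWriteMany              # shared by all replicas; needs an RWX-capable storage class
  storageClassName: efs-sc       # hypothetical storage class
  resources:
    requests:
      storage: 200Gi             # size for the model weights you cache
```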

High Availability

Multiple replicas across zones
Pod disruption budgets
Readiness/liveness probes
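A sketch of the three HA practices for a Deployment labeled `app: vllm-server` (hypothetical). The PDB is standalone; the second document is an excerpt to merge under the Deployment's `spec.template.spec`, not a Kubernetes object on its own. A `startupProbe` keeps liveness checks from killing a pod that is still loading weights.

```yaml
# Keep at least one replica through drains and upgrades (run two or more replicas).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-server-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-server
---
# Excerpt: fields to add under the Deployment's spec.template.spec
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread replicas across zones
    whenUnsatisfiable: ScheduleAnyway          # prefer spreading, but don't block scheduling
    labelSelector:
      matchLabels:
        app: vllm-server
containers:
  - name: vllm
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 10
      failureThreshold: 60                     # allow up to 10 minutes for model load
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 15
```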

Cost Optimization

Spot instances for dev/test
Scale to zero when idle (KEDA `minReplicaCount: 0`; the standard HPA keeps at least one replica by default)
Right-size GPU instances

Resources

Deploy AI workloads at scale with GPU-optimized Kubernetes.

Maintainer

frankxai (core maintainer)

Source details

Full Name: frankxai/ai-architect
Branch: main
Path in repo: skills/kubernetes-ai
Topics: architecture ai-architect ai-patterns enterprise-ai oracle systems-design

Recommended Agent Skills

Related skills from frankxai/ai-architect:

GenAI DAC Specialist: Expert in OCI Generative AI Dedicated AI Clusters - deployment, fine-tuning, optimization, and production operations

Oracle Agent Spec Expert: Design framework-agnostic AI agents using Oracle's Open Agent Specification for portable, interoperable agentic systems with JSON/YAML definitions

AI Security Expert: Enterprise AI security - OWASP LLM Top 10, prompt injection defense, guardrails, PII protection

OCI Services Expert: Expert guidance on Oracle Cloud Infrastructure services, cloud architecture patterns, cost optimization, deployment strategies, and OCI best practices for enterprise solutions

agentic-orchestration: Patterns for multi-agent coordination, task decomposition, handoffs, and workflow orchestration. Best practices for building and managing agent systems.

nvidia-nim: NVIDIA NIM inference microservices for deploying AI models with OpenAI-compatible APIs, self-hosted or cloud
