Agent skill

model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

Install this agent skill in your project:

npx add-skill https://github.com/ancoleman/ai-design-components/tree/main/skills/model-serving

SKILL.md

Model Serving

Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

When to Use

Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
Building AI APIs with streaming responses
Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
Implementing RAG pipelines with vector databases
Optimizing inference throughput and latency
Integrating LLM serving with frontend chat interfaces

Model Serving Selection

LLM Serving Engines

vLLM (Recommended Primary)

PagedAttention memory management (up to 24x higher throughput than Hugging Face Transformers in the vLLM project's own benchmarks)
Continuous batching for dynamic request handling
OpenAI-compatible API endpoints
Use for: Most self-hosted LLM deployments

TensorRT-LLM

Maximum GPU efficiency on NVIDIA hardware (the 2-8x speedup over vLLM is a best case; gains depend on model, batch size, and GPU)
Requires model conversion and optimization
Use for: Production workloads needing absolute maximum throughput

Ollama

Local development, including on machines without a GPU (runs on CPU, uses a GPU when one is available)
Simple CLI interface
Use for: Prototyping, laptop development, educational purposes

Decision Framework:

Self-hosted LLM deployment needed?
├─ Yes, general production serving → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
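
The decision tree can also be written as a small function, which is handy when the choice is made in a deployment script. This is a minimal sketch: the function name and its boolean parameters are hypothetical, and the branch order (local-only first, then GPU efficiency) is one reasonable reading of the tree, not a rule from any of the tools.

```python
def choose_serving_engine(
    self_hosted: bool,
    local_only: bool = False,
    need_max_gpu_efficiency: bool = False,
) -> str:
    """Return the serving option suggested by the decision framework."""
    if not self_hosted:
        # Managed APIs (OpenAI, Anthropic) need no serving layer
        return "managed API"
    if local_only:
        return "Ollama"
    if need_max_gpu_efficiency:
        return "TensorRT-LLM"
    # Default for most self-hosted deployments
    return "vLLM"


print(choose_serving_engine(self_hosted=True))
print(choose_serving_engine(self_hosted=True, local_only=True))
```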

ML Model Serving (Non-LLM)

BentoML (Recommended)

Python-native, easy deployment
Adaptive batching for throughput
Multi-framework support (scikit-learn, PyTorch, XGBoost)
Use for: Most traditional ML model deployments

Triton Inference Server

Multi-model serving on same GPU
Model ensembles (chain multiple models)
Use for: NVIDIA GPU optimization, serving 10+ models

LLM Orchestration

LangChain

General-purpose workflows, agents, RAG
100+ integrations (LLMs, vector DBs, tools)
Use for: Most RAG and agent applications

LlamaIndex

RAG-focused with advanced retrieval strategies
100+ data connectors (PDF, Notion, web)
Use for: RAG is primary use case

Quick Start Examples

vLLM Server Setup

bash

# Install
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000

Key Parameters:

--dtype: Model precision (auto, float16, bfloat16)
--max-model-len: Context window size
--gpu-memory-utilization: GPU memory fraction (0.8-0.95)
--tensor-parallel-size: Number of GPUs for model parallelism
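
When the server is launched from a script, it can help to assemble the command from validated settings. The sketch below only builds the argument list from the flags listed above; the helper name `build_vllm_command` and its defaults are assumptions for illustration, and it does not start the server itself.

```python
import shlex


def build_vllm_command(
    model: str,
    dtype: str = "auto",
    max_model_len: int = 4096,
    gpu_memory_utilization: float = 0.9,
    tensor_parallel_size: int = 1,
    port: int = 8000,
) -> list[str]:
    """Build the argv for `vllm serve` using the flags described above."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "vllm", "serve", model,
        "--dtype", dtype,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]


command = build_vllm_command("meta-llama/Llama-3.1-8B-Instruct")
# Print a shell-safe version of the command for logging or copy-paste
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run` without going through a shell.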

Streaming Responses (SSE Pattern)

Backend (FastAPI):

python

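from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel
import json

app = FastAPI()
# Placeholder key: the OpenAI client requires one even if the server does not check it
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class ChatRequest(BaseModel):
    # Matches the JSON body sent by the frontend: {"message": "..."}
    message: str

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate():
        # The async client keeps the event loop free while tokens arrive
        stream = await client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": request.message}],
            stream=True,
            max_tokens=512
        )

        async for chunk in stream:
            # Skip chunks that carry no choices or an empty delta
            if chunk.choices and chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"

        yield f"data: {json.dumps({'done': True})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )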

Frontend (React):

typescript

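// Integration with ai-chat skill
// Assumes `setResponse` is a React state setter from the enclosing component
const sendMessage = async (message: string) => {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  })
  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed: ${response.status}`)
  }

  const reader = response.body.getReader()
  const decoder = new TextDecoder()
  let buffer = ''

  while (true) {
    const { done, value } = await reader.read()
    if (done) break

    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true })

    // Parse only complete events (terminated by a blank line);
    // a trailing partial event stays in the buffer for the next read
    const events = buffer.split('\n\n')
    buffer = events.pop() ?? ''

    for (const event of events) {
      if (event.startsWith('data: ')) {
        const data = JSON.parse(event.slice(6))
        if (data.token) {
          setResponse(prev => prev + data.token)
        }
      }
    }
  }
}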
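
The same stream can be consumed from Python, for example in a test or a CLI client. Network chunks do not line up with event boundaries, so the parser has to buffer until it sees the blank line that ends each `data:` frame. The sketch below assumes the frame format emitted by the backend above (`{"token": ...}` and `{"done": true}`); the function names are hypothetical and the byte chunks stand in for whatever an HTTP client yields.

```python
import codecs
import json
from typing import Iterable, Iterator


def iter_sse_events(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE frames split across chunks."""
    # Incremental decoding tolerates multi-byte characters cut by a chunk boundary
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # A blank line terminates each event; keep any partial event buffered
        while "\n\n" in buffer:
            frame, buffer = buffer.split("\n\n", 1)
            if frame.startswith("data: "):
                yield json.loads(frame[len("data: "):])


def collect_tokens(chunks: Iterable[bytes]) -> str:
    """Concatenate the token payloads of a stream into the full response."""
    return "".join(e["token"] for e in iter_sse_events(chunks) if "token" in e)


# Chunks deliberately split mid-frame to exercise the buffering
sample = [
    b'data: {"tok',
    b'en": "Hel"}\n\ndata: {"token": "lo"}\n',
    b'\ndata: {"done": true}\n\n',
]
print(collect_tokens(sample))
```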

BentoML Service

python

import bentoml
import numpy as np

# Class labels in the order of the model's integer class indices
LABELS = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 10}
)
class IrisClassifier:
    # Assumes a scikit-learn model was saved earlier as "iris_classifier"
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api(batchable=True, max_batch_size=32)
    def classify(self, features: list[dict]) -> list[str]:
        X = np.array([[f['sepal_length'], f['sepal_width'],
                       f['petal_length'], f['petal_width']] for f in features])
        predictions = self.model.predict(X)
        # Map each predicted class index to its label
        return [LABELS[int(i)] for i in predictions]

LangChain RAG Pipeline

python

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Placeholder corpus; in practice, load documents with a document loader
documents = [
    Document(page_content="PagedAttention stores the KV cache in fixed-size blocks.")
]

# Load and chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Create vector store (requires a Qdrant instance listening on this URL)
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query (invoke replaces the deprecated direct call on the chain)
result = qa_chain.invoke({"query": "What is PagedAttention?"})
print(result["result"])

Performance Optimization

GPU Memory Estimation

Rule of thumb for LLMs:

GPU Memory (GB) = Model Parameters (B) × Precision (bytes) × 1.2

Examples:

Llama-3.1-8B (FP16): 8B × 2 bytes × 1.2 = 19.2 GB
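Llama-3.1-70B (FP16): 70B × 2 bytes × 1.2 = 168 GB (at least 3 × 80 GB A100s; 4 in practice, since the tensor-parallel size must evenly divide the model's attention heads)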

Quantization reduces memory:

FP16: 2 bytes per parameter
INT8: 1 byte per parameter (2x memory reduction)
INT4: 0.5 bytes per parameter (4x memory reduction)
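
The rule of thumb and the quantization factors combine into a short calculator. This is a sketch of the estimate only: the function names are hypothetical, the 1.2 overhead factor is the one from the formula above, and the 80 GB per-GPU figure in the usage lines is an assumption. The result is a lower bound on GPU count, because it ignores tensor-parallel divisibility constraints and the KV cache needed for long contexts.

```python
import math

# Bytes per parameter for each precision listed above
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_gpu_memory_gb(
    params_billion: float, precision: str = "fp16", overhead: float = 1.2
) -> float:
    """GPU memory (GB) = parameters (B) x bytes per parameter x overhead."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead


def min_gpus(params_billion: float, precision: str, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory covers the estimate."""
    return math.ceil(estimate_gpu_memory_gb(params_billion, precision) / gpu_memory_gb)


for name, size in [("Llama-3.1-8B", 8), ("Llama-3.1-70B", 70)]:
    for precision in BYTES_PER_PARAM:
        needed = estimate_gpu_memory_gb(size, precision)
        # 80 GB per GPU is an assumed card size for illustration
        print(f"{name} {precision}: {needed:.1f} GB, >= {min_gpus(size, precision, 80)} x 80 GB GPU(s)")
```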

vLLM Optimization

bash

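# Enable quantization (AWQ for 4-bit)
# The repo name below is illustrative; substitute an AWQ-quantized checkpoint
# that you have confirmed exists on the Hugging Face Hub
vllm serve TheBloke/Llama-3.1-8B-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.9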

# Multi-GPU deployment (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9

Batching Strategies

Continuous batching (vLLM default):

Dynamically adds/removes requests from batch
Higher throughput than static batching
No configuration needed
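
A toy model shows why refilling slots as requests finish beats waiting for the whole batch. The sketch below counts decode steps only and assumes every request needs a known number of steps; it ignores prefill, memory limits, and preemption, so it illustrates the idea and is not vLLM's scheduler. The function names are hypothetical.

```python
import heapq


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot immediately for the next one in the queue."""
    # Each heap entry is the step at which a slot becomes free
    slots = [0] * batch_size
    for n in lengths:
        free_at = heapq.heappop(slots)      # earliest available slot
        heapq.heappush(slots, free_at + n)  # occupy it for n decode steps
    return max(slots)


# Two long generations mixed with short ones, four slots
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print("static:", static_batching_steps(lengths, 4))
print("continuous:", continuous_batching_steps(lengths, 4))
```

In the static case the short requests sit idle behind the long one in their batch; in the continuous case the second long request starts as soon as the first short one completes.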

Adaptive batching (BentoML):

python

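# Method of a @bentoml.service class (see the IrisClassifier example above)
@bentoml.api(
    batchable=True,
    max_batch_size=32,
    max_latency_ms=1000  # Wait max 1s to fill batch
)
def predict(self, inputs: list[np.ndarray]) -> list[float]:
    # BentoML automatically batches requests
    # tolist() converts the NumPy result to match the declared return type
    return self.model.predict(np.array(inputs)).tolist()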

Production Deployment

Kubernetes Deployment

See examples/k8s-vllm-deployment/ for complete YAML manifests.

Key considerations:

GPU resource requests: nvidia.com/gpu: 1
Health checks: /health endpoint
Horizontal Pod Autoscaling based on queue depth
Persistent volume for model caching

API Gateway Pattern

For production, add rate limiting, authentication, and monitoring:

Kong Configuration:

yaml

services:
  - name: vllm-service
    url: http://vllm-llama-8b:8000
    plugins:
      - name: rate-limiting
        config:
          minute: 60  # 60 requests per minute per API key
      - name: key-auth
      - name: prometheus

Monitoring Metrics

Essential LLM metrics:

Tokens per second (throughput)
Time to first token (TTFT)
Inter-token latency
GPU utilization and memory
Queue depth

Prometheus instrumentation:

python

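import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
from pydantic import BaseModel

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Expose metrics for Prometheus to scrape

requests_total = Counter('llm_requests_total', 'Total requests')
tokens_generated = Counter('llm_tokens_generated', 'Total tokens')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(request: ChatRequest):
    requests_total.inc()
    start = time.perf_counter()
    # generate() is the application's own inference call (not shown);
    # it is assumed to return an object with a `tokens` list
    response = await generate(request)
    tokens_generated.inc(len(response.tokens))
    request_duration.observe(time.perf_counter() - start)
    return response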
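
Time to first token and inter-token latency from the metrics list need per-token timestamps, not just a request duration. The sketch below computes them from arrival times recorded while consuming a stream; the function name and the returned dictionary keys are hypothetical, and the timestamps in the usage lines are synthetic.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean inter-token latency, and tokens/second for one request."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # Gaps between consecutive token arrivals
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "ttft_s": token_times[0] - request_start,
        "inter_token_latency_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "tokens_per_second": len(token_times) / total if total > 0 else 0.0,
    }


# Synthetic timestamps: first token after 0.5 s, then one token every 0.1 s
metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Each value can then be fed to a Prometheus `Histogram` with `observe()`, in the same way as the request duration above.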

Integration Patterns

Frontend (ai-chat) Integration

This skill provides the backend serving layer for the ai-chat skill.

Flow:

Frontend (React) → API Gateway → vLLM Server → GPU Inference
     ↑                                                  ↓
     └─────────── SSE Stream (tokens) ─────────────────┘

See references/streaming-sse.md for complete implementation patterns.

RAG with Vector Databases

Architecture:

User Query → LangChain
              ├─> Vector DB (Qdrant) for retrieval
              ├─> Combine context + query
              └─> LLM (vLLM) for generation

See references/langchain-orchestration.md and examples/langchain-rag-qdrant/ for complete patterns.
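
The retrieve-then-combine steps in the architecture can be tried without any services running. The sketch below is a dependency-free stand-in: word-count vectors with cosine similarity replace real embeddings and Qdrant, and the prompt template is an assumption. All function names are hypothetical; the resulting prompt is what would be sent to the LLM for generation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a trained model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Combine retrieved context and the user query into one prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "PagedAttention manages KV cache memory in pages",
    "Kong handles rate limiting",
    "BentoML serves scikit-learn models",
]
query = "how does PagedAttention manage memory"
print(build_prompt(query, retrieve(query, docs, k=1)))
```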

Async Inference Queue

For batch processing or non-real-time inference:

Client → API → Task Queue (Celery + message broker) → Workers (vLLM) → Results DB

Useful for:

Batch document processing
Background summarization
Non-interactive workflows
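
The queue-and-workers shape can be sketched with the standard library before introducing Celery. In the example below, `asyncio.Queue` stands in for the task queue, a dictionary stands in for the results DB, and `run_inference` is a hypothetical placeholder for the call to the vLLM server; none of these are Celery APIs.

```python
import asyncio


async def run_inference(prompt: str) -> str:
    # Hypothetical stand-in for a request to the model server
    await asyncio.sleep(0)
    return f"summary of: {prompt}"


async def worker(queue: asyncio.Queue, results: dict[int, str]) -> None:
    """Pull jobs until cancelled, storing each result under its job id."""
    while True:
        job_id, prompt = await queue.get()
        try:
            results[job_id] = await run_inference(prompt)
        finally:
            queue.task_done()


async def process_batch(prompts: list[str], num_workers: int = 2) -> dict[int, str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict[int, str] = {}
    for job_id, prompt in enumerate(prompts):
        queue.put_nowait((job_id, prompt))
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(num_workers)]
    await queue.join()  # wait until every job has been marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


batch_results = asyncio.run(process_batch(["doc A", "doc B", "doc C"]))
print(batch_results)
```

The client receives a job id immediately and polls or is notified later, which is what makes this pattern suitable for non-interactive work.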

Benchmarking

Use scripts/benchmark_inference.py to measure the deployment:

bash

python scripts/benchmark_inference.py \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --concurrency 32 \
  --requests 1000

Outputs:

Requests per second
P50/P95/P99 latency
Tokens per second
GPU memory usage
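
The latency percentiles and throughput figures are simple to derive from raw per-request measurements. The sketch below uses the nearest-rank percentile definition; it is an illustration with hypothetical function names and synthetic data, and it is not the implementation inside scripts/benchmark_inference.py.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], total_tokens: int, wall_clock_s: float) -> dict[str, float]:
    """Aggregate a benchmark run into the headline numbers."""
    return {
        "requests_per_second": len(latencies_s) / wall_clock_s,
        "tokens_per_second": total_tokens / wall_clock_s,
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "p99_s": percentile(latencies_s, 99),
    }


# Synthetic run: 100 requests with latencies 0.01 s .. 1.00 s over 10 s of wall clock
latencies = [i / 100 for i in range(1, 101)]
for name, value in summarize(latencies, total_tokens=5000, wall_clock_s=10.0).items():
    print(f"{name}: {value}")
```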

Bundled Resources

Detailed Guides:

references/vllm.md - vLLM setup, PagedAttention, optimization
references/tgi.md - Text Generation Inference patterns
references/bentoml.md - BentoML deployment patterns
references/langchain-orchestration.md - LangChain RAG and agents
references/inference-optimization.md - Quantization, batching, GPU tuning

Working Examples:

examples/vllm-serving/ - Complete vLLM + FastAPI streaming setup
examples/ollama-local/ - Local development with Ollama
examples/langchain-agents/ - LangChain agent patterns

Utility Scripts:

scripts/benchmark_inference.py - Throughput and latency benchmarking
scripts/validate_model_config.py - Validate deployment configurations

Common Patterns

Migration from OpenAI API

vLLM provides OpenAI-compatible endpoints for easy migration:

python

# Before (OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After (vLLM)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Same API calls work!
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)

Multi-Model Serving

Route requests to different models based on task:

python

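from fastapi import FastAPI, HTTPException

app = FastAPI()

MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",  # Fast, cheap
    "large": "meta-llama/Llama-3.1-70B-Instruct", # Accurate, expensive
    "code": "codellama/CodeLlama-34b-Instruct"    # Code-specific
}

@app.post("/chat")
async def chat(message: str, task: str = "small"):
    model = MODEL_ROUTING.get(task)
    if model is None:
        raise HTTPException(status_code=400, detail=f"Unknown task: {task}")
    # Route to appropriate vLLM instance (forwarding call not shown)
    return {"model": model}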

Cost Optimization

Track token usage:

python

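import tiktoken

def estimate_cost(text: str, model: str, price_per_1k: float) -> float:
    # tiktoken covers OpenAI tokenizers only; token counts for other model families differ
    encoding = tiktoken.encoding_for_model(model)
    tokens = len(encoding.encode(text))
    return (tokens / 1000) * price_per_1k

text = "Explain PagedAttention in two sentences."

# Compare costs
openai_cost = estimate_cost(text, "gpt-4o", 0.005)  # $0.005 per 1K tokens = $5 per 1M tokens (illustrative price)
# Self-hosted marginal cost per token is near zero, but the GPU is paid for
# whether or not it is busy, so compare against that fixed hourly cost
self_hosted_marginal_cost = 0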
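
Because self-hosting trades a per-token price for a fixed GPU bill, the useful comparison is a break-even volume. The sketch below computes it; the function names are hypothetical and the $2 per GPU-hour rate is an assumed figure for illustration, as is the $0.005 per 1K token API price.

```python
def api_cost(tokens: int, price_per_1k: float) -> float:
    """Cost of sending a number of tokens through a per-token priced API."""
    return tokens / 1000 * price_per_1k


def self_hosted_cost(hours: float, gpu_hourly_rate: float, num_gpus: int = 1) -> float:
    """Fixed cost of keeping GPUs running, independent of traffic."""
    return hours * gpu_hourly_rate * num_gpus


def break_even_tokens_per_hour(
    gpu_hourly_rate: float, price_per_1k: float, num_gpus: int = 1
) -> float:
    """Hourly token volume above which self-hosting is cheaper than the API."""
    return gpu_hourly_rate * num_gpus / price_per_1k * 1000


# Assumed prices: $2 per GPU-hour, $0.005 per 1K API tokens
threshold = break_even_tokens_per_hour(gpu_hourly_rate=2.0, price_per_1k=0.005)
print(f"Break-even: about {threshold:,.0f} tokens per hour")
print(f"API cost at 1M tokens: ${api_cost(1_000_000, 0.005):.2f}")
print(f"One GPU for one hour: ${self_hosted_cost(1, 2.0):.2f}")
```

Below the threshold the managed API is cheaper; above it, the fixed GPU cost is spread over enough tokens to win, provided the GPU can actually sustain that throughput.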

Troubleshooting

Out of GPU memory:

Reduce --max-model-len
Lower --gpu-memory-utilization (try 0.8) when other processes share the GPU
Enable quantization (--quantization awq)
Use smaller model variant

Low throughput:

Increase --gpu-memory-utilization (try 0.95)
Send requests concurrently so continuous batching (vLLM default) can fill batches
Check GPU utilization (should be >80%)
Consider tensor parallelism for multi-GPU

High latency:

Reduce batch size if using static batching
Check network latency to GPU server
Profile with scripts/benchmark_inference.py

Next Steps

Local Development: Start with examples/ollama-local/ for GPU-free testing
Production Setup: Deploy vLLM with examples/vllm-serving/
RAG Integration: Add vector DB with examples/langchain-rag-qdrant/
Kubernetes: Scale with examples/k8s-vllm-deployment/
Monitoring: Add metrics with Prometheus and Grafana

Maintainer

ancoleman Core maintainer

Source details

Full Name: ancoleman/ai-design-components
Branch: main
Path in repo: skills/model-serving
License: MIT License
Topics: claude-code anthropic claude typescript skills claude-skills react design-system ui-components full-stack ai-design

