Hardware Sizing Skill

Overview

Expertise in calculating and specifying hardware requirements for local AI deployments, including GPU selection, server configuration, storage, and network planning based on workload characteristics and team size.

Key Capabilities

  • GPU selection and sizing for LLM inference
  • Server configuration for AI workloads
  • Storage planning for models and data
  • Network bandwidth calculations
  • TCO modeling for hardware investments
  • Capacity planning and growth projections

GPU Selection Guide

NVIDIA GPU Comparison

| GPU | VRAM | FP16 TFLOPS | Bandwidth | TDP | Price (approx.) | Best For |
|-----|------|-------------|-----------|-----|-----------------|----------|
| RTX 4090 | 24GB | 82.6 | 1 TB/s | 450W | $1,600 | Small teams, dev |
| RTX A6000 | 48GB | 38.7 | 768 GB/s | 300W | $4,500 | Medium teams |
| A100 40GB | 40GB | 77.9 | 1.5 TB/s | 400W | $10,000 | Production |
| A100 80GB | 80GB | 77.9 | 2.0 TB/s | 400W | $15,000 | Large models |
| H100 80GB | 80GB | 267 | 3.35 TB/s | 700W | $30,000 | Maximum perf |
| L40S | 48GB | 91.6 | 864 GB/s | 350W | $8,000 | Balanced |

Model VRAM Requirements

| Model Size | FP16 | INT8 | INT4/AWQ | Example Models |
|------------|------|------|----------|----------------|
| 7B | 14GB | 8GB | 4GB | Qwen-Next (small variant), GLM-4.6 (small variant) |
| 13B | 26GB | 14GB | 8GB | Qwen-Next (mid variant), MiniMax-M2 (mid variant) |
| 34B | 68GB | 36GB | 18GB | Qwen-Next / GLM-4.6 (large-ish variants) |
| 70B | 140GB | 75GB | 38GB | Qwen-Next / GLM-4.6 / MiniMax-M2 (largest variants) |
| 110B | 220GB | 115GB | 58GB | Frontier-scale variants (verify availability + license) |

VRAM Formula:

VRAM Required = (Parameters × Bytes per Parameter) + Context Window Overhead
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
- Context overhead: ~2GB for 8K context, ~8GB for 32K context
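
As a sanity check, here is the formula as a minimal Python sketch (the function name and the flat overhead default are illustrative, not a library API):

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     context_overhead_gb: float = 2.0) -> float:
    """Weights footprint plus a flat context/runtime overhead.

    bytes_per_param: 2.0 (FP16), 1.0 (INT8), 0.5 (INT4).
    context_overhead_gb: ~2 GB at 8K context, ~8 GB at 32K.
    """
    return params_b * bytes_per_param + context_overhead_gb

# 70B model in INT4 at 8K context -> ~37 GB, in line with the ~38GB in the table
print(estimate_vram_gb(70, 0.5))
```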

GPU Sizing by Team Size

| Team Size | Usage Level | Model Size | Recommended GPU | Quantity |
|-----------|-------------|------------|-----------------|----------|
| 1-5 | Dev/Test | 7B-13B | RTX 4090 | 1 |
| 5-15 | Production | 13B-34B | RTX 4090 or A6000 | 1-2 |
| 15-30 | Production | 34B-70B | A100 40GB | 2 |
| 30-75 | Production | 70B | A100 80GB | 2-4 |
| 75-150 | Enterprise | 70B+ | H100 or A100 | 4-8 |
| 150+ | Enterprise | 70B+ | H100 cluster | 8+ |
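
For automation, the table above can be encoded as a simple lookup. A hypothetical sketch (the bands and picks are the rules of thumb from the table, not vendor guidance; quantities use the upper bound of each range):

```python
# (max team size, recommended GPU, quantity) per the sizing table above
SIZING_BANDS = [
    (5,   "RTX 4090",          1),
    (15,  "RTX 4090 or A6000", 2),
    (30,  "A100 40GB",         2),
    (75,  "A100 80GB",         4),
    (150, "H100 or A100",      8),
]

def recommend_gpu(team_size: int) -> tuple[str, int]:
    for max_team, gpu, quantity in SIZING_BANDS:
        if team_size <= max_team:
            return gpu, quantity
    return "H100 cluster", 8  # 150+ developers: scale out beyond a single node

print(recommend_gpu(40))  # -> ('A100 80GB', 4)
```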

Server Configuration Templates

Small Team Server (5-15 developers)

```yaml
# Small team AI server specification
server:
  type: Tower or 2U Rack

cpu:
  model: AMD EPYC 7343 or Intel Xeon Gold 5315Y
  cores: 16
  threads: 32

memory:
  type: DDR4-3200 ECC
  capacity: 128GB
  channels: 8

gpu:
  model: NVIDIA RTX 4090
  count: 1-2
  vram_total: 24-48GB
  nvlink: false

storage:
  system:
    type: NVMe SSD
    capacity: 500GB
    raid: None
  models:
    type: NVMe SSD
    capacity: 2TB
    raid: None
  logs:
    type: SATA SSD
    capacity: 2TB
    raid: 1

network:
  type: 10GbE
  ports: 2
  bonding: Active/Standby

power:
  psu: 1200W
  redundancy: Single (N)
  ups: Recommended

estimated_cost:
  hardware: $10,000 - $15,000
  annual_power: $1,500
  annual_maintenance: $1,000
```

Medium Team Server (15-50 developers)

```yaml
# Medium team AI server specification
server:
  type: 2U Rack Mount

cpu:
  model: AMD EPYC 7543 or Intel Xeon Platinum 8358
  cores: 32
  threads: 64

memory:
  type: DDR4-3200 ECC
  capacity: 256GB
  channels: 8

gpu:
  model: NVIDIA A6000 or RTX 4090
  count: 2-4
  vram_total: 96-192GB
  nvlink: Recommended for A6000

storage:
  system:
    type: NVMe SSD
    capacity: 1TB
    raid: 1
  models:
    type: NVMe SSD
    capacity: 4TB
    raid: 0
  logs:
    type: SAS SSD
    capacity: 4TB
    raid: 10

network:
  type: 25GbE
  ports: 2
  bonding: LACP

power:
  psu: 2000W
  redundancy: Redundant (N+1)
  ups: Required

estimated_cost:
  hardware: $35,000 - $60,000
  annual_power: $4,000
  annual_maintenance: $3,000
```

Enterprise Server (50-200 developers)

```yaml
# Enterprise AI server specification
server:
  type: 4U Rack Mount or DGX-style

cpu:
  model: 2x AMD EPYC 9354 or Intel Xeon Platinum 8480+
  cores: 64 total
  threads: 128

memory:
  type: DDR5-4800 ECC
  capacity: 512GB - 1TB
  channels: 12-16

gpu:
  model: NVIDIA A100 80GB or H100
  count: 4-8
  vram_total: 320-640GB
  nvlink: Required (NVSwitch for 8+ GPUs)

storage:
  system:
    type: NVMe SSD
    capacity: 2TB
    raid: 1
  models:
    type: NVMe SSD
    capacity: 8TB
    raid: 0 or 10
  logs:
    type: NVMe SSD
    capacity: 8TB
    raid: 10
  backup:
    type: SAS HDD
    capacity: 32TB
    raid: 6

network:
  type: 100GbE or InfiniBand
  ports: 2-4
  bonding: LACP

power:
  psu: 3000W+
  redundancy: Redundant (N+N)
  ups: Required with generator backup

estimated_cost:
  hardware: $150,000 - $400,000
  annual_power: $15,000 - $30,000
  annual_maintenance: $10,000 - $20,000
```

Capacity Planning

Request Volume Estimation

| Developer Usage | Requests/Day | Tokens/Request | Daily Tokens |
|-----------------|--------------|----------------|--------------|
| Light (occasional) | 20-30 | 2,000 | 40K-60K |
| Medium (regular) | 50-100 | 3,000 | 150K-300K |
| Heavy (power user) | 150-250 | 4,000 | 600K-1M |
| Intensive (AI-first) | 300-500 | 5,000 | 1.5M-2.5M |

Throughput Calculation

# Calculate required throughput

Daily Requests = Team Size × Requests per User per Day
Peak Factor = 0.1 (10% of daily load in peak hour)
Peak Requests per Minute = (Daily Requests × Peak Factor) / 60

Tokens per Request = Avg Input Tokens + Avg Output Tokens
Peak Tokens per Second = Peak Requests per Minute × Tokens per Request / 60

# Example: 50 medium-usage developers
Daily Requests = 50 × 100 = 5,000
Peak Requests/min = 5,000 × 0.1 / 60 = 8.3
Tokens/Request = 1,000 input + 2,000 output = 3,000
Peak Tokens/sec = 8.3 × 3,000 / 60 = 415 tok/s

GPU Throughput Reference

| GPU | Model Size | Throughput (tok/s) | Concurrent Requests |
|-----|------------|--------------------|---------------------|
| RTX 4090 | 7B | 100-150 | 8-12 |
| RTX 4090 | 13B | 50-80 | 4-8 |
| A100 40GB | 13B | 120-180 | 16-24 |
| A100 40GB | 34B | 60-100 | 8-16 |
| A100 80GB | 70B | 40-70 | 4-8 |
| H100 80GB | 70B | 100-150 | 8-16 |
| 2x A100 80GB | 70B (TP=2) | 80-140 | 8-16 |

Sizing Formula

Required GPUs = Peak Tokens/sec / Single GPU Throughput × Safety Factor

Safety Factor = 1.3 (30% headroom for spikes)

# Example: 415 tok/s needed for 70B model
Single A100 80GB throughput = 55 tok/s average
Required GPUs = 415 / 55 × 1.3 = 9.8 → 10 A100 80GB

# OR with 2-GPU tensor parallel:
TP=2 throughput = 110 tok/s
Required TP pairs = 415 / 110 × 1.3 = 4.9 → 5 pairs (10 GPUs)
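
The same arithmetic as a runnable helper (`math.ceil` rounds up to whole GPUs; throughput inputs come from the reference table above):

```python
import math

def gpus_required(peak_tokens_per_sec: float, unit_tokens_per_sec: float,
                  safety_factor: float = 1.3) -> int:
    """Peak demand over per-GPU (or per-TP-group) throughput, with 30% headroom."""
    return math.ceil(peak_tokens_per_sec / unit_tokens_per_sec * safety_factor)

# Worked example from above: 415 tok/s against a 70B model
print(gpus_required(415, 55))   # single A100 80GB -> 10 GPUs
print(gpus_required(415, 110))  # TP=2 pairs -> 5 pairs (10 GPUs)
```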

Storage Planning

Model Storage Requirements

| Model Size | Weights (FP16) | Weights (INT4) | With Tokenizer |
|------------|----------------|----------------|----------------|
| 7B | 14GB | 4GB | +500MB |
| 13B | 26GB | 7GB | +500MB |
| 34B | 68GB | 18GB | +500MB |
| 70B | 140GB | 38GB | +500MB |
| 100B+ | 200GB+ | 50GB+ | +1GB |

Storage Architecture

```yaml
storage_tiers:
  tier1_hot:  # Active models
    type: NVMe SSD
    iops: 500K+
    latency: <0.1ms
    purpose: Currently loaded models, active inference
    sizing: 2x largest model size

  tier2_warm:  # Standby models
    type: SATA SSD or NVMe
    iops: 50K+
    latency: <1ms
    purpose: Quick-loading alternate models
    sizing: 5-10x model sizes for model library

  tier3_cold:  # Archives
    type: HDD or object storage
    purpose: Model version history, backups
    sizing: 3x warm storage for versioning

log_storage:
  type: SSD (fast write)
  sizing: |
    Daily logs = Requests/day × 2KB average
    Monthly = Daily × 30
    Retention storage = Monthly × Retention months
```
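
The log-sizing rule above as a small helper (the 2KB/request average and the retention default are assumptions to replace with measured values):

```python
def log_storage_gb(requests_per_day: int, avg_kb_per_request: float = 2.0,
                   retention_months: int = 6) -> float:
    """Daily log volume x 30 days x retention, converted from KB to decimal GB."""
    daily_gb = requests_per_day * avg_kb_per_request / 1_000_000
    return daily_gb * 30 * retention_months

# 5,000 requests/day with 6-month retention -> ~1.8 GB; logs are rarely the bottleneck
print(log_storage_gb(5_000))
```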

Network Planning

Bandwidth Requirements

| Component | Traffic Type | Bandwidth Need |
|-----------|--------------|----------------|
| API Requests | Client → Server | 1-10 Mbps per concurrent user |
| Responses | Server → Client | 5-50 Mbps per concurrent user |
| Model Loading | Storage → GPU | 10+ Gbps (reduces load time) |
| Monitoring | Server → Collector | 10-100 Mbps |
| Replication | Server → Backup | Varies by backup frequency |
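
To see why 10+ Gbps matters for model loading: load time is roughly model size over effective link throughput. A sketch (the 70% efficiency factor is an assumed discount for protocol and filesystem overhead):

```python
def load_time_seconds(model_gb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Time to stream model weights from storage to the inference host over one link."""
    effective_gb_per_s = link_gbps / 8 * efficiency
    return model_gb / effective_gb_per_s

# 140 GB of 70B FP16 weights: ~160 s at 10 Gbps vs ~16 s at 100 Gbps
print(load_time_seconds(140, 10), load_time_seconds(140, 100))
```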

Network Architecture

┌─────────────────────────────────────────────────────┐
│                  Corporate Network                   │
│                     (10 GbE)                         │
└──────────────────────┬──────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────┐
│                  Load Balancer                       │
│               (25-100 GbE uplink)                    │
└──────────────────────┬──────────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ AI Node │ ◄──(25 GbE)──► │ AI Node │
    │   #1    │                 │   #2    │
    └────┬────┘                 └────┬────┘
         │                           │
         └─────────────┬─────────────┘
                       │
              ┌────────┴────────┐
              │  Storage Array  │
              │   (100 GbE)     │
              └─────────────────┘

TCO Calculation

Hardware TCO Template

```markdown
## 3-Year Total Cost of Ownership

### Capital Expenditure (CapEx)
| Item | Unit Cost | Quantity | Total |
|------|-----------|----------|-------|
| Server (compute) | $15,000 | 2 | $30,000 |
| GPUs (A100 80GB) | $15,000 | 4 | $60,000 |
| Storage (NVMe) | $500/TB | 8TB | $4,000 |
| Network equipment | $5,000 | 1 | $5,000 |
| Installation | $2,000 | 1 | $2,000 |
| **CapEx Total** | | | **$101,000** |

### Operating Expenses (OpEx) - Annual
| Item | Monthly | Annual |
|------|---------|--------|
| Power (3kW average) | $400 | $4,800 |
| Cooling | $100 | $1,200 |
| Maintenance/support | $500 | $6,000 |
| Hosting/colocation | $1,000 | $12,000 |
| Admin labor (0.25 FTE) | $2,500 | $30,000 |
| **OpEx Total** | **$4,500** | **$54,000** |

### 3-Year TCO
| Year | CapEx | OpEx | Cumulative |
|------|-------|------|------------|
| Year 1 | $101,000 | $54,000 | $155,000 |
| Year 2 | $0 | $54,000 | $209,000 |
| Year 3 | $0 | $54,000 | $263,000 |

### Per-Request Cost (at 500K requests/month)
Year 1: $155,000 / 6M requests = $0.026/request
Year 3: $263,000 / 18M requests = $0.015/request (amortized)
```

Cloud API Cost Comparison

```markdown
## Local vs Cloud Cost Comparison

### Assumptions
- 50 developers, medium usage
- 100 requests/dev/day = 5,000 requests/day
- 3,000 tokens/request average
- 15M tokens/day = 450M tokens/month

### Cloud API Costs (GPT-4o-mini pricing)
- Input: $0.15/1M tokens × 150M = $22.50/month
- Output: $0.60/1M tokens × 300M = $180/month
- Total: ~$200/month = $2,400/year

### Cloud API Costs (GPT-4o pricing)
- Input: $2.50/1M tokens × 150M = $375/month
- Output: $10.00/1M tokens × 300M = $3,000/month
- Total: ~$3,375/month = $40,500/year

### Local Large Model (Qwen-Next / MiniMax-M2 / GLM-4.6)
- Year 1 TCO: $155,000
- Equivalent cloud cost: $40,500/year
- Naive breakeven (Year 1 TCO ÷ annual cloud cost): ~3.8 years
- Caveat: at this volume, annual OpEx ($54,000) exceeds the GPT-4o bill ($40,500), so a raw-cost breakeven requires higher usage (see the calculator below)

### With Data Sovereignty Requirements
If data cannot leave the premises for regulatory or contractual reasons, cloud APIs are not an option; local deployment is a requirement rather than a cost decision.
```
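
A small calculator tying the two comparisons together (inputs are the template figures above; the higher-usage cloud bill on the last line is an illustrative assumption):

```python
def local_tco(years: int, capex: float, annual_opex: float) -> float:
    """Cumulative local cost: CapEx up front plus flat annual OpEx."""
    return capex + annual_opex * years

def breakeven_years(capex: float, annual_opex: float, annual_cloud: float) -> float | None:
    """Years until cumulative local cost undercuts cloud; None if cloud stays cheaper."""
    annual_saving = annual_cloud - annual_opex
    return capex / annual_saving if annual_saving > 0 else None

print(local_tco(3, 101_000, 54_000))              # -> 263000 (matches the TCO table)
print(breakeven_years(101_000, 54_000, 40_500))   # -> None: OpEx alone exceeds this cloud bill
print(breakeven_years(101_000, 54_000, 120_000))  # -> ~1.5 years at roughly 3x the usage
```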

Scaling Strategy

Horizontal Scaling Triggers

| Metric | Add Capacity When | Scale Strategy |
|--------|-------------------|----------------|
| GPU Utilization | >80% sustained | Add GPU or node |
| Queue Depth | >10 requests sustained | Add replica |
| P95 Latency | >5s sustained | Add GPU for parallelism |
| Memory Pressure | >90% VRAM | Larger GPU or quantization |
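
These triggers can be polled directly from the GPUs. A minimal probe using the NVML Python bindings (`pip install nvidia-ml-py`); a production version would average over a time window to catch *sustained* breaches rather than single samples:

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu        # percent busy
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    vram_pct = 100 * mem.used / mem.total
    if util > 80 or vram_pct > 90:                                 # thresholds from the table
        print(f"GPU {i}: util={util}%, VRAM={vram_pct:.0f}% -> consider adding capacity")
pynvml.nvmlShutdown()
```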

Vertical Scaling Path

Stage 1: Single RTX 4090 (24GB)
   ↓ Need more VRAM
Stage 2: Single A6000 (48GB)
   ↓ Need more throughput
Stage 3: 2x A6000 with tensor parallel
   ↓ Need larger models
Stage 4: 2x A100 80GB
   ↓ Need more throughput
Stage 5: 4x A100 80GB with NVLink
   ↓ Need maximum performance
Stage 6: 8x H100 with NVSwitch

Best Practices

Procurement

  1. Budget 20% contingency for unexpected needs
  2. Test before bulk purchase with single unit
  3. Consider used enterprise GPUs (A100s at 50% cost)
  4. Plan for 3-year lifecycle (hardware depreciation)
  5. Include installation and training in budget

Deployment

  1. Start small, scale up - validate before expanding
  2. Keep 30% headroom for traffic spikes
  3. Plan upgrade path before initial deployment
  4. Document all specifications for future reference

Monitoring

  1. Track utilization trends weekly
  2. Plan capacity 3-6 months ahead
  3. Review TCO quarterly against cloud alternatives
  4. Update sizing models with actual usage data

This skill ensures organizations size hardware appropriately for their AI workloads, optimizing for both performance and cost-effectiveness.
