Hardware Sizing Skill

Overview

Expertise in calculating and specifying hardware requirements for local AI deployments, including GPU selection, server configuration, storage, and network planning based on workload characteristics and team size.

Key Capabilities

  • GPU selection and sizing for LLM inference
  • Server configuration for AI workloads
  • Storage planning for models and data
  • Network bandwidth calculations
  • TCO modeling for hardware investments
  • Capacity planning and growth projections

GPU Selection Guide

NVIDIA GPU Comparison

| GPU | VRAM | FP16 TFLOPS | Bandwidth | TDP | Price (approx.) | Best For |
|-----|------|-------------|-----------|-----|-----------------|----------|
| RTX 4090 | 24GB | 82.6 | 1 TB/s | 450W | $1,600 | Small teams, dev |
| RTX A6000 | 48GB | 38.7 | 768 GB/s | 300W | $4,500 | Medium teams |
| A100 40GB | 40GB | 77.9 | 1.5 TB/s | 400W | $10,000 | Production |
| A100 80GB | 80GB | 77.9 | 2.0 TB/s | 400W | $15,000 | Large models |
| H100 80GB | 80GB | 267 | 3.35 TB/s | 700W | $30,000 | Maximum perf |
| L40S | 48GB | 91.6 | 864 GB/s | 350W | $8,000 | Balanced |

Model VRAM Requirements

| Model Size | FP16 | INT8 | INT4/AWQ | Example Models |
|------------|------|------|----------|----------------|
| 7B | 14GB | 8GB | 4GB | Qwen-Next (small variant), GLM-4.6 (small variant) |
| 13B | 26GB | 14GB | 8GB | Qwen-Next (mid variant), MiniMax-M2 (mid variant) |
| 34B | 68GB | 36GB | 18GB | Qwen-Next / GLM-4.6 (large-ish variants) |
| 70B | 140GB | 75GB | 38GB | Qwen-Next / GLM-4.6 / MiniMax-M2 (largest variants) |
| 110B | 220GB | 115GB | 58GB | Frontier-scale variants (verify availability + license) |

VRAM Formula:

VRAM Required = (Parameters × Bytes per Parameter) + Context Window Overhead
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
- Context overhead: ~2GB for 8K context, ~8GB for 32K context
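
As a sanity check, here is the formula as a minimal Python sketch (the function name and the flat overhead default are illustrative, not a library API):

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     context_overhead_gb: float = 2.0) -> float:
    """Weights footprint plus a flat context/runtime overhead.

    bytes_per_param: 2.0 (FP16), 1.0 (INT8), 0.5 (INT4).
    context_overhead_gb: ~2 GB at 8K context, ~8 GB at 32K.
    """
    return params_b * bytes_per_param + context_overhead_gb

# 70B model in INT4 at 8K context -> ~37 GB, in line with the ~38GB in the table
print(estimate_vram_gb(70, 0.5))
```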

GPU Sizing by Team Size

| Team Size | Usage Level | Model Size | Recommended GPU | Quantity |
|-----------|-------------|------------|-----------------|----------|
| 1-5 | Dev/Test | 7B-13B | RTX 4090 | 1 |
| 5-15 | Production | 13B-34B | RTX 4090 or A6000 | 1-2 |
| 15-30 | Production | 34B-70B | A100 40GB | 2 |
| 30-75 | Production | 70B | A100 80GB | 2-4 |
| 75-150 | Enterprise | 70B+ | H100 or A100 | 4-8 |
| 150+ | Enterprise | 70B+ | H100 cluster | 8+ |
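
For automation, the table above can be encoded as a simple lookup. A hypothetical sketch (the bands and picks are the rules of thumb from the table, not vendor guidance; quantities use the upper bound of each range):

```python
# (max team size, recommended GPU, quantity) per the sizing table above
SIZING_BANDS = [
    (5,   "RTX 4090",          1),
    (15,  "RTX 4090 or A6000", 2),
    (30,  "A100 40GB",         2),
    (75,  "A100 80GB",         4),
    (150, "H100 or A100",      8),
]

def recommend_gpu(team_size: int) -> tuple[str, int]:
    for max_team, gpu, quantity in SIZING_BANDS:
        if team_size <= max_team:
            return gpu, quantity
    return "H100 cluster", 8  # 150+ developers: scale out beyond a single node

print(recommend_gpu(40))  # -> ('A100 80GB', 4)
```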

Server Configuration Templates

Small Team Server (5-15 developers)

```yaml
# Small team AI server specification
server:
  type: Tower or 2U Rack

cpu:
  model: AMD EPYC 7343 or Intel Xeon Gold 5315Y
  cores: 16
  threads: 32

memory:
  type: DDR4-3200 ECC
  capacity: 128GB
  channels: 8

gpu:
  model: NVIDIA RTX 4090
  count: 1-2
  vram_total: 24-48GB
  nvlink: false

storage:
  system:
    type: NVMe SSD
    capacity: 500GB
    raid: None
  models:
    type: NVMe SSD
    capacity: 2TB
    raid: None
  logs:
    type: SATA SSD
    capacity: 2TB
    raid: 1

network:
  type: 10GbE
  ports: 2
  bonding: Active/Standby

power:
  psu: 1200W
  redundancy: Single (N)
  ups: Recommended

estimated_cost:
  hardware: $10,000 - $15,000
  annual_power: $1,500
  annual_maintenance: $1,000
```

Medium Team Server (15-50 developers)

```yaml
# Medium team AI server specification
server:
  type: 2U Rack Mount

cpu:
  model: AMD EPYC 7543 or Intel Xeon Platinum 8358
  cores: 32
  threads: 64

memory:
  type: DDR4-3200 ECC
  capacity: 256GB
  channels: 8

gpu:
  model: NVIDIA A6000 or RTX 4090
  count: 2-4
  vram_total: 96-192GB
  nvlink: Recommended for A6000

storage:
  system:
    type: NVMe SSD
    capacity: 1TB
    raid: 1
  models:
    type: NVMe SSD
    capacity: 4TB
    raid: 0
  logs:
    type: SAS SSD
    capacity: 4TB
    raid: 10

network:
  type: 25GbE
  ports: 2
  bonding: LACP

power:
  psu: 2000W
  redundancy: Redundant (N+1)
  ups: Required

estimated_cost:
  hardware: $35,000 - $60,000
  annual_power: $4,000
  annual_maintenance: $3,000
```

Enterprise Server (50-200 developers)

```yaml
# Enterprise AI server specification
server:
  type: 4U Rack Mount or DGX-style

cpu:
  model: 2x AMD EPYC 9354 or Intel Xeon Platinum 8480+
  cores: 64 total
  threads: 128

memory:
  type: DDR5-4800 ECC
  capacity: 512GB - 1TB
  channels: 12-16

gpu:
  model: NVIDIA A100 80GB or H100
  count: 4-8
  vram_total: 320-640GB
  nvlink: Required (NVSwitch for 8+ GPUs)

storage:
  system:
    type: NVMe SSD
    capacity: 2TB
    raid: 1
  models:
    type: NVMe SSD
    capacity: 8TB
    raid: 0 or 10
  logs:
    type: NVMe SSD
    capacity: 8TB
    raid: 10
  backup:
    type: SAS HDD
    capacity: 32TB
    raid: 6

network:
  type: 100GbE or InfiniBand
  ports: 2-4
  bonding: LACP

power:
  psu: 3000W+
  redundancy: Redundant (N+N)
  ups: Required with generator backup

estimated_cost:
  hardware: $150,000 - $400,000
  annual_power: $15,000 - $30,000
  annual_maintenance: $10,000 - $20,000
```

Capacity Planning

Request Volume Estimation

| Developer Usage | Requests/Day | Tokens/Request | Daily Tokens |
|-----------------|--------------|----------------|--------------|
| Light (occasional) | 20-30 | 2,000 | 40K-60K |
| Medium (regular) | 50-100 | 3,000 | 150K-300K |
| Heavy (power user) | 150-250 | 4,000 | 600K-1M |
| Intensive (AI-first) | 300-500 | 5,000 | 1.5M-2.5M |

Throughput Calculation

# Calculate required throughput

Daily Requests = Team Size × Requests per User per Day
Peak Factor = 0.1 (10% of daily load in peak hour)
Peak Requests per Minute = (Daily Requests × Peak Factor) / 60

Tokens per Request = Avg Input Tokens + Avg Output Tokens
Peak Tokens per Second = Peak Requests per Minute × Tokens per Request / 60

# Example: 50 medium-usage developers
Daily Requests = 50 × 100 = 5,000
Peak Requests/min = 5,000 × 0.1 / 60 = 8.3
Tokens/Request = 1,000 input + 2,000 output = 3,000
Peak Tokens/sec = 8.3 × 3,000 / 60 = 415 tok/s

GPU Throughput Reference

| GPU | Model Size | Throughput (tok/s) | Concurrent Requests |
|-----|------------|--------------------|---------------------|
| RTX 4090 | 7B | 100-150 | 8-12 |
| RTX 4090 | 13B | 50-80 | 4-8 |
| A100 40GB | 13B | 120-180 | 16-24 |
| A100 40GB | 34B | 60-100 | 8-16 |
| A100 80GB | 70B | 40-70 | 4-8 |
| H100 80GB | 70B | 100-150 | 8-16 |
| 2x A100 80GB | 70B (TP=2) | 80-140 | 8-16 |

Sizing Formula

Required GPUs = Peak Tokens/sec / Single GPU Throughput × Safety Factor

Safety Factor = 1.3 (30% headroom for spikes)

# Example: 415 tok/s needed for 70B model
Single A100 80GB throughput = 55 tok/s average
Required GPUs = 415 / 55 × 1.3 = 9.8 → 10 A100 80GB

# OR with 2-GPU tensor parallel:
TP=2 throughput = 110 tok/s
Required TP pairs = 415 / 110 × 1.3 = 4.9 → 5 pairs (10 GPUs)
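
The same arithmetic as a runnable helper (`math.ceil` rounds up to whole GPUs; throughput inputs come from the reference table above):

```python
import math

def gpus_required(peak_tokens_per_sec: float, unit_tokens_per_sec: float,
                  safety_factor: float = 1.3) -> int:
    """Peak demand over per-GPU (or per-TP-group) throughput, with 30% headroom."""
    return math.ceil(peak_tokens_per_sec / unit_tokens_per_sec * safety_factor)

# Worked example from above: 415 tok/s against a 70B model
print(gpus_required(415, 55))   # single A100 80GB -> 10 GPUs
print(gpus_required(415, 110))  # TP=2 pairs -> 5 pairs (10 GPUs)
```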

Storage Planning

Model Storage Requirements

| Model Size | Weights (FP16) | Weights (INT4) | With Tokenizer |
|------------|----------------|----------------|----------------|
| 7B | 14GB | 4GB | +500MB |
| 13B | 26GB | 7GB | +500MB |
| 34B | 68GB | 18GB | +500MB |
| 70B | 140GB | 38GB | +500MB |
| 100B+ | 200GB+ | 50GB+ | +1GB |

Storage Architecture

```yaml
storage_tiers:
  tier1_hot:  # Active models
    type: NVMe SSD
    iops: 500K+
    latency: <0.1ms
    purpose: Currently loaded models, active inference
    sizing: 2x largest model size

  tier2_warm:  # Standby models
    type: SATA SSD or NVMe
    iops: 50K+
    latency: <1ms
    purpose: Quick-loading alternate models
    sizing: 5-10x model sizes for model library

  tier3_cold:  # Archives
    type: HDD or object storage
    purpose: Model version history, backups
    sizing: 3x warm storage for versioning

log_storage:
  type: SSD (fast write)
  sizing: |
    Daily logs = Requests/day × 2KB average
    Monthly = Daily × 30
    Retention storage = Monthly × Retention months
```
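
The log-sizing rule above as a small helper (the 2KB/request average and the retention default are assumptions to replace with measured values):

```python
def log_storage_gb(requests_per_day: int, avg_kb_per_request: float = 2.0,
                   retention_months: int = 6) -> float:
    """Daily log volume x 30 days x retention, converted from KB to decimal GB."""
    daily_gb = requests_per_day * avg_kb_per_request / 1_000_000
    return daily_gb * 30 * retention_months

# 5,000 requests/day with 6-month retention -> ~1.8 GB; logs are rarely the bottleneck
print(log_storage_gb(5_000))
```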

Network Planning

Bandwidth Requirements

| Component | Traffic Type | Bandwidth Need |
|-----------|--------------|----------------|
| API Requests | Client → Server | 1-10 Mbps per concurrent user |
| Responses | Server → Client | 5-50 Mbps per concurrent user |
| Model Loading | Storage → GPU | 10+ Gbps (reduces load time) |
| Monitoring | Server → Collector | 10-100 Mbps |
| Replication | Server → Backup | Varies by backup frequency |
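
To see why 10+ Gbps matters for model loading: load time is roughly model size over effective link throughput. A sketch (the 70% efficiency factor is an assumed discount for protocol and filesystem overhead):

```python
def load_time_seconds(model_gb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Time to stream model weights from storage to the inference host over one link."""
    effective_gb_per_s = link_gbps / 8 * efficiency
    return model_gb / effective_gb_per_s

# 140 GB of 70B FP16 weights: ~160 s at 10 Gbps vs ~16 s at 100 Gbps
print(load_time_seconds(140, 10), load_time_seconds(140, 100))
```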

Network Architecture

┌─────────────────────────────────────────────────────┐
│                  Corporate Network                   │
│                     (10 GbE)                         │
└──────────────────────┬──────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────┐
│                  Load Balancer                       │
│               (25-100 GbE uplink)                    │
└──────────────────────┬──────────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ AI Node │ ◄──(25 GbE)──► │ AI Node │
    │   #1    │                 │   #2    │
    └────┬────┘                 └────┬────┘
         │                           │
         └─────────────┬─────────────┘
                       │
              ┌────────┴────────┐
              │  Storage Array  │
              │   (100 GbE)     │
              └─────────────────┘

TCO Calculation

Hardware TCO Template

```markdown
## 3-Year Total Cost of Ownership

### Capital Expenditure (CapEx)
| Item | Unit Cost | Quantity | Total |
|------|-----------|----------|-------|
| Server (compute) | $15,000 | 2 | $30,000 |
| GPUs (A100 80GB) | $15,000 | 4 | $60,000 |
| Storage (NVMe) | $500/TB | 8TB | $4,000 |
| Network equipment | $5,000 | 1 | $5,000 |
| Installation | $2,000 | 1 | $2,000 |
| **CapEx Total** | | | **$101,000** |

### Operating Expenses (OpEx) - Annual
| Item | Monthly | Annual |
|------|---------|--------|
| Power (3kW average) | $400 | $4,800 |
| Cooling | $100 | $1,200 |
| Maintenance/support | $500 | $6,000 |
| Hosting/colocation | $1,000 | $12,000 |
| Admin labor (0.25 FTE) | $2,500 | $30,000 |
| **OpEx Total** | **$4,500** | **$54,000** |

### 3-Year TCO
| Year | CapEx | OpEx | Cumulative |
|------|-------|------|------------|
| Year 1 | $101,000 | $54,000 | $155,000 |
| Year 2 | $0 | $54,000 | $209,000 |
| Year 3 | $0 | $54,000 | $263,000 |

### Per-Request Cost (at 500K requests/month)
Year 1: $155,000 / 6M requests = $0.026/request
Year 3: $263,000 / 18M requests = $0.015/request (amortized)
```

Cloud API Cost Comparison

```markdown
## Local vs Cloud Cost Comparison

### Assumptions
- 50 developers, medium usage
- 100 requests/dev/day = 5,000 requests/day
- 3,000 tokens/request average
- 15M tokens/day = 450M tokens/month

### Cloud API Costs (GPT-4o-mini pricing)
- Input: $0.15/1M tokens × 150M = $22.50/month
- Output: $0.60/1M tokens × 300M = $180/month
- Total: ~$200/month = $2,400/year

### Cloud API Costs (GPT-4o pricing)
- Input: $2.50/1M tokens × 150M = $375/month
- Output: $10.00/1M tokens × 300M = $3,000/month
- Total: ~$3,375/month = $40,500/year

### Local Large Model (Qwen-Next / MiniMax-M2 / GLM-4.6)
- Year 1 TCO: $155,000
- Equivalent cloud cost: $40,500/year
- Naive breakeven (Year 1 TCO ÷ annual cloud cost): ~3.8 years
- Caveat: at this volume, annual OpEx ($54,000) exceeds the GPT-4o bill ($40,500), so a raw-cost breakeven requires higher usage (see the calculator below)

### With Data Sovereignty Requirements
If data cannot leave the premises for regulatory or contractual reasons, cloud APIs are not an option; local deployment is a requirement rather than a cost decision.
```
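
A small calculator tying the two comparisons together (inputs are the template figures above; the higher-usage cloud bill on the last line is an illustrative assumption):

```python
def local_tco(years: int, capex: float, annual_opex: float) -> float:
    """Cumulative local cost: CapEx up front plus flat annual OpEx."""
    return capex + annual_opex * years

def breakeven_years(capex: float, annual_opex: float, annual_cloud: float) -> float | None:
    """Years until cumulative local cost undercuts cloud; None if cloud stays cheaper."""
    annual_saving = annual_cloud - annual_opex
    return capex / annual_saving if annual_saving > 0 else None

print(local_tco(3, 101_000, 54_000))              # -> 263000 (matches the TCO table)
print(breakeven_years(101_000, 54_000, 40_500))   # -> None: OpEx alone exceeds this cloud bill
print(breakeven_years(101_000, 54_000, 120_000))  # -> ~1.5 years at roughly 3x the usage
```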

Scaling Strategy

Horizontal Scaling Triggers

| Metric | Add Capacity When | Scale Strategy |
|--------|-------------------|----------------|
| GPU Utilization | >80% sustained | Add GPU or node |
| Queue Depth | >10 requests sustained | Add replica |
| P95 Latency | >5s sustained | Add GPU for parallelism |
| Memory Pressure | >90% VRAM | Larger GPU or quantization |
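
These triggers can be polled directly from the GPUs. A minimal probe using the NVML Python bindings (`pip install nvidia-ml-py`); a production version would average over a time window to catch *sustained* breaches rather than single samples:

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu        # percent busy
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    vram_pct = 100 * mem.used / mem.total
    if util > 80 or vram_pct > 90:                                 # thresholds from the table
        print(f"GPU {i}: util={util}%, VRAM={vram_pct:.0f}% -> consider adding capacity")
pynvml.nvmlShutdown()
```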

Vertical Scaling Path

Stage 1: Single RTX 4090 (24GB)
   ↓ Need more VRAM
Stage 2: Single A6000 (48GB)
   ↓ Need more throughput
Stage 3: 2x A6000 with tensor parallel
   ↓ Need larger models
Stage 4: 2x A100 80GB
   ↓ Need more throughput
Stage 5: 4x A100 80GB with NVLink
   ↓ Need maximum performance
Stage 6: 8x H100 with NVSwitch

Best Practices

Procurement

  1. Budget 20% contingency for unexpected needs
  2. Test before bulk purchase with single unit
  3. Consider used enterprise GPUs (A100s at 50% cost)
  4. Plan for 3-year lifecycle (hardware depreciation)
  5. Include installation and training in budget

Deployment

  1. Start small, scale up - validate before expanding
  2. Keep 30% headroom for traffic spikes
  3. Plan upgrade path before initial deployment
  4. Document all specifications for future reference

Monitoring

  1. Track utilization trends weekly
  2. Plan capacity 3-6 months ahead
  3. Review TCO quarterly against cloud alternatives
  4. Update sizing models with actual usage data

This skill ensures organizations size hardware appropriately for their AI workloads, optimizing for both performance and cost-effectiveness.
