Agent skill

gpu-aware-training-config

GPU-aware PPO training configuration for A100/H100. Trigger when training is slow or GPU utilization is low.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/gpu-aware-training-config

SKILL.md

GPU-Aware Training Configuration

Experiment Overview

| Item | Details |
| --- | --- |
| Date | 2025-12-18 |
| Goal | Fix extremely slow A100 training (FPS 4,500 vs expected 30,000-50,000) |
| Environment | Google Colab A100, PyTorch 2.x, CUDA |
| Status | Success - 10x+ speedup achieved |

Context

Training was extremely slow on a Colab A100 GPU, even in "quick_test" mode. Investigation showed that get_auto_config(training_mode="quick_test") returned a generic config with n_envs=256 and compile_policy=False (torch.compile disabled), ignoring the GPU's capabilities entirely.

Root Cause

The original get_auto_config() function had training modes that completely bypassed GPU detection:

python
# WRONG - ignores GPU capabilities
def get_auto_config(total_timesteps, training_mode="auto"):
    if training_mode == "quick_test":
        return NativePPOConfig(
            n_envs=256,           # Too low for A100!
            compile_policy=False,  # Missing 3-6x speedup!
            # ... generic settings
        )

Verified Solution

Training modes must layer on top of GPU-specific settings, not replace them:

python
def get_auto_config(total_timesteps=1_000_000, training_mode="auto"):
    # Step 1: ALWAYS detect GPU first
    gpu_tier = _detect_gpu_tier()  # "h100", "a100", "high", "medium", "low", or "cpu"

    # Step 2: Get GPU-appropriate base config
    if gpu_tier == "h100":
        config = _get_h100_base_config()
    elif gpu_tier == "a100":
        config = _get_a100_base_config()
    # ... etc

    # Step 3: Apply training mode ADJUSTMENTS (not replacements)
    if training_mode == "quick_test":
        config.total_timesteps = 10_000_000
        config.validation_interval = 25
        # BUT KEEP GPU-specific n_envs, compile_policy, etc.!

    return config
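The layering pattern above can be sketched end to end as follows. This is a minimal, self-contained illustration and not the project's actual code: the `PPOConfig` dataclass and its field names are stand-ins for `NativePPOConfig`, `build_config` is a hypothetical name, the hardware values are copied from the GPU matrix in this document, and the GPU tier is passed in as an argument instead of being detected so the example runs without a GPU.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical stand-in for NativePPOConfig; field names are assumptions."""
    n_envs: int
    minibatch_size: int
    compile_policy: bool
    total_timesteps: int = 1_000_000
    n_epochs: int = 10
    validation_interval: int = 50


# Hardware-specific base configs (values taken from the GPU matrix in this document).
BASE_CONFIGS = {
    "a100": PPOConfig(n_envs=1024, minibatch_size=4096, compile_policy=True),
    "generic": PPOConfig(n_envs=256, minibatch_size=2048, compile_policy=False),
}

# Mode overrides touch schedule parameters only, never hardware parameters.
MODE_OVERRIDES = {
    "quick_test": {"total_timesteps": 10_000_000, "validation_interval": 25},
}


def build_config(gpu_tier: str, training_mode: str = "auto") -> PPOConfig:
    """Start from the GPU base config, then layer the mode's adjustments on top."""
    base = BASE_CONFIGS.get(gpu_tier, BASE_CONFIGS["generic"])
    overrides = MODE_OVERRIDES.get(training_mode, {})
    return replace(base, **overrides)


config = build_config("a100", "quick_test")
print(config.n_envs, config.compile_policy, config.total_timesteps)
```

Because the mode only supplies overrides, the A100's `n_envs` and `compile_policy` survive the switch to `quick_test`, which is the property the original function lacked.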

GPU Configuration Matrix

| GPU Tier | n_envs | n_steps | minibatch | compile | FP8 | Expected FPS |
| --- | --- | --- | --- | --- | --- | --- |
| H100-80GB | 2048 | 512 | 8192 | True | True | 80,000-120,000 |
| A100-80GB | 2048 | 512 | 8192 | True | False | 50,000-80,000 |
| A100-40GB | 1024 | 512 | 4096 | True | False | 40,000-60,000 |
| RTX 4090 | 1024 | 512 | 4096 | True | False | 30,000-50,000 |
| RTX 3090 | 512 | 512 | 2048 | True | False | 20,000-35,000 |
| Generic | 256 | 512 | 2048 | False | False | 5,000-15,000 |
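One way to read the matrix is through the rollout arithmetic it implies. The sketch below encodes the rows as data and computes the transitions collected per rollout and the number of minibatches per epoch. It assumes a conventional PPO rollout buffer of n_envs × n_steps transitions split into fixed-size minibatches, which the source does not spell out; `GPU_MATRIX` and `rollout_stats` are illustrative names.

```python
# Rows of the GPU configuration matrix: (n_envs, n_steps, minibatch_size).
GPU_MATRIX = {
    "H100-80GB": (2048, 512, 8192),
    "A100-80GB": (2048, 512, 8192),
    "A100-40GB": (1024, 512, 4096),
    "RTX 4090": (1024, 512, 4096),
    "RTX 3090": (512, 512, 2048),
    "Generic": (256, 512, 2048),
}


def rollout_stats(n_envs: int, n_steps: int, minibatch_size: int) -> tuple[int, int]:
    """Return (transitions per rollout, minibatches per epoch)."""
    rollout_size = n_envs * n_steps
    return rollout_size, rollout_size // minibatch_size


for gpu, row in GPU_MATRIX.items():
    size, n_minibatches = rollout_stats(*row)
    print(f"{gpu:10s} rollout={size:>9,d} minibatches/epoch={n_minibatches}")
```

Under that assumption, every tuned tier works out to 128 minibatches per epoch, and only the generic row differs (64): the minibatch size scales with n_envs so the number of optimizer steps per epoch stays constant.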

Training Mode Adjustments

Training modes should ONLY adjust these parameters:

| Mode | timesteps | n_epochs | validation_interval | Notes |
| --- | --- | --- | --- | --- |
| quick_test | 10M | 10 | 25 | Fast iteration |
| standard | 50M | 12 | 50 | Development |
| production | 200M | 15 | 100 | Full training |
| extended | 500M | 20 | 200 | Maximum learning |
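The "ONLY these parameters" rule can be enforced in code, so that a mode which tries to touch a hardware setting fails loudly. The following is a sketch under assumptions: the config is modelled as a plain dict, and `MODE_PARAMS`, `TRAINING_MODES`, and `apply_mode` are hypothetical names, not part of the original project.

```python
# Parameters a training mode is allowed to change (schedule only).
MODE_PARAMS = {"total_timesteps", "n_epochs", "validation_interval"}

# The mode table from this document, expressed as data.
TRAINING_MODES = {
    "quick_test": {"total_timesteps": 10_000_000, "n_epochs": 10, "validation_interval": 25},
    "standard": {"total_timesteps": 50_000_000, "n_epochs": 12, "validation_interval": 50},
    "production": {"total_timesteps": 200_000_000, "n_epochs": 15, "validation_interval": 100},
    "extended": {"total_timesteps": 500_000_000, "n_epochs": 20, "validation_interval": 200},
}


def apply_mode(config: dict, mode: str) -> dict:
    """Return a copy of config with the mode's schedule parameters applied."""
    overrides = TRAINING_MODES[mode]
    illegal = set(overrides) - MODE_PARAMS
    if illegal:
        raise ValueError(f"mode {mode!r} overrides hardware parameters: {sorted(illegal)}")
    return {**config, **overrides}


a100_base = {"n_envs": 1024, "minibatch_size": 4096, "compile_policy": True}
print(apply_mode(a100_base, "production"))
```

Keeping the allow-list separate from the mode table means a new mode added later cannot quietly reintroduce the original bug.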

GPU Detection Code

python
import torch


def _detect_gpu_tier() -> str:
    """Detect GPU tier for optimal configuration."""
    if not torch.cuda.is_available():
        return "cpu"

    gpu_name = torch.cuda.get_device_name(0).lower()
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

    # Check for H100 (compute capability 9.0+)
    compute_cap = torch.cuda.get_device_capability(0)
    if compute_cap[0] >= 9:
        return "h100"

    # Check for A100
    if "a100" in gpu_name:
        return "a100"

    # Tier by VRAM
    if vram_gb >= 40:
        return "high"
    elif vram_gb >= 20:
        return "medium"
    else:
        return "low"
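Because the function above queries the live device, it is hard to unit-test on a machine without the target GPU. A possible refactor, offered here as an assumption and not taken from the source, moves the decision logic into a pure function that takes the device name, VRAM, and compute capability as arguments. `classify_gpu_tier` is a hypothetical name, and the sample inputs are approximate examples.

```python
def classify_gpu_tier(gpu_name: str, vram_gb: float, compute_major: int) -> str:
    """Same decision order as _detect_gpu_tier(), but on plain values."""
    if compute_major >= 9:  # compute capability 9.0 or higher
        return "h100"
    if "a100" in gpu_name.lower():
        return "a100"
    if vram_gb >= 40:
        return "high"
    if vram_gb >= 20:
        return "medium"
    return "low"


# Illustrative inputs; name strings and VRAM figures are approximate.
print(classify_gpu_tier("NVIDIA A100-SXM4-40GB", 42.5, 8))    # a100
print(classify_gpu_tier("NVIDIA GeForce RTX 4090", 25.4, 8))  # medium
print(classify_gpu_tier("Tesla T4", 15.8, 7))                 # low
```

Two things become visible when the logic is written this way. With these thresholds, every 24 GB card lands in "medium", so the tier names do not map one-to-one onto the rows of the configuration matrix. The `>= 9` check also matches any architecture newer than Hopper, which may or may not be what is wanted.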

Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
| --- | --- | --- |
| Training mode completely replaces config | Lost GPU-specific optimizations | Modes should layer adjustments, not replace |
| n_envs=256 on A100 | Only 5-12% GPU utilization | Need 1000+ envs for GPU saturation |
| compile_policy=False in quick_test | Missing 3-6x speedup | Always enable torch.compile on modern GPUs |
| Fixed config for all GPUs | Wasted resources or OOM errors | Detect GPU and scale accordingly |
| Checking GPU only in "auto" mode | quick_test/standard modes got generic config | ALWAYS detect GPU, regardless of mode |

Diagnostic Checklist

If training is slow, check these in order:

  1. FPS < 10,000 on A100? → Check n_envs (should be 1024+)
  2. torch.compile: False? → Enable it (3-6x speedup after warmup)
  3. GPU util < 20%? → Increase n_envs
  4. Memory errors? → Decrease n_envs or minibatch_size
  5. H100 with FP8=False? → Enable FP8 for additional speedup
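The checklist can also be expressed as a small function that takes observed metrics and returns the matching suggestions. This is a sketch only: `diagnose` and its parameter names are hypothetical, the thresholds are the ones listed above, and how FPS or GPU utilization are measured is left to the caller.

```python
def diagnose(fps: float, gpu_util_pct: float, n_envs: int, compile_enabled: bool,
             oom: bool = False, is_h100: bool = False,
             fp8_enabled: bool = False) -> list[str]:
    """Walk the checklist in order and collect every suggestion that applies."""
    advice = []
    if fps < 10_000 and n_envs < 1024:
        advice.append("n_envs is below 1024: raise it")
    if not compile_enabled:
        advice.append("torch.compile is off: enable it")
    if gpu_util_pct < 20:
        advice.append("GPU utilization under 20%: increase n_envs")
    if oom:
        advice.append("memory errors: decrease n_envs or minibatch_size")
    if is_h100 and not fp8_enabled:
        advice.append("H100 without FP8: enable FP8")
    return advice


# The failing run described in this document: ~4,500 FPS, n_envs=256, compile off.
# The utilization value is an example from the reported 5-12% range.
for line in diagnose(fps=4_500, gpu_util_pct=8, n_envs=256, compile_enabled=False):
    print("-", line)
```

For the failing run, the first three checks fire; a healthy A100 configuration returns an empty list.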

Key Insights

  • GPU detection must happen FIRST, before applying training modes
  • GPU saturation needs roughly 1,000+ parallel environments; n_envs=256 reached only 5-12% utilization on the A100
  • torch.compile provides a 3-6x speedup but takes 10+ minutes to warm up
  • FP8 is only available on Hopper architecture (H100, compute capability 9.0+)
  • Training modes should adjust timesteps/epochs, NOT hardware-specific params

Quick Fix Command

If you see slow training on A100, the config should show:

n_envs: 1024+
torch.compile: True
compile_mode: reduce-overhead

If any of these values differ, get_auto_config() is either not detecting the GPU properly or a training mode is overriding the GPU-specific settings.
