Agent skills
gpu-aware-training-config

Agent skill

gpu-aware-training-config

GPU-aware PPO training configuration for A100/H100. Trigger when training is slow or GPU utilization is low.

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/gpu-aware-training-config

SKILL.md

GPU-Aware Training Configuration

Experiment Overview

Item	Details
Date	2025-12-18
Goal	Fix extremely slow A100 training (FPS 4,500 vs expected 30,000-50,000)
Environment	Google Colab A100, PyTorch 2.x, CUDA
Status	Success - 10x+ speedup achieved

Context

Training was extremely slow on A100 Colab GPU despite using "quick_test" mode. Investigation revealed that get_auto_config(training_mode="quick_test") was returning a generic config with n_envs=256 and torch.compile=False, completely ignoring GPU capabilities.

Root Cause

The original get_auto_config() function had training modes that completely bypassed GPU detection:

python

# WRONG - ignores GPU capabilities
def get_auto_config(total_timesteps, training_mode="auto"):
    if training_mode == "quick_test":
        return NativePPOConfig(
            n_envs=256,           # Too low for A100!
            compile_policy=False,  # Missing 3-6x speedup!
            # ... generic settings
        )

Verified Solution

Training modes must layer on top of GPU-specific settings, not replace them:

python

def get_auto_config(total_timesteps=1_000_000, training_mode="auto"):
    # Step 1: ALWAYS detect GPU first
    gpu_tier = _detect_gpu_tier()  # "h100", "a100", "high", "medium", "low"

    # Step 2: Get GPU-appropriate base config
    if gpu_tier == "h100":
        config = _get_h100_base_config()
    elif gpu_tier == "a100":
        config = _get_a100_base_config()
    # ... etc

    # Step 3: Apply training mode ADJUSTMENTS (not replacements)
    if training_mode == "quick_test":
        config.total_timesteps = 10_000_000
        config.validation_interval = 25
        # BUT KEEP GPU-specific n_envs, compile_policy, etc!

GPU Configuration Matrix

GPU Tier	n_envs	n_steps	minibatch	compile	FP8	Expected FPS
H100-80GB	2048	512	8192	True	True	80,000-120,000
A100-80GB	2048	512	8192	True	False	50,000-80,000
A100-40GB	1024	512	4096	True	False	40,000-60,000
RTX 4090	1024	512	4096	True	False	30,000-50,000
RTX 3090	512	512	2048	True	False	20,000-35,000
Generic	256	512	2048	False	False	5,000-15,000

Training Mode Adjustments

Training modes should ONLY adjust these parameters:

Mode	timesteps	n_epochs	validation_interval	Notes
quick_test	10M	10	25	Fast iteration
standard	50M	12	50	Development
production	200M	15	100	Full training
extended	500M	20	200	Maximum learning

GPU Detection Code

python

def _detect_gpu_tier() -> str:
    """Detect GPU tier for optimal configuration."""
    if not torch.cuda.is_available():
        return "cpu"

    gpu_name = torch.cuda.get_device_name(0).lower()
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

    # Check for H100 (compute capability 9.0+)
    compute_cap = torch.cuda.get_device_capability(0)
    if compute_cap[0] >= 9:
        return "h100"

    # Check for A100
    if "a100" in gpu_name:
        return "a100"

    # Tier by VRAM
    if vram_gb >= 40:
        return "high"
    elif vram_gb >= 20:
        return "medium"
    else:
        return "low"

Failed Attempts (Critical)

Attempt	Why it Failed	Lesson Learned
Training mode completely replaces config	Lost GPU-specific optimizations	Modes should layer adjustments, not replace
n_envs=256 on A100	Only 5-12% GPU utilization	Need 1000+ envs for GPU saturation
compile_policy=False in quick_test	Missing 3-6x speedup	Always enable torch.compile on modern GPUs
Fixed config for all GPUs	Wasted resources or OOM errors	Detect GPU and scale accordingly
Checking GPU only in "auto" mode	quick_test/standard modes got generic config	ALWAYS detect GPU, regardless of mode

Diagnostic Checklist

If training is slow, check these in order:

FPS < 10,000 on A100? → Check n_envs (should be 1024+)
torch.compile: False? → Enable it (3-6x speedup after warmup)
GPU util < 20%? → Increase n_envs
Memory errors? → Decrease n_envs or minibatch_size
H100 with FP8=False? → Enable FP8 for additional speedup

Key Insights

GPU detection must happen FIRST, before applying training modes
Research shows 1000+ parallel environments needed for GPU saturation
torch.compile provides 3-6x speedup but takes 10+ min to warmup
FP8 is only available on Hopper architecture (H100, compute capability 9.0+)
Training modes should adjust timesteps/epochs, NOT hardware-specific params

Quick Fix Command

If you see slow training on A100, the config should show:

n_envs: 1024+
torch.compile: True
compile_mode: reduce-overhead

If any of these are wrong, the get_auto_config() function isn't detecting the GPU properly.

References

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/gpu-aware-training-config
License: MIT License

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

GPU-Aware Training Configuration

Experiment Overview

Context

Root Cause

Verified Solution

GPU Configuration Matrix

Training Mode Adjustments

GPU Detection Code

Failed Attempts (Critical)

Diagnostic Checklist

Key Insights

Quick Fix Command

References

Recommended Agent Skills

agent-ops-spec

agent-ops-state

agent-ops-spec

agent-ops-testing

agent-ops-testing

agent-ops-state