Agent skill

distributed-llm-pretraining-torchtitan

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.


Install this agent skill to your Project

npx add-skill https://github.com/NousResearch/hermes-agent/tree/main/optional-skills/mlops/torchtitan

Metadata

Additional technical details for this skill

hermes
{
    "tags": [
        "Model Architecture",
        "Distributed Training",
        "TorchTitan",
        "FSDP2",
        "Tensor Parallel",
        "Pipeline Parallel",
        "Context Parallel",
        "Float8",
        "Llama",
        "Pretraining"
    ]
}

SKILL.md

TorchTitan - PyTorch Native Distributed LLM Pretraining

Quick start

TorchTitan is PyTorch's official platform for large-scale LLM pretraining. It offers composable 4D parallelism (FSDP2, TP, PP, CP) and reports speedups of 65% or more over baselines on H100 GPUs.

Installation:

bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt

Download tokenizer:

bash
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...

Start training on 8 GPUs:

bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh

Common workflows

Workflow 1: Pretrain Llama 3.1 8B on single node

Copy this checklist:

Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint

Step 1: Download tokenizer

bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN

Step 2: Configure training

Edit or create a TOML config file:

toml
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
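
Before launching, it can help to check how many tokens the run will actually consume. The sketch below does the arithmetic for the config above, assuming 8 GPUs all used for FSDP (so the data-parallel degree is 8) and no gradient accumulation. The helper name `tokens_per_step` is hypothetical and not part of torchtitan.

```python
def tokens_per_step(local_batch_size: int, seq_len: int, dp_degree: int) -> int:
    """Tokens consumed per optimizer step, ignoring gradient accumulation."""
    return local_batch_size * seq_len * dp_degree


# Values from the TOML above; dp_degree=8 assumes one node with 8 GPUs.
per_step = tokens_per_step(local_batch_size=2, seq_len=8192, dp_degree=8)
total = per_step * 1000  # steps = 1000

print(per_step)  # 131072
print(total)     # 131072000
```

At roughly 131M tokens, this config is a smoke test rather than a full pretraining run; raise `steps` once the pipeline is verified.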

Step 3: Launch training

bash
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml

Step 4: Monitor and checkpoint

With TensorBoard logging enabled (for example by passing --metrics.enable_tensorboard at launch), logs are saved to ./outputs/tb/:

bash
tensorboard --logdir ./outputs/tb

Workflow 2: Multi-node training with SLURM

Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint

Step 1: Configure parallelism for scale

For a 70B model on 256 GPUs (32 nodes of 8 GPUs each):

toml
[parallelism]
data_parallel_shard_degree = 32  # FSDP across 32 ranks
tensor_parallel_degree = 8        # TP within node
pipeline_parallel_degree = 1      # No PP for 70B
context_parallel_degree = 1       # Increase for long sequences
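
The degrees above have to multiply out to the total GPU count (32 x 8 x 1 x 1 = 256). A minimal sketch of that check follows, assuming `-1` for the shard degree means "use whatever GPUs remain", as in the Workflow 1 config. The function `resolve_degrees` is a hypothetical helper for sanity-checking a config by hand, not torchtitan's own validation code.

```python
from math import prod


def resolve_degrees(world_size, dp_shard=-1, tp=1, pp=1, cp=1):
    """Fill in dp_shard=-1 and verify the degrees multiply to world_size."""
    fixed = tp * pp * cp
    if dp_shard == -1:
        if world_size % fixed != 0:
            raise ValueError(f"world_size {world_size} not divisible by tp*pp*cp={fixed}")
        dp_shard = world_size // fixed
    degrees = {"dp_shard": dp_shard, "tp": tp, "pp": pp, "cp": cp}
    if prod(degrees.values()) != world_size:
        raise ValueError(f"degrees {degrees} do not multiply to {world_size}")
    return degrees


print(resolve_degrees(256, dp_shard=32, tp=8))  # the 70B layout above
print(resolve_degrees(256, tp=8))               # dp_shard inferred as 32
```

A mismatch such as `dp_shard=16, tp=8` on 256 GPUs raises a `ValueError` here, which is cheaper to catch before the job is queued.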

Step 2: Set up SLURM script

bash
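#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=1   # One torchrun launcher per node; torchrun spawns the 8 workers
#SBATCH --gpus-per-node=8

# Rendezvous on the first node of the allocation (29500 is an arbitrary free port)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml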

Step 3: Submit job

bash
# Assumes the script from Step 2 was saved as multinode_trainer.slurm
sbatch multinode_trainer.slurm

Step 4: Resume from checkpoint

Training resumes automatically from the latest checkpoint if one exists in the configured checkpoint folder.
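
To see which step a restarted job would pick up, you can inspect the checkpoint folder yourself. The sketch below assumes checkpoints are stored as `step-N` subfolders (the naming that appears in the `checkpoint/step-1000` path later in this document). `latest_checkpoint_step` is a hypothetical helper for manual inspection, not torchtitan's resume logic.

```python
import re
import tempfile
from pathlib import Path

STEP_DIR = re.compile(r"^step-(\d+)$")


def latest_checkpoint_step(folder):
    """Return the highest N among step-N subfolders, or None if none exist."""
    root = Path(folder)
    if not root.is_dir():
        return None
    steps = []
    for path in root.iterdir():
        match = STEP_DIR.match(path.name)
        if path.is_dir() and match:
            steps.append(int(match.group(1)))
    return max(steps, default=None)


# Demo on a throwaway directory standing in for the checkpoint folder.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000):
        (Path(tmp) / f"step-{step}").mkdir()
    resume_step = latest_checkpoint_step(tmp)

print(resume_step)  # 1000
```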

Workflow 3: Enable Float8 training for H100s

Float8 can provide a 30-50% speedup on H100 GPUs. The gain comes from large matrix multiplications, so results vary with model size.

Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile

Step 1: Install torchao

bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git

Step 2: Configure Float8

Add to your TOML config:

toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"]  # Exclude output layer

[compile]
enable = true
components = ["model", "loss"]

Step 3: Launch with compile

bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable

Workflow 4: 4D parallelism for 405B models

4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs

Step 1: Create seed checkpoint

Required for consistent initialization across PP stages:

bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1

Step 2: Configure 4D parallelism

toml
[parallelism]
# 8 x 8 x 8 x 1 = 512 GPUs
data_parallel_shard_degree = 8   # FSDP
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 8     # PP across nodes
context_parallel_degree = 1      # CP off here; raise it for long sequences and lower another degree to match

[training]
local_batch_size = 32
seq_len = 8192
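
To see why TP stays "within node" in this layout, it helps to map a global rank to its position along each parallel dimension. The sketch below assumes a row-major device mesh ordered pp, dp_shard, cp, tp from outermost to innermost, and 8 GPUs per node. That ordering is an assumption made for illustration; check the mesh your torchtitan version actually builds. `mesh_coords` is a hypothetical helper.

```python
# (name, degree) pairs, outermost dimension first. Ordering is an assumption.
DEGREES = [("pp", 8), ("dp_shard", 8), ("cp", 1), ("tp", 8)]
GPUS_PER_NODE = 8


def mesh_coords(rank, degrees=DEGREES):
    """Decompose a global rank into per-dimension coordinates (row-major)."""
    coords = {}
    for name, size in reversed(degrees):  # innermost dimension varies fastest
        coords[name] = rank % size
        rank //= size
    return coords


for rank in (0, 7, 8, 511):
    print(rank, "node", rank // GPUS_PER_NODE, mesh_coords(rank))
```

Under these assumptions, ranks 0 through 7 differ only in their `tp` coordinate and share node 0, so the bandwidth-heavy TP collectives stay on intra-node links while PP traffic crosses nodes.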

Step 3: Launch on 512 GPUs

bash
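# 64 nodes x 8 GPUs = 512 GPUs
# Assumes MASTER_ADDR and MASTER_PORT are set as in the Workflow 2 SLURM script
srun torchrun --nnodes=64 --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml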

When to use vs alternatives

Use TorchTitan when:

  • Pretraining LLMs from scratch (8B to 405B+)
  • Need PyTorch-native solution without third-party dependencies
  • Require composable 4D parallelism (FSDP2, TP, PP, CP)
  • Training on H100s with Float8 support
  • Want interoperable checkpoints with torchtune/HuggingFace

Use alternatives instead:

  • Megatron-LM: Maximum performance for NVIDIA-only deployments
  • DeepSpeed: Broader ZeRO optimization ecosystem, inference support
  • Axolotl/TRL: Fine-tuning rather than pretraining
  • LitGPT: Educational, smaller-scale training

Common issues

Issue: Out of memory on large models

Switch to full activation checkpointing and reduce the batch size:

toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1

Or use gradient accumulation:

toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
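
The number of accumulation steps follows from the two batch sizes and the data-parallel degree. The sketch below assumes the relationship global batch = local batch x data-parallel degree x accumulation steps, and that the global batch must divide evenly. `grad_accum_steps` is a hypothetical helper, not a torchtitan function.

```python
def grad_accum_steps(global_batch_size, local_batch_size, dp_degree):
    """Micro-batches accumulated per optimizer step under the stated assumption."""
    per_pass = local_batch_size * dp_degree  # samples processed in one forward/backward
    if global_batch_size % per_pass != 0:
        raise ValueError(
            f"global_batch_size {global_batch_size} is not a multiple of "
            f"local_batch_size * dp_degree = {per_pass}"
        )
    return global_batch_size // per_pass


# Config above on a single 8-GPU node with pure FSDP (dp_degree=8).
print(grad_accum_steps(global_batch_size=32, local_batch_size=1, dp_degree=8))  # 4
```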

Issue: TP causes high memory with async collectives

Set environment variable:

bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1

Issue: Float8 training not faster

Float8 only benefits large GEMMs. Filter small layers:

toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
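
The filtering idea can be illustrated as follows. This is a sketch under two assumptions: that each plain entry in `filter_fqns` is matched as a substring of a module's fully qualified name, and that Float8 matmuls need both dimensions divisible by 16. The special `auto_filter_small_kn` entry is left out because its size thresholds are not described here. `should_convert` and the example layer shapes are hypothetical.

```python
FILTER_FQNS = ["attention.wk", "attention.wv", "output"]


def should_convert(fqn, in_features, out_features, filter_fqns=FILTER_FQNS):
    """Decide whether a linear layer would be swapped to Float8 in this sketch."""
    if any(pattern in fqn for pattern in filter_fqns):
        return False  # explicitly excluded by name
    # Assumed hardware constraint: both GEMM dimensions divisible by 16.
    return in_features % 16 == 0 and out_features % 16 == 0


# Illustrative layer names and shapes for an 8B-class model.
layers = [
    ("layers.0.attention.wq", 4096, 4096),
    ("layers.0.attention.wk", 4096, 1024),
    ("layers.0.feed_forward.w1", 4096, 14336),
    ("output", 4096, 128256),
]
for fqn, k, n in layers:
    print(fqn, should_convert(fqn, k, n))
```

In this example only `attention.wq` and `feed_forward.w1` are converted; the smaller key projection and the output layer stay in higher precision.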

Issue: Checkpoint loading fails after parallelism change

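DCP checkpoints can be resharded at load time. If loading still fails, or you need a copy that does not depend on the parallelism layout, convert the checkpoint to a single torch.save file: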

bash
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch checkpoint/step-1000 checkpoint.pt

Issue: Pipeline parallelism initialization

Create a seed checkpoint first (see Workflow 4, Step 1).

Supported models

| Model | Sizes | Status |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |

Performance benchmarks (H100)

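| Model | GPUs | Parallelism | TPS/GPU (tokens/sec per GPU) | Techniques |
|---|---|---|---|---|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |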
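
The per-GPU figures above can be turned into a relative speedup and an aggregate cluster throughput with simple arithmetic. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# (model, gpus, tokens/sec per GPU) taken from the benchmark table
rows = [
    ("Llama 8B baseline", 8, 5762),
    ("Llama 8B compile+FP8", 8, 8532),
    ("Llama 70B", 256, 876),
    ("Llama 405B", 512, 128),
]

# Relative speedup of compile+FP8 over the FSDP baseline on 8 GPUs
speedup_pct = (8532 / 5762 - 1) * 100
print(round(speedup_pct))  # 48

# Aggregate throughput = per-GPU throughput x number of GPUs
cluster_tps = {name: gpus * tps for name, gpus, tps in rows}
for name, tps in cluster_tps.items():
    print(name, tps)
```

The 70B run reaches about 224K tokens/sec across 256 GPUs, while the 405B run reaches about 66K tokens/sec across 512 GPUs.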

Advanced topics

FSDP2 configuration: See references/fsdp.md for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.

Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.

Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.

Adding custom models: See references/custom-models.md for TrainSpec protocol.
