Agent skill
coreweave-core-workflow-b
Run distributed GPU training jobs on CoreWeave with multi-node PyTorch. Use when training models across multiple GPUs, setting up distributed training, or running fine-tuning jobs on CoreWeave H100 clusters. Trigger with phrases like "coreweave training", "coreweave multi-gpu", "distributed training coreweave", "fine-tune on coreweave".
Install this agent skill to your Project
npx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/coreweave-pack/skills/coreweave-core-workflow-b
CoreWeave Core Workflow: GPU Training
Overview
Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.
Prerequisites
- CKS cluster with multi-GPU node pools (8xA100 or 8xH100)
- Shared storage (CoreWeave PVC or NFS)
- Training container with PyTorch and NCCL
Instructions
Step 1: Single-Node Multi-GPU Training
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/myorg/trainer:latest
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"
            - "train.py"
            - "--model_name=meta-llama/Llama-3.1-8B"
            - "--batch_size=4"
            - "--epochs=3"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 512Gi
              cpu: "64"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values: ["A100_NVLINK_A100_SXM4_80GB"]
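The manifest passes three flags to train.py, and torchrun starts eight copies of the script, one per GPU. The skill does not show train.py itself, so the following is only a minimal sketch of how such a script might read its flags and the rank variables that torchrun exports (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The helper names `parse_args` and `torchrun_env` are hypothetical, and treating `--batch_size` as a per-GPU value is an assumption.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the flags that the Job manifest passes to train.py."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--batch_size", type=int, default=4, help="assumed per-GPU batch size")
    parser.add_argument("--epochs", type=int, default=3)
    return parser.parse_args(argv)


def torchrun_env(environ=os.environ):
    """Read the rank variables that torchrun exports to each worker process."""
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


# Simulate one of the eight workers started by --nproc_per_node=8.
args = parse_args(["--model_name=meta-llama/Llama-3.1-8B", "--batch_size=4", "--epochs=3"])
env = torchrun_env({"RANK": "5", "LOCAL_RANK": "5", "WORLD_SIZE": "8"})

# Assumption: batch_size is per GPU, so one optimizer step covers
# batch_size * world_size samples across the node.
global_batch = args.batch_size * env["world_size"]
print(f"rank {env['rank']} of {env['world_size']}, global batch {global_batch}")
```

In a real script, `local_rank` would typically select the CUDA device before the process group is initialized.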
Step 2: Persistent Storage for Training Data
# storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 500Gi
  storageClassName: shared-hdd-ord1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-checkpoints
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 200Gi
  storageClassName: shared-ssd-ord1
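Whether 200Gi is enough for the checkpoint volume depends on what each checkpoint contains. The arithmetic below is a rough sizing sketch, not a measurement: it assumes a round 8 billion parameters, 2 bytes per parameter for bf16 weights, and, for a full training state, fp32 weights plus two fp32 Adam moments (12 bytes per parameter). It ignores filesystem overhead, and the function names are hypothetical.

```python
GIB = 1024 ** 3


def checkpoint_bytes(num_params, weight_bytes=2, optimizer_bytes=0):
    """Estimate one checkpoint's size from the bytes stored per parameter."""
    return num_params * (weight_bytes + optimizer_bytes)


def checkpoints_that_fit(pvc_gib, ckpt_bytes):
    """Number of whole checkpoints that fit in a PVC of pvc_gib GiB."""
    return (pvc_gib * GIB) // ckpt_bytes


params = 8_000_000_000  # assumption: round figure for an 8B-parameter model

# bf16 weights only: 2 bytes per parameter.
weights_only = checkpoint_bytes(params)
# Assumed full state: fp32 weights (4 bytes) plus two fp32 Adam moments (8 bytes).
full_state = checkpoint_bytes(params, weight_bytes=4, optimizer_bytes=8)

print(checkpoints_that_fit(200, weights_only))  # 13
print(checkpoints_that_fit(200, full_state))    # 2
```

Under these assumptions the 200Gi claim holds only two resumable checkpoints, which is worth knowing before choosing a save interval.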
Step 3: Monitor Training Progress
# Watch training logs
kubectl logs -f job/llm-finetune
# Check GPU utilization
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- nvidia-smi
# Check training metrics (last 5 log lines, read inside the container; no TTY needed)
kubectl exec $(kubectl get pod -l job-name=llm-finetune -o name) -- \
  tail -n 5 /checkpoints/training_log.json
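The last few log lines are easier to act on once they are parsed. The skill does not specify the format of training_log.json, so this sketch assumes one JSON object per line with hypothetical `step` and `loss` keys. The sample records are synthetic and exist only to exercise the function.

```python
import json


def summarize_log_tail(lines, last_n=5):
    """Parse the last `last_n` JSON Lines records and report the loss trend."""
    records = [json.loads(line) for line in lines if line.strip()]
    tail = records[-last_n:]
    if not tail:
        return None
    return {
        "last_step": tail[-1]["step"],
        "last_loss": tail[-1]["loss"],
        # Compare the newest record with the oldest one in the window.
        "loss_decreasing": tail[-1]["loss"] < tail[0]["loss"],
    }


# Synthetic sample in the assumed format, not real training output.
sample = [
    '{"step": 100, "loss": 2.0}',
    '{"step": 200, "loss": 1.5}',
    '{"step": 300, "loss": 1.25}',
]
summary = summarize_log_tail(sample)
print(summary)
```

If the real log uses different keys or a single JSON document, the parsing step would need to change accordingly.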
Error Handling
| Error | Cause | Solution |
|---|---|---|
| NCCL timeout | Network issue between GPUs | Use NVLink nodes (SXM4/SXM5) |
| OOMKilled | Container exceeded its memory limit, often from an oversized batch | Reduce batch size or use gradient accumulation |
| Checkpoint save failed | PVC full | Increase storage or prune old checkpoints |
| Job evicted | Preemption | Use on-demand nodes for training |
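Two of the fixes in the table reduce to small calculations or file operations. The sketch below shows how many micro-batches to accumulate so that a smaller per-GPU batch keeps the same global batch, and how old checkpoints might be pruned. The `step-<N>.pt` file naming and both function names are assumptions for illustration. The skill does not say how train.py names its checkpoints.

```python
import math
import tempfile
from pathlib import Path


def accumulation_steps(target_global_batch, per_gpu_batch, world_size):
    """Micro-batches to accumulate per optimizer step to reach the target global batch."""
    return math.ceil(target_global_batch / (per_gpu_batch * world_size))


def prune_checkpoints(directory, keep=2, pattern="step-*.pt"):
    """Delete all but the `keep` highest-numbered checkpoints; return deleted names."""
    def step_of(path):
        # "step-300.pt" -> 300 (assumed naming scheme)
        return int(path.stem.split("-")[1])

    files = sorted(Path(directory).glob(pattern), key=step_of)
    stale = files[:-keep] if keep > 0 else files
    for path in stale:
        path.unlink()
    return [path.name for path in stale]


# Cutting the per-GPU batch from 4 to 1 on 8 GPUs while keeping a global batch of 32.
steps = accumulation_steps(target_global_batch=32, per_gpu_batch=1, world_size=8)

# Demonstrate pruning in a temporary directory rather than on a real PVC.
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 200, 300, 1000):
        (Path(tmp) / f"step-{step}.pt").touch()
    deleted = prune_checkpoints(tmp, keep=2)
    remaining = {p.name for p in Path(tmp).iterdir()}

print(steps, deleted, sorted(remaining))
```

Sorting by the parsed step number rather than by file name matters here: a plain string sort would place `step-1000.pt` before `step-300.pt` and delete the wrong files.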
Next Steps
For troubleshooting, see coreweave-common-errors.