# funsloth-runpod

Training manager for RunPod GPU instances: configure pods, launch training, monitor progress, retrieve checkpoints.


Install this agent skill into your project:

```shell
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/funsloth-runpod
```

## SKILL.md

# RunPod Training Manager

Run Unsloth training on RunPod GPU instances.

## Prerequisites

1. **RunPod API key**: `echo $RUNPOD_API_KEY` (get one at runpod.io/console/user/settings)
2. **RunPod SDK**: `pip install runpod`
3. **Training notebook or script**: from funsloth-train

## Workflow

### 1. Select GPU

| GPU | VRAM | Cost | Best for |
|-----|------|------|----------|
| RTX 3090 | 24GB | ~$0.35/hr | Budget 7-14B |
| RTX 4090 | 24GB | ~$0.55/hr | Fast 7-14B |
| A100 40GB | 40GB | ~$1.50/hr | 14-34B |
| A100 80GB | 80GB | ~$2.00/hr | 70B |
| H100 | 80GB | ~$3.50/hr | Fastest |

RunPod typically has lower prices than HF Jobs.
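As a rough starting point, the table above can be encoded as a small helper that maps model size to a GPU choice. The `gpu_type_id` strings below are illustrative; confirm the exact identifiers in the RunPod console or via `runpod.get_gpus()` before passing them to `create_pod`.

```python
# Hypothetical helper: pick a GPU tier from the table above by model size.
# The GPU id strings are assumptions - verify them against your RunPod account.
def suggest_gpu(model_size_b: float, budget: bool = False) -> str:
    """Return a RunPod GPU type for a model of `model_size_b` billion params."""
    if model_size_b <= 14:
        return "NVIDIA GeForce RTX 3090" if budget else "NVIDIA GeForce RTX 4090"
    if model_size_b <= 34:
        return "NVIDIA A100-SXM4-40GB"
    return "NVIDIA A100 80GB PCIe"  # 70B-class models need 80GB VRAM

print(suggest_gpu(7, budget=True))   # budget 7B run
print(suggest_gpu(70))               # 70B run
```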

### 2. Choose Deployment

- **Pod (recommended)**: persistent, SSH access, network storage
- **Serverless**: pay-per-second billing but more complex to set up (better suited to inference)

### 3. Configure Network Volume (Recommended)

```python
import os
import runpod

# Authenticate the SDK before making any calls
runpod.api_key = os.environ["RUNPOD_API_KEY"]

volume = runpod.create_network_volume(name="funsloth-training", size_gb=50, region="US")
```

A network volume lets you resume training after interruption, download checkpoints after the pod is gone, and share data between pods.

### 4. Launch Pod

Use the official Unsloth Docker image for a pre-configured environment:

```python
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

pod = runpod.create_pod(
    name="funsloth-training",
    image_name="unsloth/unsloth",  # official image; supports all GPUs incl. Blackwell
    gpu_type_id="{gpu_type}",
    volume_in_gb=50,
    network_volume_id="{volume_id}",
    env={
        "HF_TOKEN": "{token}",
        "WANDB_API_KEY": "{key}",
        "JUPYTER_PASSWORD": "unsloth",
    },
    ports="8888/http,22/tcp",
)
```

The Unsloth image includes Jupyter Lab (port 8888) and example notebooks in `/workspace/unsloth-notebooks/`.

### 5. Upload and Run

```bash
# Upload the training script (from your local machine)
scp train.py root@{pod_ip}:/workspace/

# SSH into the pod
ssh root@{pod_ip}

# Run training inside tmux so it survives SSH disconnects
tmux new -s training
cd /workspace && python train.py
# Ctrl+B, D to detach
```

### 6. Monitor

```bash
# From an SSH session on the pod
tail -f /workspace/training.log
nvidia-smi -l 1  # refresh GPU stats every second
```

Or use the dashboard: https://runpod.io/console/pods/{pod_id}
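Beyond tailing the log, you can pull the latest loss out of it programmatically. A sketch, assuming the trainer's default logging format, where each logged step prints a dict like `{'loss': 1.87, 'epoch': 0.2}`:

```python
import re

def latest_loss(log_text: str):
    """Return the most recent loss value from trainer log output, or None."""
    losses = re.findall(r"'loss':\s*([0-9.]+)", log_text)
    return float(losses[-1]) if losses else None

log = "{'loss': 2.11, 'epoch': 0.1}\n{'loss': 1.87, 'epoch': 0.2}\n"
print(latest_loss(log))  # 1.87
```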

### 7. Retrieve Checkpoints

```bash
# Save to the network volume (survives pod termination)
cp -r /workspace/outputs /runpod-volume/

# Or download via SCP (from your local machine)
scp -r root@{pod_ip}:/workspace/outputs ./

# Or push to HF Hub from pod
```
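A sketch of the Hub push, assuming `HF_TOKEN` was passed as a pod environment variable and that `huggingface_hub` is installed (it ships with any `transformers` environment); the repo id is a placeholder:

```python
import os

def push_outputs(repo_id: str, folder: str = "/workspace/outputs") -> str:
    """Upload the output folder to the HF Hub, if a token is available."""
    token = os.environ.get("HF_TOKEN")
    if token is None:
        return "HF_TOKEN not set; skipping upload"
    from huggingface_hub import HfApi  # imported lazily so the check above runs anywhere
    HfApi(token=token).upload_folder(folder_path=folder, repo_id=repo_id)
    return f"uploaded {folder} to {repo_id}"

print(push_outputs("your-username/your-model"))
```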

### 8. Stop Pod

```python
runpod.stop_pod(pod_id)       # can resume later
runpod.terminate_pod(pod_id)  # deletes the pod; the network volume persists
```

### 9. Handoff

Offer funsloth-upload to push the model to the Hub with a model card.

## Best Practices

1. **Always use network volumes** - pod container storage is ephemeral.
2. **Use spot instances** for lower cost, accepting the risk of preemption.
3. **Set up SSH keys** before creating pods.
4. **Stop pods when not training** - you are billed per minute.
5. **Save checkpoints frequently** with `save_steps`.
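For practice 5, checkpoint frequency is set in the trainer config. A sketch assuming TRL's `SFTConfig` (the same fields exist on `transformers.TrainingArguments`); the values are illustrative:

```python
from trl import SFTConfig  # assumes trl is installed, as in the Unsloth image

args = SFTConfig(
    output_dir="/workspace/outputs",  # copy to /runpod-volume/ to survive termination
    save_steps=50,                    # checkpoint every 50 optimizer steps
    save_total_limit=2,               # keep only the 2 newest checkpoints on disk
    max_steps=1000,
)
```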

## Error Handling

| Error | Resolution |
|-------|------------|
| Pod creation failed | Try a different GPU type or region |
| SSH connection refused | Wait 1-2 minutes, then re-check the pod IP |
| Out of disk space | Increase the volume size or clean up old checkpoints |
| Volume not mounting | Confirm the volume is in the same region as the pod |
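For the first error, one pattern is to fall back through several GPU types until one has capacity. A hypothetical helper; the creation function is injected so the logic is testable, and in practice you would pass `runpod.create_pod` pre-filled with your other arguments via `functools.partial`:

```python
def create_with_fallback(create, gpu_types):
    """Try each GPU type in order; return the first pod that launches."""
    last_err = None
    for gpu in gpu_types:
        try:
            return create(gpu_type_id=gpu)
        except Exception as err:  # the SDK raises when creation fails
            last_err = err
    raise RuntimeError(f"no capacity for any of {gpu_types}") from last_err

# Usage sketch (not run here):
# from functools import partial
# pod = create_with_fallback(
#     partial(runpod.create_pod, name="funsloth-training", image_name="unsloth/unsloth"),
#     ["NVIDIA GeForce RTX 4090", "NVIDIA GeForce RTX 3090"],
# )
```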

## Bundled Resources
