Agent skill

ops-runpod

Provision, manage, and terminate RunPod GPU instances for LLM training. Use when user says "spin up GPU", "create RunPod instance", "terminate pod", "check GPU status", "provision training server", or needs cloud GPU resources.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/ops-runpod

Metadata

Additional technical details for this skill

project path
$RUNPOD_OPS_REPO (set via env; defaults to GitHub clone)
short description
RunPod GPU instance management

SKILL.md

RunPod Operations Skill

Manage RunPod GPU instances for LLM training and inference.

Self-contained skill - auto-installs via uv run from git (no pre-installation needed).

Quick Start

bash
# Via wrapper (auto-installs from GitHub on demand)
.agents/skills/ops-runpod/run.sh list-instances

# Create an instance
.agents/skills/ops-runpod/run.sh create-instance 70B --hours 4

# Monitor an instance
.agents/skills/ops-runpod/run.sh monitor <pod-id>

# Terminate an instance
.agents/skills/ops-runpod/run.sh terminate <pod-id>

Commands

Command Purpose
create-instance Create GPU pod optimized for model size
list-instances Show all running pods with status/cost
monitor Live monitoring of pod metrics
terminate Safely terminate a pod
estimate-cost Estimate training cost before creating
optimize Find optimal GPU config using benchmarks
start-training Start a training job on RunPod
serve Start inference server on RunPod

Examples

Create Instance

.agents/skills/ops-runpod/run.sh create-instance 70B --hours 4

Returns: pod_id=abc123


### Estimate Cost
```bash
.agents/skills/ops-runpod/run.sh estimate-cost 70B --hours 8

Monitor Instance

bash
.agents/skills/ops-runpod/run.sh monitor <pod-id>

Terminate Instance

bash
.agents/skills/ops-runpod/run.sh terminate <pod-id>

Environment Variables

Variable Required Description
RUNPOD_API_KEY Yes RunPod API key

Typical Workflow

bash
# 1. Estimate cost
.agents/skills/ops-runpod/run.sh estimate-cost 70B --hours 8

# 2. Create pod
.agents/skills/ops-runpod/run.sh create-instance 70B --hours 8

# 3. Monitor or SSH into pod
.agents/skills/ops-runpod/run.sh monitor <pod-id>

# 4. Run training...

# 5. Terminate when done
.agents/skills/ops-runpod/run.sh terminate <pod-id>

Integration with Memory

After training completes, log the lesson:

bash
memory-agent learn \
  --problem "Training Qwen3-70B on RunPod" \
  --solution "Used A100-80GB, 8 hours, cost $24.72. Config: lr=2e-5, batch=4"

Didn't find tool you were looking for?

Be as detailed as possible for better results