Agent skill
runtime-skills
Universal Runtime best practices for PyTorch inference, Transformers models, and FastAPI serving. Covers device management, model loading, memory optimization, and performance tuning.
Install this agent skill to your Project
npx add-skill https://github.com/llama-farm/llamafarm/tree/main/.claude/skills/runtime-skills
SKILL.md
Universal Runtime Skills
Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server.
Overview
The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models:
- Text generation (Causal LMs: GPT, Llama, Mistral, Qwen)
- Text embeddings (BERT, sentence-transformers, ModernBERT)
- Classification, NER, and reranking
- OCR and document understanding
- Anomaly detection
Directory: runtimes/universal/
Python: 3.11+
Key Dependencies: PyTorch, Transformers, FastAPI, llama-cpp-python
Links to Shared Skills
This skill extends the shared Python practices. Always apply these first:
| Topic | File | Priority |
|---|---|---|
| Patterns | python-skills/patterns.md | Medium |
| Async | python-skills/async.md | High |
| Typing | python-skills/typing.md | Medium |
| Testing | python-skills/testing.md | Medium |
| Errors | python-skills/error-handling.md | High |
| Security | python-skills/security.md | Critical |
Runtime-Specific Checklists
| Topic | File | Key Points |
|---|---|---|
| PyTorch | pytorch.md | Device management, dtype, memory cleanup |
| Transformers | transformers.md | Model loading, tokenization, inference |
| FastAPI | fastapi.md | API design, streaming, lifespan |
| Performance | performance.md | Batching, caching, optimizations |
Architecture
runtimes/universal/
├── server.py # FastAPI app, model caching, endpoints
├── core/
│ └── logging.py # UniversalRuntimeLogger (structlog)
├── models/
│ ├── base.py # BaseModel ABC with device management
│ ├── language_model.py # Transformers text generation
│ ├── gguf_language_model.py # llama-cpp-python for GGUF
│ ├── encoder_model.py # Embeddings, classification, NER, reranking
│ └── ... # OCR, anomaly, document models
├── routers/
│ └── chat_completions/ # Chat completions with streaming
├── utils/
│ ├── device.py # Device detection (CUDA/MPS/CPU)
│ ├── model_cache.py # TTL-based model caching
│ ├── model_format.py # GGUF vs transformers detection
│ └── context_calculator.py # GGUF context size computation
└── tests/
Key Patterns
1. Model Loading with Double-Checked Locking
_model_load_lock = asyncio.Lock()
async def load_encoder(model_id: str, task: str = "embedding"):
cache_key = f"encoder:{task}:{model_id}"
if cache_key not in _models:
async with _model_load_lock:
# Double-check after acquiring lock
if cache_key not in _models:
model = EncoderModel(model_id, device, task=task)
await model.load()
_models[cache_key] = model
return _models.get(cache_key)
2. Device-Aware Tensor Operations
class BaseModel(ABC):
def get_dtype(self, force_float32: bool = False):
if force_float32:
return torch.float32
if self.device in ("cuda", "mps"):
return torch.float16
return torch.float32
def to_device(self, tensor: torch.Tensor, dtype=None):
# Don't change dtype for integer tensors
if tensor.dtype in (torch.int32, torch.int64, torch.long):
return tensor.to(device=self.device)
dtype = dtype or self.get_dtype()
return tensor.to(device=self.device, dtype=dtype)
3. TTL-Based Model Caching
_models: ModelCache[BaseModel] = ModelCache(ttl=300) # 5 min TTL
async def _cleanup_idle_models():
while True:
await asyncio.sleep(CLEANUP_CHECK_INTERVAL)
for cache_key, model in _models.pop_expired():
await model.unload()
4. Async Generation with Thread Pools
# GGUF models use blocking llama-cpp, run in executor
self._executor = ThreadPoolExecutor(max_workers=1)
async def generate(self, messages, max_tokens=512, ...):
loop = asyncio.get_running_loop()
return await loop.run_in_executor(self._executor, self._generate_sync)
Review Priority
When reviewing Universal Runtime code:
-
Critical - Security
- Path traversal prevention in file endpoints
- Input sanitization for model IDs
-
High - Memory & Device
- Proper CUDA/MPS cache clearing on unload
- torch.no_grad() for inference
- Correct dtype for device
-
Medium - Performance
- Model caching patterns
- Batch processing where applicable
- Streaming implementation
-
Low - Code Style
- Consistent with patterns.md
- Proper type hints
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
common-skills
Best practices for the Common utilities package in LlamaFarm. Covers HuggingFace Hub integration, GGUF model management, and shared utilities.
typescript-skills
Shared TypeScript best practices for Designer and Electron subsystems.
wt
Manage LlamaFarm worktrees for isolated parallel development. Create, start, stop, and clean up worktrees.
generate-subsystem-skills
Generate specialized skills for each subsystem in the monorepo. Creates shared language skills and subsystem-specific checklists for high-quality AI code generation.
temp-files
Guidelines for creating temporary files in system temp directory. Use when agents need to create reports, logs, or progress files without cluttering the repository.
code-review
Comprehensive code review for diffs. Analyzes changed code for security vulnerabilities, anti-patterns, and quality issues. Auto-detects domain (frontend/backend) from file paths.
Didn't find tool you were looking for?