Agent skill
ai-ml-principal-engineer
Principal/Senior-level AI/ML playbook for production machine learning systems, LLM-enabled backends, model serving, training pipelines, evaluation discipline, reliability, security, and MLOps. Use when: designing ML services, building or reviewing training/inference code, selecting model architectures, fine-tuning transformers, hardening model APIs, debugging performance or correctness issues, or preparing ML systems for production.
Install this agent skill to your Project
npx add-skill https://github.com/mOdrA40/claude-codex-skills-directory/tree/main/backend-skills/ai-ml-mastery-skill
SKILL.md
AI/ML Mastery (Senior → Principal)
Operate
- Start by confirming: objective, success metric, data availability, privacy/security constraints, latency and throughput targets, compute budget, deployment target, and the definition of done.
- Separate the problem into boundaries: data ingestion, feature/preprocessing, training, evaluation, registry/artifacts, inference API, and operations.
- Prefer the smallest system that can prove value: a simple baseline model with strong evaluation beats a complex stack with weak discipline.
- Treat ML work as software engineering: reproducibility, observability, rollback, and failure handling are part of the feature.
The goal is not just a high offline metric. The goal is a model-backed backend that is correct, measurable, operable, and safe in production.
Default Standards
- Keep notebooks for exploration only; production logic belongs in versioned Python modules and tests.
- Validate schema, dtypes, ranges, nullability, and label quality at the data boundary.
- Make training and inference preprocessing identical by sharing explicit pipeline code.
- Prefer typed config objects and immutable runtime settings.
- Use structured logging and explicit error taxonomy for data, model, dependency, and serving failures.
- Define latency budgets, timeout behavior, fallback behavior, and model version strategy before exposing public inference endpoints.
- Default to simpler baselines before large models; earn complexity with measured gains.
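One way to read the data-boundary standard above is as a small, explicit contract checked before any record reaches training or inference. The sketch below uses only the standard library; `FieldSpec`, `SCHEMA`, `DataValidationError`, and the field names are hypothetical and chosen for illustration, not part of this skill.

```python
from dataclasses import dataclass
from typing import Any, Optional


class DataValidationError(ValueError):
    """Hypothetical error type: a record violates the dataset contract."""


@dataclass(frozen=True)
class FieldSpec:
    """Contract for one field: expected type, nullability, optional numeric range."""
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical contract, for illustration only.
SCHEMA = {
    "user_id": FieldSpec(dtype=str),
    "age": FieldSpec(dtype=int, min_value=0, max_value=130),
    "score": FieldSpec(dtype=float, nullable=True, min_value=0.0, max_value=1.0),
}


def validate_record(record: dict, schema: dict = SCHEMA) -> list:
    """Return human-readable violations; an empty list means the record is valid."""
    problems = []
    unexpected = set(record) - set(schema)
    if unexpected:
        problems.append(f"unexpected fields: {sorted(unexpected)}")
    for name, spec in schema.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value: Any = record[name]
        if value is None:
            if not spec.nullable:
                problems.append(f"{name}: null not allowed")
            continue
        # bool is a subclass of int, so reject it explicitly for non-bool fields.
        is_bool_in_other_field = isinstance(value, bool) and spec.dtype is not bool
        if is_bool_in_other_field or not isinstance(value, spec.dtype):
            problems.append(
                f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
            continue
        if spec.min_value is not None and value < spec.min_value:
            problems.append(f"{name}: {value} below minimum {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            problems.append(f"{name}: {value} above maximum {spec.max_value}")
    return problems


def require_valid(record: dict) -> dict:
    """Raise at the boundary instead of letting bad data flow downstream."""
    problems = validate_record(record)
    if problems:
        raise DataValidationError("; ".join(problems))
    return record
```

Returning a list of violations rather than failing on the first one makes rejected records easier to triage; a schema library could replace this once the contract grows.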
“Bad vs Good” (common production pitfalls)
# ❌ BAD: training and inference use different preprocessing.
train_text = text.lower().strip()
serve_text = text.strip()
# ✅ GOOD: one shared preprocessing pipeline used everywhere.
normalized_text = text_normalizer.normalize(text)
# ❌ BAD: silent fallback hides model loading failures.
try:
    model = load_model(path)
except Exception:
    model = None
# ✅ GOOD: fail explicitly or switch to a known degraded mode.
try:
    model = load_model(path)
except FileNotFoundError as error:
    raise ModelBootstrapError(f"model artifact missing: {path}") from error
# ❌ BAD: unbounded inference call with no deadline.
prediction = client.predict(payload)
# ✅ GOOD: explicit deadline and graceful failure mapping.
prediction = client.predict(payload, timeout=2.0)
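The last pair above names "graceful failure mapping" but shows only the timeout argument. A minimal sketch of the mapping step follows, assuming the client itself offers no deadline, so the call is wrapped in a standard-library thread pool. `predict_with_deadline`, `InferenceTimeoutError`, and `InferenceDependencyError` are hypothetical names introduced here.

```python
import concurrent.futures
from typing import Any, Callable


class InferenceTimeoutError(RuntimeError):
    """Hypothetical error type: the model call exceeded its deadline."""


class InferenceDependencyError(RuntimeError):
    """Hypothetical error type: the model call failed for another reason."""


# Shared pool: a per-call `with ThreadPoolExecutor()` block would wait for the
# slow call to finish on exit, which defeats the deadline.
_EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def predict_with_deadline(
    predict: Callable[[Any], Any], payload: Any, timeout_s: float
) -> Any:
    """Run predict(payload) and map deadline or dependency failures to explicit errors."""
    future = _EXECUTOR.submit(predict, payload)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as error:
        # cancel() only helps if the call has not started; a running call keeps
        # its worker thread busy until it returns.
        future.cancel()
        raise InferenceTimeoutError(
            f"inference exceeded {timeout_s:.2f}s deadline"
        ) from error
    except Exception as error:
        raise InferenceDependencyError(
            f"inference failed: {type(error).__name__}"
        ) from error
```

Callers can then translate `InferenceTimeoutError` into a degraded response and `InferenceDependencyError` into an alertable failure. When the client supports a native timeout, as in the example above, prefer it, because a thread-based deadline cannot stop work that is already running.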
Workflow (Feature / Refactor / Bug)
- Define the business outcome, online/offline metrics, and failure tolerance.
- Establish a reproducible baseline and dataset contract.
- Design boundaries between training code, model packaging, and serving code.
- Implement the smallest end-to-end slice with tests and evaluation reports.
- Validate reproducibility, security, performance, and rollback readiness.
- Ship with monitoring for latency, throughput, drift, quality, and cost.
Validation Commands
- Run `python -m pytest`.
- Run `python -m ruff check .` if Ruff is used.
- Run `python -m mypy src` for typed code paths when the repo uses MyPy.
- Run `python -m pytest -k inference` for serving-critical tests.
- Run `python -m pytest --maxfail=1 --disable-warnings` during local debugging.
- Run smoke evaluation for the current model artifact before release.
- Run container build validation if inference is deployed via Docker.
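The smoke evaluation step in the list above could look roughly like the following: score the current artifact on a small labeled set and compare the result to a release threshold. This is a sketch under assumptions: `smoke_evaluate`, `SmokeResult`, the accuracy metric, and the threshold are all hypothetical, and a real gate would use the project's own metric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class SmokeResult:
    accuracy: float
    total: int
    passed: bool


def smoke_evaluate(
    predict: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    min_accuracy: float,
) -> SmokeResult:
    """Score predict on a small labeled set and compare against a release threshold."""
    if not examples:
        raise ValueError("smoke set must not be empty")
    correct = sum(1 for text, label in examples if predict(text) == label)
    accuracy = correct / len(examples)
    return SmokeResult(
        accuracy=accuracy,
        total=len(examples),
        passed=accuracy >= min_accuracy,
    )
```

A release script can exit non-zero when `passed` is false, so the gate runs in CI alongside the pytest commands.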
Backend-Oriented ML Guardrails
- Always version models, prompts, tokenizer assets, and preprocessing artifacts together.
- Do not call external model providers from request paths without timeouts, retries, budgets, and fallback behavior.
- Separate online inference from heavy offline batch jobs.
- Prefer async queue-based processing for expensive enrichment, reranking, or embedding backfills.
- Protect inference endpoints with payload size limits, authn/authz, and rate limiting.
- Log request IDs, model version, feature version, and decision metadata without leaking raw sensitive payloads.
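For the logging guardrail above, one possible shape is a record that carries the identifiers and versions while replacing the raw payload with its size and a digest. `build_inference_log` and its field names are assumptions made for this sketch.

```python
import hashlib
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("inference")


def build_inference_log(
    request_id: str,
    model_version: str,
    feature_version: str,
    payload: Dict[str, Any],
    decision: str,
    latency_ms: float,
) -> Dict[str, Any]:
    """Build a log record that identifies a request without storing its raw content."""
    # Canonical JSON so the same payload always yields the same digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "request_id": request_id,
        "model_version": model_version,
        "feature_version": feature_version,
        "decision": decision,
        "latency_ms": round(latency_ms, 2),
        "payload_bytes": len(canonical),
        "payload_sha256": hashlib.sha256(canonical).hexdigest(),
    }


def log_inference(record: Dict[str, Any]) -> None:
    """Emit the record as a single JSON line."""
    logger.info(json.dumps(record, sort_keys=True))
```

A plain digest lets identical requests be correlated, but it is not anonymization: short or low-entropy payloads can be recovered by guessing, so a keyed hash or dropping the digest may be the safer choice for sensitive fields.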
Decision Framework: Library Selection
| Task | Default Choice | Use Alternative When |
|---|---|---|
| Deep learning training | PyTorch | TensorFlow for TPU-heavy production, JAX for research-heavy experimentation |
| Classical/tabular ML | scikit-learn | XGBoost/LightGBM for stronger tabular baselines, CatBoost for categorical-heavy data |
| LLM application layer | transformers + sentence-transformers | vLLM for high-throughput serving, llama.cpp for edge or constrained environments |
| Data processing | pandas | polars for larger columnar workloads, dask/spark for distributed pipelines |
| Experiment tracking | MLflow | Weights & Biases or Neptune when team workflows require hosted collaboration |
| Hyperparameter tuning | Optuna | Ray Tune when you need distributed search orchestration |
Architecture Selection Heuristics
Text classification -> DistilBERT for speed, RoBERTa for stronger accuracy
Embeddings / retrieval -> sentence-transformers or hosted embedding APIs with evaluation gates
Vision classification -> ResNet/EfficientNet as baseline, ViT when data and budget justify it
Object detection -> YOLO for speed, DETR/RT-DETR when workflow favors transformer-based designs
Tabular prediction -> Logistic regression / XGBoost baseline first, deep tabular only if proven necessary
Recommendation -> retrieval + ranking pipelines, not a single monolithic model by default
Time series -> statistical baseline first, then TFT/PatchTST when complexity is justified
Recommended Project Structure
project/
├── pyproject.toml
├── README.md
├── src/
│   └── app/
│       ├── config/
│       ├── data/
│       ├── features/
│       ├── models/
│       ├── training/
│       ├── evaluation/
│       ├── inference/
│       ├── serving/
│       └── observability/
├── tests/
├── scripts/
├── configs/
├── notebooks/
└── docker/
Reliability, Security, and Operations
- Make model bootstrap behavior explicit: fail closed, fail open, or degraded mode.
- Bound input sizes, token counts, image dimensions, and recursion depth for untrusted requests.
- Prefer queue-based retries over client-side blind retries for expensive inference.
- Track feature drift, data freshness, and serving skew between training and production.
- Keep PII out of prompts, logs, traces, and experiment artifacts unless explicitly required and governed.
- Store secrets and provider credentials in secret managers, never in notebooks or source files.
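Bounding untrusted input, as the list above asks, can be sketched as a parse step that checks size before decoding and structure after. The limits, the `text` field, and the names `parse_untrusted_request` and `PayloadRejected` are hypothetical. The character count stands in for a token count, which would need the model's own tokenizer.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical limits, for illustration only.
MAX_BODY_BYTES = 64 * 1024
MAX_DEPTH = 8
MAX_TEXT_CHARS = 8_000


class PayloadRejected(ValueError):
    """Hypothetical error type: the request violates an input bound."""


def nesting_depth(value: Any) -> int:
    """Depth of nested dicts/lists, computed iteratively to avoid recursion."""
    deepest = 0
    stack: List[Tuple[Any, int]] = [(value, 0)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue
        depth += 1
        deepest = max(deepest, depth)
        stack.extend((child, depth) for child in children)
    return deepest


def parse_untrusted_request(body: bytes) -> Dict[str, Any]:
    """Decode a JSON request body, rejecting it if any bound is exceeded."""
    if len(body) > MAX_BODY_BYTES:
        raise PayloadRejected(f"body exceeds {MAX_BODY_BYTES} bytes")
    try:
        payload = json.loads(body)
    except (ValueError, RecursionError) as error:
        # Extremely deep nesting can exhaust the decoder's recursion limit.
        raise PayloadRejected("body is not valid JSON") from error
    if not isinstance(payload, dict):
        raise PayloadRejected("top-level JSON value must be an object")
    if nesting_depth(payload) > MAX_DEPTH:
        raise PayloadRejected(f"nesting deeper than {MAX_DEPTH} levels")
    text = payload.get("text")
    if not isinstance(text, str) or not text:
        raise PayloadRejected("'text' must be a non-empty string")
    if len(text) > MAX_TEXT_CHARS:
        raise PayloadRejected(f"'text' longer than {MAX_TEXT_CHARS} characters")
    return payload
```

Checking the byte length first keeps the cost of rejecting an oversized request constant, before any parsing work is done.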
Training and Evaluation Checklist
- Define offline and online success metrics before training.
- Fix random seeds when reproducibility matters.
- Check train/validation/test leakage.
- Validate preprocessing parity between train and serve.
- Save model artifact, config, tokenizer, and feature metadata together.
- Record dataset version and experiment version.
- Benchmark latency, throughput, memory, and cost.
- Define rollback or model disable strategy before release.
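Saving the artifact, config, tokenizer, and version metadata together can be made checkable with a manifest of content hashes written next to the files. The sketch below assumes a flat artifact directory and uses hypothetical names (`write_manifest`, `verify_manifest`, `manifest.json`). It is one possible layout, not a registry format.

```python
import hashlib
import json
from pathlib import Path
from typing import List

MANIFEST_NAME = "manifest.json"  # hypothetical file name


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large model weights are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(
    artifact_dir: Path, dataset_version: str, experiment_id: str, seed: int
) -> Path:
    """Record a hash for every file in artifact_dir together with run metadata."""
    files = {
        path.relative_to(artifact_dir).as_posix(): sha256_file(path)
        for path in sorted(artifact_dir.rglob("*"))
        if path.is_file() and path.name != MANIFEST_NAME
    }
    manifest = {
        "dataset_version": dataset_version,
        "experiment_id": experiment_id,
        "seed": seed,
        "files": files,
    }
    manifest_path = artifact_dir / MANIFEST_NAME
    manifest_path.write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
    return manifest_path


def verify_manifest(artifact_dir: Path) -> List[str]:
    """Return relative paths that are missing or no longer match the manifest."""
    manifest = json.loads((artifact_dir / MANIFEST_NAME).read_text(encoding="utf-8"))
    mismatched = []
    for relative, expected in manifest["files"].items():
        path = artifact_dir / relative
        if not path.is_file() or sha256_file(path) != expected:
            mismatched.append(relative)
    return mismatched
```

Running `verify_manifest` at model bootstrap gives the serving process a concrete reason to fail closed when a tokenizer or config has drifted from the model it was trained with.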
References
- Deep learning systems: references/deep-learning.md
- Transformers and LLMs: references/transformers-llm.md
- Computer vision: references/computer-vision.md
- Classical machine learning: references/machine-learning.md
- NLP systems: references/nlp.md
- MLOps and deployment: references/mlops.md
- Production model serving: references/production-serving.md
- Evaluation and release guardrails: references/evaluation-and-guardrails.md
- Retrieval and RAG systems: references/retrieval-and-rag-systems.md
- Inference reliability and cost control: references/inference-reliability-and-cost.md