Agent skill

ai-ml-principal-engineer

Principal/Senior-level AI/ML playbook for production machine learning systems, LLM-enabled backends, model serving, training pipelines, evaluation discipline, reliability, security, and MLOps. Use when: designing ML services, building or reviewing training/inference code, selecting model architectures, fine-tuning transformers, hardening model APIs, debugging performance or correctness issues, or preparing ML systems for production.

Install this agent skill to your Project

npx add-skill https://github.com/mOdrA40/claude-codex-skills-directory/tree/main/backend-skills/ai-ml-mastery-skill

SKILL.md

AI/ML Mastery (Senior → Principal)

Operate

  • Start by confirming: objective, success metric, data availability, privacy/security constraints, latency and throughput targets, compute budget, deployment target, and the definition of done.
  • Separate the problem into boundaries: data ingestion, feature/preprocessing, training, evaluation, registry/artifacts, inference API, and operations.
  • Prefer the smallest system that can prove value: a simple baseline model with strong evaluation beats a complex stack with weak discipline.
  • Treat ML work as software engineering: reproducibility, observability, rollback, and failure handling are part of the feature.

The goal is not just a high offline metric. The goal is a model-backed backend that is correct, measurable, operable, and safe in production.

Default Standards

  • Keep notebooks for exploration only; production logic belongs in versioned Python modules and tests.
  • Validate schema, dtypes, ranges, nullability, and label quality at the data boundary.
  • Make training and inference preprocessing identical by sharing explicit pipeline code.
  • Prefer typed config objects and immutable runtime settings.
  • Use structured logging and explicit error taxonomy for data, model, dependency, and serving failures.
  • Define latency budgets, timeout behavior, fallback behavior, and model version strategy before exposing public inference endpoints.
  • Default to simpler baselines before large models; earn complexity with measured gains.
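
A minimal sketch of the data-boundary and immutable-config standards above, using only the standard library. The contract contents, `FieldSpec`, `DataContractError`, and `validate_record` are hypothetical names chosen for illustration; a real project might use a schema library instead.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


class DataContractError(ValueError):
    """Raised when a record violates the dataset contract (hypothetical name)."""


@dataclass(frozen=True)  # frozen: specs cannot be mutated at runtime
class FieldSpec:
    """Expected type, nullability, and optional numeric range for one field."""
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical contract, for illustration only.
CONTRACT: Mapping[str, FieldSpec] = {
    "age": FieldSpec(int, min_value=0, max_value=130),
    "income": FieldSpec(float, nullable=True, min_value=0.0),
    "label": FieldSpec(int, min_value=0, max_value=1),
}


def validate_record(record: Mapping[str, Any],
                    contract: Mapping[str, FieldSpec] = CONTRACT) -> None:
    """Raise DataContractError on the first violation; return None if valid."""
    missing = contract.keys() - record.keys()
    if missing:
        raise DataContractError(f"missing fields: {sorted(missing)}")
    for name, spec in contract.items():
        value = record[name]
        if value is None:
            if not spec.nullable:
                raise DataContractError(f"{name}: null not allowed")
            continue
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, spec.dtype):
            raise DataContractError(
                f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
        if spec.min_value is not None and value < spec.min_value:
            raise DataContractError(f"{name}: {value} below {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            raise DataContractError(f"{name}: {value} above {spec.max_value}")
```

Calling `validate_record` once at ingestion and again at the inference API keeps both paths bound to the same contract.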

“Bad vs Good” (common production pitfalls)

python
# ❌ BAD: training and inference use different preprocessing.
train_text = text.lower().strip()
serve_text = text.strip()

# ✅ GOOD: one shared preprocessing pipeline used everywhere.
normalized_text = text_normalizer.normalize(text)

python
# ❌ BAD: silent fallback hides model loading failures.
try:
    model = load_model(path)
except Exception:
    model = None

# ✅ GOOD: fail explicitly or switch to a known degraded mode.
try:
    model = load_model(path)
except FileNotFoundError as error:
    raise ModelBootstrapError(f"model artifact missing: {path}") from error

python
# ❌ BAD: unbounded inference call with no deadline.
prediction = client.predict(payload)

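# ✅ GOOD: explicit deadline and graceful failure mapping.
# Assumes the client raises TimeoutError when the deadline expires;
# InferenceTimeoutError is a hypothetical application-level error.
try:
    prediction = client.predict(payload, timeout=2.0)
except TimeoutError as error:
    raise InferenceTimeoutError("inference exceeded 2.0s deadline") from error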
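
The `text_normalizer` used in the first Bad vs Good pair above is not defined in this skill. One possible shape, as a sketch: a single frozen object that both the training and serving entry points import. The class name, its options, and the two helper functions are assumptions for illustration.

```python
import unicodedata
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TextNormalizer:
    """Single source of truth for text preprocessing (hypothetical class)."""
    lowercase: bool = True
    collapse_whitespace: bool = True

    def normalize(self, text: str) -> str:
        # NFKC folds compatibility characters (e.g. full-width forms).
        text = unicodedata.normalize("NFKC", text).strip()
        if self.lowercase:
            text = text.lower()
        if self.collapse_whitespace:
            text = " ".join(text.split())
        return text


# One shared instance, imported by both training and serving code.
text_normalizer = TextNormalizer()


def build_training_example(raw_text: str, label: int) -> Tuple[str, int]:
    """Training path: normalize with the shared instance."""
    return text_normalizer.normalize(raw_text), label


def build_inference_input(raw_text: str) -> str:
    """Serving path: identical normalization, same instance."""
    return text_normalizer.normalize(raw_text)
```

Because the settings live on a frozen dataclass, the normalizer configuration can be serialized and versioned next to the model artifact.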

Workflow (Feature / Refactor / Bug)

  1. Define the business outcome, online/offline metrics, and failure tolerance.
  2. Establish a reproducible baseline and dataset contract.
  3. Design boundaries between training code, model packaging, and serving code.
  4. Implement the smallest end-to-end slice with tests and evaluation reports.
  5. Validate reproducibility, security, performance, and rollback readiness.
  6. Ship with monitoring for latency, throughput, drift, quality, and cost.

Validation Commands

  • Run python -m pytest.
  • Run python -m ruff check . if Ruff is used.
  • Run python -m mypy src for typed code paths when the repo uses mypy.
  • Run python -m pytest -k inference for serving-critical tests.
  • Run python -m pytest --maxfail=1 --disable-warnings during local debugging.
  • Run smoke evaluation for the current model artifact before release.
  • Run container build validation if inference is deployed via Docker.
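
The smoke evaluation step is not specified further in this skill. As a rough sketch, it could be a small gate that runs the current artifact's predict function over a handful of labeled examples and fails loudly below a threshold. The function name, the exception, and the 0.9 default are assumptions.

```python
from typing import Any, Callable, Sequence, Tuple


class SmokeEvaluationError(RuntimeError):
    """Raised when the artifact fails the smoke gate (hypothetical name)."""


def smoke_evaluate(predict: Callable[[Any], Any],
                   examples: Sequence[Tuple[Any, Any]],
                   min_accuracy: float = 0.9) -> float:
    """Return accuracy over (features, expected) pairs; raise if below the gate."""
    if not examples:
        raise ValueError("smoke evaluation needs at least one example")
    correct = sum(1 for features, expected in examples
                  if predict(features) == expected)
    accuracy = correct / len(examples)
    if accuracy < min_accuracy:
        raise SmokeEvaluationError(
            f"accuracy {accuracy:.3f} below required {min_accuracy:.3f}"
        )
    return accuracy
```

A release script can call this with a small frozen example set and let the exception abort the pipeline.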

Backend-Oriented ML Guardrails

  • Always version models, prompts, tokenizer assets, and preprocessing artifacts together.
  • Do not call external model providers from request paths without timeouts, retries, budgets, and fallback behavior.
  • Separate online inference from heavy offline batch jobs.
  • Prefer async queue-based processing for expensive enrichment, reranking, or embedding backfills.
  • Protect inference endpoints with payload size limits, authn/authz, and rate limiting.
  • Log request IDs, model version, feature version, and decision metadata without leaking raw sensitive payloads.
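
The timeout, retry, fallback, and logging guardrails above can be combined in one wrapper. This is a sketch under stated assumptions: `call_provider` stands in for any blocking provider call, and the function name, outcome labels, and log fields are illustrative. The log line carries metadata only, never the raw payload.

```python
import concurrent.futures
import json
import logging
import time
import uuid
from typing import Any, Callable, Tuple

logger = logging.getLogger("inference")
# Shared pool; a hung provider call keeps occupying a worker until it returns.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def predict_with_deadline(call_provider: Callable[[Any], Any],
                          payload: Any,
                          *,
                          timeout_s: float = 2.0,
                          max_attempts: int = 2,
                          fallback: Any = None,
                          model_version: str = "unknown") -> Tuple[Any, str]:
    """Call the provider within one overall time budget.

    Returns (result, outcome) where outcome is "ok", "timeout", or "error".
    On "timeout" or "error" the result is the fallback value.
    """
    request_id = str(uuid.uuid4())
    started = time.monotonic()
    result, outcome = fallback, "error"
    for _attempt in range(max_attempts):
        remaining = timeout_s - (time.monotonic() - started)
        if remaining <= 0:
            outcome = "timeout"
            break
        future = _executor.submit(call_provider, payload)
        try:
            result, outcome = future.result(timeout=remaining), "ok"
            break
        except concurrent.futures.TimeoutError:
            future.cancel()  # best effort: a running thread cannot be interrupted
            outcome = "timeout"
            break
        except Exception:
            outcome = "error"  # retry while attempts and budget remain
    logger.info(json.dumps({
        "request_id": request_id,
        "model_version": model_version,
        "outcome": outcome,
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
    }))
    return result, outcome
```

Retries share the same budget as the first attempt, so the caller's worst-case wait stays close to `timeout_s` regardless of `max_attempts`.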

Decision Framework: Library Selection

| Task | Default Choice | Use Alternative When |
|---|---|---|
| Deep learning training | PyTorch | TensorFlow for TPU-heavy production, JAX for research-heavy experimentation |
| Classical/tabular ML | scikit-learn | XGBoost/LightGBM for stronger tabular baselines, CatBoost for categorical-heavy data |
| LLM application layer | transformers + sentence-transformers | vLLM for high-throughput serving, llama.cpp for edge or constrained environments |
| Data processing | pandas | polars for larger columnar workloads, dask/spark for distributed pipelines |
| Experiment tracking | MLflow | Weights & Biases or Neptune when team workflows require hosted collaboration |
| Hyperparameter tuning | Optuna | Ray Tune when you need distributed search orchestration |

Architecture Selection Heuristics

text
Text classification          -> DistilBERT for speed, RoBERTa for stronger accuracy
Embeddings / retrieval       -> sentence-transformers or hosted embedding APIs with evaluation gates
Vision classification        -> ResNet/EfficientNet as baseline, ViT when data and budget justify it
Object detection             -> YOLO for speed, DETR/RT-DETR when workflow favors transformer-based designs
Tabular prediction           -> Logistic regression / XGBoost baseline first, deep tabular only if proven necessary
Recommendation               -> retrieval + ranking pipelines, not a single monolithic model by default
Time series                  -> statistical baseline first, then TFT/PatchTST when complexity is justified

Recommended Project Structure

text
project/
├── pyproject.toml
├── README.md
├── src/
│   └── app/
│       ├── config/
│       ├── data/
│       ├── features/
│       ├── models/
│       ├── training/
│       ├── evaluation/
│       ├── inference/
│       ├── serving/
│       └── observability/
├── tests/
├── scripts/
├── configs/
├── notebooks/
└── docker/

Reliability, Security, and Operations

  • Make model bootstrap behavior explicit: fail closed, fail open, or degraded mode.
  • Bound input sizes, token counts, image dimensions, and recursion depth for untrusted requests.
  • Prefer queue-based retries over client-side blind retries for expensive inference.
  • Track feature drift, data freshness, and serving skew between training and production.
  • Keep PII out of prompts, logs, traces, and experiment artifacts unless explicitly required and governed.
  • Store secrets and provider credentials in secret managers, never in notebooks or source files.
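
One way to make bootstrap behavior explicit is to pass the policy as a value rather than burying it in an `except` branch. The sketch below covers two of the three modes named above (fail closed and degraded) and leaves fail open out. `BootstrapPolicy`, `bootstrap_model`, and the choice of caught exceptions are assumptions for illustration.

```python
from enum import Enum
from typing import Any, Callable, Optional, Tuple


class BootstrapPolicy(Enum):
    FAIL_CLOSED = "fail_closed"  # refuse to start without the model
    DEGRADED = "degraded"        # start with a known, simpler fallback


class ModelBootstrapError(RuntimeError):
    """Raised when the model cannot be loaded and no fallback applies."""


def bootstrap_model(load_model: Callable[[str], Any],
                    path: str,
                    policy: BootstrapPolicy,
                    fallback_model: Optional[Any] = None) -> Tuple[Any, str]:
    """Return (model, mode) where mode is "primary" or "degraded"."""
    try:
        return load_model(path), "primary"
    except (OSError, ValueError) as error:  # FileNotFoundError is an OSError
        if policy is BootstrapPolicy.DEGRADED and fallback_model is not None:
            return fallback_model, "degraded"
        raise ModelBootstrapError(f"model bootstrap failed: {path}") from error
```

Returning the mode alongside the model lets the service expose it in health checks and logs, so a degraded start is visible rather than silent.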

Training and Evaluation Checklist

  • Define offline and online success metrics before training
  • Fix random seeds when reproducibility matters
  • Check train/validation/test leakage
  • Validate preprocessing parity between train and serve
  • Save model artifact, config, tokenizer, and feature metadata together
  • Record dataset version and experiment version
  • Benchmark latency, throughput, memory, and cost
  • Define rollback or model disable strategy before release
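
The checklist items on saving artifacts together and recording versions could be backed by a manifest written next to the bundle. A minimal sketch, assuming the model, config, tokenizer, and feature metadata already sit in one directory; the manifest file name and its fields are hypothetical.

```python
import hashlib
import json
from pathlib import Path

MANIFEST_NAME = "manifest.json"  # hypothetical file name


def sha256_of(path: Path) -> str:
    """Hex digest of a file's contents (reads the whole file into memory)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(bundle_dir: Path, *, dataset_version: str,
                   experiment_version: str, seed: int) -> Path:
    """Record versions, seed, and a checksum for every file in the bundle."""
    files = sorted(p for p in bundle_dir.rglob("*")
                   if p.is_file() and p.name != MANIFEST_NAME)
    manifest = {
        "dataset_version": dataset_version,
        "experiment_version": experiment_version,
        "seed": seed,
        "files": {p.relative_to(bundle_dir).as_posix(): sha256_of(p)
                  for p in files},
    }
    manifest_path = bundle_dir / MANIFEST_NAME
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path


def verify_manifest(bundle_dir: Path) -> bool:
    """True only if every listed file exists and matches its recorded digest."""
    manifest = json.loads((bundle_dir / MANIFEST_NAME).read_text())
    return all(
        (bundle_dir / rel).is_file() and sha256_of(bundle_dir / rel) == digest
        for rel, digest in manifest["files"].items()
    )
```

Running `verify_manifest` at model bootstrap catches a tokenizer or config that was swapped without the matching weights.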

References

  • Deep learning systems: references/deep-learning.md
  • Transformers and LLMs: references/transformers-llm.md
  • Computer vision: references/computer-vision.md
  • Classical machine learning: references/machine-learning.md
  • NLP systems: references/nlp.md
  • MLOps and deployment: references/mlops.md
  • Production model serving: references/production-serving.md
  • Evaluation and release guardrails: references/evaluation-and-guardrails.md
  • Retrieval and RAG systems: references/retrieval-and-rag-systems.md
  • Inference reliability and cost control: references/inference-reliability-and-cost.md
