ML Experiment Tracking

Managing ML experiments, metrics, parameters, and artifacts using MLflow, Weights & Biases, and best practices for reproducible ML experiments and model versioning.

Install this agent skill in your project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/model-experiments

Current Level: Advanced
Domain: Data Science / ML / Experimentation

Overview

Experiment tracking records the parameters, metrics, and artifacts of each ML run so that results can be compared and reproduced. This guide covers MLflow, Weights & Biases, and TensorBoard, along with practices for comparing models and keeping experiments reproducible.

Experiment Tracking Importance

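Benefits:

Reproducibility - A run can be re-created from its logged parameters, code version, and environment
Comparison - Runs can be ranked side by side on the same metrics
Collaboration - Teammates work from one shared run history
Model versioning - Each model version is tied to the run that produced it
Hyperparameter optimization - Every trial of a search is recorded with its parameters and score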

MLflow

Installation

bash

pip install mlflow
mlflow ui  # Start UI on http://localhost:5000

Tracking

python

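# MLflow tracking
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Example data (assumption: Iris is used only so the snippet runs end to end;
# substitute your own X_train, X_test, y_train, y_test)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)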

# Set experiment
mlflow.set_experiment("my-experiment")

# Start run
with mlflow.start_run(run_name="random-forest-v1"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("random_state", 42)
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred, average='weighted'))
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
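    # Log artifacts
    import matplotlib.pyplot as plt
    
    plt.figure()
    plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
    plt.xlabel("Feature index")
    plt.ylabel("Importance")
    plt.savefig("feature_importance.png")
    plt.close()  # Release the figure so repeated runs do not accumulate open figures
    mlflow.log_artifact("feature_importance.png")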
    
    print(f"Run ID: {mlflow.active_run().info.run_id}")

Autologging

python

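# Automatic logging
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

# Enable before training; parameters, training metrics, and the fitted model
# are then logged without explicit log_* calls
mlflow.sklearn.autolog()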

# Train model (automatically logged)
model = RandomForestClassifier()
model.fit(X_train, y_train)

Model Registry

python

# Register model
import mlflow

# Log and register model
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="my-model"
    )

# Load registered model
model_uri = "models:/my-model/1"  # Version 1
loaded_model = mlflow.sklearn.load_model(model_uri)

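# Transition model stage
# Note: model stages are deprecated as of MLflow 2.9 in favor of aliases
# (client.set_registered_model_alias and "models:/my-model@<alias>" URIs);
# the stage-based calls below apply to older versions.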
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="my-model",
    version=1,
    stage="Production"
)

# Load production model
model_uri = "models:/my-model/Production"
production_model = mlflow.sklearn.load_model(model_uri)

Projects

yaml

# MLproject file
name: my-ml-project

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      n_estimators: {type: int, default: 100}
      max_depth: {type: int, default: 10}
    command: "python train.py --n_estimators {n_estimators} --max_depth {max_depth}"

# Run project
# mlflow run . -P n_estimators=200 -P max_depth=15
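
The `command` entry above passes `--n_estimators` and `--max_depth` to `train.py`, whose contents the source does not show. The following is a minimal sketch of how such a script might parse those flags, assuming standard `argparse` and defaults that mirror the MLproject file. The function name `parse_args` is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags that the MLproject command passes to train.py."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    # Defaults mirror the MLproject parameter defaults (assumption)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    return parser.parse_args(argv)


# Example: the values supplied by `mlflow run . -P n_estimators=200 -P max_depth=15`
args = parse_args(["--n_estimators", "200", "--max_depth", "15"])
print(args.n_estimators, args.max_depth)
```

In a real script, calling `parse_args()` with no argument reads the flags from the command line.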

Weights & Biases

python

# Weights & Biases tracking
import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Initialize run
wandb.init(
    project="my-project",
    name="random-forest-v1",
    config={
        "n_estimators": 100,
        "max_depth": 10,
        "learning_rate": 0.01
    }
)

# Train model
model = RandomForestClassifier(
    n_estimators=wandb.config.n_estimators,
    max_depth=wandb.config.max_depth
)
model.fit(X_train, y_train)

# Log metrics
y_pred = model.predict(X_test)
wandb.log({
    "accuracy": accuracy_score(y_test, y_pred),
    "f1_score": f1_score(y_test, y_pred, average='weighted')
})

# Log confusion matrix
wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=class_names)

# Log feature importance
wandb.sklearn.plot_feature_importances(model, feature_names)

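# Save model (write the file first; wandb.save uploads an existing file)
import joblib
joblib.dump(model, 'model.pkl')
wandb.save('model.pkl')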

# Finish run
wandb.finish()

Hyperparameter Sweeps

python

# W&B sweep configuration
sweep_config = {
    'method': 'bayes',  # or 'grid', 'random'
    'metric': {
        'name': 'accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'n_estimators': {
            'values': [50, 100, 200]
        },
        'max_depth': {
            'min': 5,
            'max': 20
        },
        'learning_rate': {
            'distribution': 'log_uniform',
            'min': -5,
            'max': 0
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Training function
def train():
    wandb.init()
    
    # Get hyperparameters
    config = wandb.config
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth
    )
    model.fit(X_train, y_train)
    
    # Evaluate
    accuracy = model.score(X_test, y_test)
    wandb.log({"accuracy": accuracy})

# Run sweep
wandb.agent(sweep_id, train, count=10)
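
In the sweep configuration above, `log_uniform` interprets `min` and `max` as natural-log exponents, so `min: -5, max: 0` draws values between exp(-5) and exp(0) = 1. The sketch below reproduces that sampling rule with the standard library only. It illustrates the distribution and is not W&B's own sampler; the function name is hypothetical.

```python
import math
import random


def sample_log_uniform(min_exp, max_exp, rng):
    """Draw x such that ln(x) is uniform on [min_exp, max_exp]."""
    return math.exp(rng.uniform(min_exp, max_exp))


rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_log_uniform(-5, 0, rng) for _ in range(1000)]

low, high = math.exp(-5), math.exp(0)
print(f"allowed range: [{low:.5f}, {high:.1f}]")
print(f"observed min={min(samples):.5f}, max={max(samples):.5f}")
```

When bounds are easier to state as actual values rather than exponents, W&B also accepts the `log_uniform_values` distribution.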

TensorBoard

python

# TensorBoard logging
from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn

# Create writer
writer = SummaryWriter('runs/experiment_1')

# Log scalars
for epoch in range(100):
    loss = train_epoch(model, train_loader)
    accuracy = evaluate(model, test_loader)
    
    writer.add_scalar('Loss/train', loss, epoch)
    writer.add_scalar('Accuracy/test', accuracy, epoch)

# Log model graph
writer.add_graph(model, input_tensor)

# Log images
writer.add_image('predictions', img_grid, epoch)

# Log histograms
for name, param in model.named_parameters():
    writer.add_histogram(name, param, epoch)

# Close writer
writer.close()

# View in TensorBoard
# tensorboard --logdir=runs

Metrics Logging

python

# Comprehensive metrics logging
import mlflow


class MetricsLogger:
    def __init__(self, experiment_name: str):
        self.experiment_name = experiment_name
        mlflow.set_experiment(experiment_name)
    
    def log_classification_metrics(self, y_true, y_pred, y_prob=None):
        """Log classification metrics"""
        from sklearn.metrics import (
            accuracy_score,
            precision_score,
            recall_score,
            f1_score,
            roc_auc_score,
            confusion_matrix
        )
        
        metrics = {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, average='weighted'),
            "recall": recall_score(y_true, y_pred, average='weighted'),
            "f1_score": f1_score(y_true, y_pred, average='weighted')
        }
        
        if y_prob is not None:
            metrics["roc_auc"] = roc_auc_score(y_true, y_prob, multi_class='ovr')
        
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
        
        # Log confusion matrix
        cm = confusion_matrix(y_true, y_pred)
        self.plot_confusion_matrix(cm)
    
    def log_regression_metrics(self, y_true, y_pred):
        """Log regression metrics"""
        from sklearn.metrics import (
            mean_squared_error,
            mean_absolute_error,
            r2_score
        )
        
        mse = mean_squared_error(y_true, y_pred)
        metrics = {
            "mse": mse,
            # Take the square root directly; the squared=False argument
            # was deprecated in scikit-learn 1.4
            "rmse": mse ** 0.5,
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred)
        }
        
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
    
    def plot_confusion_matrix(self, cm):
        """Plot and log confusion matrix"""
        import matplotlib.pyplot as plt
        import seaborn as sns
        
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.savefig('confusion_matrix.png')
        mlflow.log_artifact('confusion_matrix.png')
        plt.close()

Hyperparameter Tracking

python

# Track hyperparameters with Optuna + MLflow
import optuna
import mlflow
from sklearn.ensemble import GradientBoostingClassifier

def objective(trial):
    """Optuna objective function"""
    with mlflow.start_run(nested=True):
        # Suggest hyperparameters
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 200),
            'max_depth': trial.suggest_int('max_depth', 5, 20),
            # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
            'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)
        }
        
        # Log parameters
        mlflow.log_params(params)
        
        # Train model (gradient boosting is used here because
        # RandomForestClassifier does not accept learning_rate)
        model = GradientBoostingClassifier(**params)
        model.fit(X_train, y_train)
        
        # Evaluate
        accuracy = model.score(X_test, y_test)
        
        # Log metric
        mlflow.log_metric("accuracy", accuracy)
        
        return accuracy

# Run optimization inside a parent run so each trial is recorded as a nested child run
study = optuna.create_study(direction='maximize')
with mlflow.start_run(run_name="optuna-study"):
    study.optimize(objective, n_trials=100)

# Best parameters
print(f"Best params: {study.best_params}")
print(f"Best value: {study.best_value}")

Artifact Storage

python

# Store various artifacts
with mlflow.start_run():
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log dataset
    mlflow.log_artifact("data/train.csv", "datasets")
    
    # Log plots
    mlflow.log_artifact("plots/feature_importance.png", "plots")
    
    # Log configuration
    import json
    with open("config.json", "w") as f:
        json.dump(config, f)
    mlflow.log_artifact("config.json")
    
    # Log dictionary as JSON
    mlflow.log_dict({"key": "value"}, "metadata.json")
    
    # Log text
    mlflow.log_text("Model description", "description.txt")
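
Logging a dataset as an artifact stores a copy, but a short fingerprint logged as a parameter makes it easy to see whether two runs used the same data. The following is a minimal sketch, assuming a SHA-256 of the file contents is an acceptable fingerprint. The helper name `file_sha256` and the parameter name in the usage note are hypothetical.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file standing in for data/train.csv
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"feature,label\n1.0,0\n2.0,1\n")
    demo_path = tmp.name

fingerprint = file_sha256(demo_path)
os.remove(demo_path)
print(fingerprint[:12])
```

Inside a run, the digest could be recorded with `mlflow.log_param("train_data_sha256", file_sha256("data/train.csv"))`.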

Experiment Comparison

python

# Compare experiments
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get experiment
experiment = client.get_experiment_by_name("my-experiment")

# Get all runs
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"],
    max_results=10
)

# Compare runs
import pandas as pd

comparison = []
for run in runs:
    comparison.append({
        'run_id': run.info.run_id,
        'accuracy': run.data.metrics.get('accuracy'),
        'f1_score': run.data.metrics.get('f1_score'),
        'n_estimators': run.data.params.get('n_estimators'),
        'max_depth': run.data.params.get('max_depth')
    })

df = pd.DataFrame(comparison)
print(df)
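
MLflow returns parameter values as strings, so `n_estimators` and `max_depth` in the table above are text until converted. The following is a minimal sketch of post-processing such rows with the standard library, assuming rows shaped like the `comparison` list. The sample values and the helper names `to_number` and `best_run` are made up for illustration.

```python
PARAM_KEYS = ("n_estimators", "max_depth")


def to_number(value):
    """Convert a string to int or float where possible; otherwise return it unchanged."""
    if not isinstance(value, str):
        return value
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def best_run(rows, metric="accuracy"):
    """Return the row with the highest value of `metric`, skipping rows that lack it."""
    scored = [r for r in rows if r.get(metric) is not None]
    return max(scored, key=lambda r: r[metric]) if scored else None


# Made-up rows in the same shape as `comparison`
rows = [
    {"run_id": "a1", "accuracy": 0.91, "n_estimators": "100", "max_depth": "10"},
    {"run_id": "b2", "accuracy": 0.94, "n_estimators": "200", "max_depth": "15"},
    {"run_id": "c3", "accuracy": None, "n_estimators": "50", "max_depth": "5"},
]

# Convert only the parameter columns; run IDs and metrics are left untouched
typed_rows = [
    {k: to_number(v) if k in PARAM_KEYS else v for k, v in r.items()}
    for r in rows
]
winner = best_run(typed_rows)
print(winner["run_id"], winner["n_estimators"])
```

Restricting the conversion to known parameter columns avoids accidentally turning an identifier that happens to look numeric into a number.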

Reproducibility

python

# Ensure reproducibility
import random

import mlflow
import numpy as np
import torch

def set_seed(seed=42):
    """Set random seeds for reproducibility"""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Log environment
set_seed(42)

with mlflow.start_run():
    # Log seed
    mlflow.log_param("random_seed", 42)
    
    # Log Python version
    import sys
    mlflow.log_param("python_version", sys.version)
    
    # Log package versions (importlib.metadata replaces the deprecated pkg_resources)
    from importlib.metadata import distributions
    packages = sorted(f"{d.metadata['Name']}=={d.version}" for d in distributions())
    mlflow.log_text("\n".join(packages), "requirements.txt")
    
    # Log git commit
    import subprocess
    commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
    mlflow.log_param("git_commit", commit)
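
The `git rev-parse` call above raises an exception when the code runs outside a git repository or when git is not installed. The following is a sketch of a more forgiving lookup, assuming that recording "unknown" is acceptable in those cases. The helper name `current_git_commit` is hypothetical.

```python
import subprocess


def current_git_commit(cwd=None):
    """Return the HEAD commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=cwd,
            stderr=subprocess.DEVNULL,  # keep git's error text out of the logs
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a repository, git missing, or the working directory is invalid
        return None
    return out.decode().strip()


commit = current_git_commit()
print(commit or "unknown")
```

The result can then be logged with `mlflow.log_param("git_commit", commit or "unknown")` so that a missing repository does not abort the run.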

Best Practices

Track Everything - Log params, metrics, artifacts
Naming - Use descriptive experiment names
Versioning - Version datasets and models
Reproducibility - Set seeds and log environment
Comparison - Compare experiments systematically
Cleanup - Archive old experiments
Documentation - Document experiment goals
Collaboration - Share experiments with team
Automation - Automate logging
Storage - Manage artifact storage

Quick Start

MLflow Tracking

python

import mlflow

mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)
    
    # Train model (train_model and evaluate_model are placeholders for your own code)
    model = train_model(X_train, y_train)
    
    # Log metrics
    accuracy = evaluate_model(model, X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")

Production Checklist

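Experiment Tracking: Every training job runs inside a named experiment
Parameter Logging: All hyperparameters are logged, including the random seed
Metric Logging: Metrics use the same names across runs
Artifact Storage: Models, plots, and configuration files are stored as artifacts
Reproducibility: Seeds, Python version, package versions, and git commit are recorded
Comparison: Runs are compared on a common metric
Cleanup: Old experiments are archived
Documentation: Each experiment's goal is documented
Collaboration: Experiments are shared with the team
Automation: Logging is automated, for example with autologging
Storage: Artifact storage is managed
Versioning: Models are versioned in the model registry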

Anti-patterns

❌ Don't: No Tracking

python

# ❌ Bad - No tracking
model = train_model(X, y)
# No record of what was done!

python

# ✅ Good - Track everything
with mlflow.start_run():
    mlflow.log_params(params)
    model = train_model(X, y)
    mlflow.log_metrics(metrics)
    mlflow.sklearn.log_model(model, "model")

❌ Don't: Inconsistent Logging

python

# ❌ Bad - Inconsistent
mlflow.log_metric("acc", 0.95)       # run 1
mlflow.log_metric("accuracy", 0.96)  # run 2
# Different metric names!

python

# ✅ Good - Consistent
mlflow.log_metric("accuracy", 0.95)  # run 1
mlflow.log_metric("accuracy", 0.96)  # run 2
# Same metric names
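
One way to keep metric names consistent across runs is to route every metric through a small alias table before logging. The following is a minimal sketch, assuming a project-level mapping that you would define yourself. The `METRIC_ALIASES` table and the `canonical_metrics` helper are hypothetical and not part of MLflow.

```python
# Hypothetical alias table: left side is a variant, right side is the canonical name
METRIC_ALIASES = {
    "acc": "accuracy",
    "f1": "f1_score",
    "auc": "roc_auc",
}


def canonical_metrics(metrics):
    """Rename known aliases to canonical names and reject conflicting duplicates."""
    result = {}
    for name, value in metrics.items():
        canonical = METRIC_ALIASES.get(name, name)
        if canonical in result:
            raise ValueError(f"metric {canonical!r} supplied more than once")
        result[canonical] = value
    return result


print(canonical_metrics({"acc": 0.95, "f1": 0.93}))
```

The returned dictionary can be passed to `mlflow.log_metrics(...)`, so every run reports `accuracy` regardless of which spelling the training code used.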

Integration Points

Model Training (05-ai-ml-core/model-training/) - Training process
Feature Engineering (39-data-science-ml/feature-engineering/) - Features
ML Serving (39-data-science-ml/ml-serving/) - Model deployment

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/model-experiments
License: MIT License
