bio-machine-learning-model-validation

Install this agent skill to your Project

npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-machine-learning-model-validation

SKILL.md

name: bio-machine-learning-model-validation
description: Implements nested cross-validation and stratified splits for unbiased model evaluation on biomedical datasets. Prevents data leakage and overfitting in biomarker discovery. Use when validating classifiers or optimizing hyperparameters on omics data.
tool_type: python
primary_tool: sklearn
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command

Cross-Validation for Biomedical Data

Why Nested CV Matters

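On small omics datasets, a single train/test split gives a noisy performance estimate, and reporting the best score from a hyperparameter search run on the same folds used for evaluation is optimistically biased. Nested CV reduces this bias by separating hyperparameter tuning (inner loop) from performance evaluation (outer loop), so every outer test fold is scored by a model that never saw it during tuning.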

Nested Cross-Validation

python

from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [5, 10, None]
}

# Outer CV: performance estimation (5 folds)
# Inner CV: hyperparameter tuning (3 folds)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

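# Convert to arrays so positional indexing behaves the same for
# DataFrames, Series, and ndarrays (y[train_idx] on a Series is label-based)
X_arr, y_arr = np.asarray(X), np.asarray(y)

nested_scores = []
for train_idx, test_idx in outer_cv.split(X_arr, y_arr):
    X_train, X_test = X_arr[train_idx], X_arr[test_idx]
    y_train, y_test = y_arr[train_idx], y_arr[test_idx]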

    grid = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='roc_auc', n_jobs=-1)
    grid.fit(X_train, y_train)
    score = grid.score(X_test, y_test)
    nested_scores.append(score)

print(f'Nested CV AUC: {np.mean(nested_scores):.3f} +/- {np.std(nested_scores):.3f}')
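The explicit loop above can also be written more compactly, because GridSearchCV is itself an estimator and can be passed to cross_val_score. The following is a minimal sketch under stated assumptions: the data are synthetic (make_classification stands in for a real omics matrix), the grid is smaller than the one above to keep the run short, and the names X_demo, y_demo, demo_pipe, and demo_grid are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for an omics matrix (80 samples x 50 features)
X_demo, y_demo = make_classification(
    n_samples=80, n_features=50, n_informative=5, random_state=0
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])
# Assumption: a reduced grid, chosen only to keep the example fast
demo_grid = {"clf__n_estimators": [25, 50], "clf__max_depth": [3, None]}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each outer fold refits the whole grid search on its own training portion,
# so the outer test fold never influences the chosen hyperparameters
search = GridSearchCV(demo_pipe, demo_grid, cv=inner, scoring="roc_auc")
nested_demo_scores = cross_val_score(
    search, X_demo, y_demo, cv=outer, scoring="roc_auc"
)

print(f"Nested CV AUC: {nested_demo_scores.mean():.3f} "
      f"+/- {nested_demo_scores.std():.3f}")
```

The explicit loop is still the better choice when the per-fold best parameters or fitted models need to be inspected.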

Stratified K-Fold

python

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratify so every fold keeps the overall class ratio (matters most under class imbalance)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
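To see what stratification changes, the sketch below counts the minority-class samples that land in each test fold with and without it. The labels are a toy assumption (90 controls, 10 cases) and the feature matrix is a placeholder, since only the labels affect the split.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: toy labels with 90 controls (0) and 10 cases (1)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # placeholder features; splitting ignores their values

plain = KFold(n_splits=5, shuffle=True, random_state=42)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Number of cases that end up in each test fold
plain_counts = [int(y_toy[test].sum()) for _, test in plain.split(X_toy)]
strat_counts = [int(y_toy[test].sum()) for _, test in strat.split(X_toy, y_toy)]

print("Cases per test fold, KFold:          ", plain_counts)
print("Cases per test fold, StratifiedKFold:", strat_counts)
```

With StratifiedKFold each test fold receives exactly two of the ten cases. Plain KFold gives no such guarantee, and a fold with very few or no cases makes per-fold AUC unstable or undefined.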

Repeated Stratified K-Fold

python

from sklearn.model_selection import RepeatedStratifiedKFold

# More robust estimate with multiple repeats
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'Repeated CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
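A repeated splitter returns one score per fold per repetition, so the 5 x 10 setting above yields 50 scores. The sketch below, on synthetic data with a LogisticRegression chosen only for speed (both are assumptions, not part of the skill), regroups the scores by repetition to show how much the estimate moves when only the random partition changes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Assumption: synthetic data standing in for a real dataset
X_rep, y_rep = make_classification(n_samples=100, n_features=20, random_state=0)

n_splits, n_repeats = 5, 10
rep_cv = RepeatedStratifiedKFold(
    n_splits=n_splits, n_repeats=n_repeats, random_state=42
)
rep_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_rep, y_rep, cv=rep_cv, scoring="roc_auc"
)

# Splits are generated repetition by repetition, so rows are repetitions
per_repeat_mean = rep_scores.reshape(n_repeats, n_splits).mean(axis=1)

print(f"Total scores: {rep_scores.size}")
print(f"Mean AUC over all folds: {rep_scores.mean():.3f}")
print(f"Spread of per-repetition means: {per_repeat_mean.std():.3f}")
```

The 50 scores are not independent, because every repetition reuses the same samples, so their standard deviation is best read as a stability indicator rather than as a standard error.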

Leave-One-Out (Small Datasets)

python

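from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

# Use for very small datasets (n < 30)
# Each test fold holds a single sample, so a per-fold AUC is undefined;
# pool the out-of-fold probabilities and compute one AUC over all samples
loo = LeaveOneOut()
y_pred = cross_val_predict(pipe, X, y, cv=loo, method='predict_proba')[:, 1]
auc = roc_auc_score(y, y_pred)
print(f'LOO AUC: {auc:.3f}')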
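The reason for using cross_val_predict rather than cross_val_score here is that each leave-one-out test fold contains one sample, and AUC cannot be computed from a single label. The sketch below checks the fold sizes and computes the pooled AUC on synthetic data; the 24-sample dataset and the LogisticRegression classifier are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Assumption: a tiny synthetic dataset (24 samples x 10 features)
X_small, y_small = make_classification(
    n_samples=24, n_features=10, n_informative=3, random_state=0
)

loo_demo = LeaveOneOut()

# One split per sample; every test fold holds exactly one sample
fold_sizes = [len(test) for _, test in loo_demo.split(X_small)]

# Out-of-fold probability of the positive class for every sample
proba = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X_small, y_small, cv=loo_demo, method="predict_proba",
)[:, 1]

# A single AUC computed over the pooled out-of-fold predictions
pooled_auc = roc_auc_score(y_small, proba)
print(f"{len(fold_sizes)} folds, pooled LOO AUC: {pooled_auc:.3f}")
```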

Group-Aware Splits

python

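from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# When samples from the same patient/batch must stay together
# meta is assumed to be a sample metadata table with one row per row of X
groups = meta['patient_id'].values
group_cv = GroupKFold(n_splits=5)  # or LeaveOneGroupOut() to hold out one patient/batch at a time
scores = cross_val_score(pipe, X, y, cv=group_cv, groups=groups, scoring='roc_auc')
print(f'Group CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')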
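A quick way to confirm that a group-aware splitter is doing its job is to check that no patient appears on both sides of any split. The sketch below does this on a hypothetical layout of 10 patients with 3 samples each; patient_ids, X_grp, and y_grp are illustrative names, not part of the skill.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical layout: 10 patients, 3 repeated samples per patient
patient_ids = np.repeat(np.arange(10), 3)
X_grp = np.random.default_rng(0).normal(size=(30, 5))
# One label per patient, shared by all of that patient's samples
y_grp = np.repeat(np.tile([0, 1], 5), 3)

gkf = GroupKFold(n_splits=5)

overlaps = []
for train_idx, test_idx in gkf.split(X_grp, y_grp, groups=patient_ids):
    # Patients present in both the training and the test portion of this fold
    shared = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
    overlaps.append(len(shared))

print("Patients shared between train and test, per fold:", overlaps)
```

Every entry is zero. Note that GroupKFold does not stratify by class; where both constraints matter, scikit-learn 1.0 and later also provide StratifiedGroupKFold.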

CV Strategy Selection

Scenario	Strategy	Notes
n > 100	StratifiedKFold(5)	Standard choice
n = 50-100	StratifiedKFold(10)	More training data per fold
n < 30	LeaveOneOut	Maximum training data; score pooled predictions
Repeated measures	GroupKFold	Keep patients together
High variance	RepeatedStratifiedKFold	More stable estimates
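The table can be turned into a small selection helper. The function below is a hypothetical sketch, not part of the skill: the name choose_cv, its arguments, and the order in which the rules are checked are all assumptions, and because the table says nothing about n between 30 and 49 or n exactly 100, the sketch falls back to 10 folds there.

```python
from sklearn.model_selection import (
    GroupKFold,
    LeaveOneOut,
    RepeatedStratifiedKFold,
    StratifiedKFold,
)


def choose_cv(n_samples, has_groups=False, high_variance=False, random_state=42):
    """Hypothetical helper that maps the strategy table to a splitter."""
    if has_groups:
        # Repeated measures: keep each patient's samples in one fold
        return GroupKFold(n_splits=5)
    if n_samples < 30:
        return LeaveOneOut()
    if high_variance:
        return RepeatedStratifiedKFold(
            n_splits=5, n_repeats=10, random_state=random_state
        )
    if n_samples > 100:
        return StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    # Assumption: the table is silent on n = 30-49, so 10 folds cover 30-100
    return StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)


for n in (20, 60, 200):
    print(n, type(choose_cv(n)).__name__)
```

Group structure is checked first in this sketch because leaking a patient across folds invalidates the estimate regardless of dataset size.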

Avoiding Data Leakage

python

# WRONG: Feature selection before CV
# selected = SelectKBest(k=100).fit_transform(X, y)  # Leaks info!
# scores = cross_val_score(clf, selected, y, cv=cv)

# CORRECT: Feature selection inside CV
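from sklearn.feature_selection import SelectKBest, f_classif

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=100)),  # Refit on each training fold; X needs at least 100 features
    ('clf', RandomForestClassifier(random_state=42))  # Fixed seed for reproducible scores
])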
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
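The effect of this leak is easiest to see on data with no signal at all. The sketch below is an illustration under stated assumptions: the features are pure Gaussian noise, the labels are independent of them, and LogisticRegression is used only because it is fast. An honest evaluation should land near chance, while selecting features on the full dataset first makes the same noise look predictive.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(50, 2000))  # assumption: pure noise, no real signal
y_noise = np.array([0, 1] * 25)        # labels unrelated to X_noise

leak_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Leaky: features are chosen using the labels of every sample,
# including those that later serve as test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise,
    cv=leak_cv, scoring="roc_auc",
).mean()

# Honest: selection is refit on each training fold inside the pipeline
honest_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(
    honest_pipe, X_noise, y_noise, cv=leak_cv, scoring="roc_auc"
).mean()

print(f"Leaky AUC:  {leaky_auc:.3f}")
print(f"Honest AUC: {honest_auc:.3f}")
```

The same reasoning applies to any step fitted on the data, such as scaling, imputation, or batch correction: if it is fitted before the split, the test folds have already shaped the model.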

Related Skills

machine-learning/omics-classifiers - Model training
experimental-design/multiple-testing - Multiple hypothesis correction
machine-learning/biomarker-discovery - Feature selection within CV

Maintainer

FreedomIntelligence Core maintainer

Source details

Full Name: FreedomIntelligence/OpenClaw-Medical-Skills
Branch: main
Path in repo: skills/bio-machine-learning-model-validation
Topics: claude-code skills openclaw awesome clawhub openclaw-skills medical nanoclaw

