Agent skill

scientific-data-preprocessing

⚠️ CRITICAL USER EXPERIENCE-BASED SKILL - ALWAYS CONSULT BEFORE DATA PREPROCESSING ⚠️ Prevents catastrophic errors (88.9% error rate in V1.0 case study) through multi-level feature analysis, data leakage detection, and semantic validation. MANDATORY for: data preprocessing, feature engineering, standardization, normalization, interpolation, missing value handling, feature selection, or ANY data transformation task. Covers grouped time-series, cross-sectional, panel data. Detects: time travel leakage, causal inversion, ID misuse, semantic-numeric fallacies, distribution blindness. User's hard-won lessons from real project failures.

Install this agent skill to your Project

npx add-skill https://github.com/foryourhealth111-pixel/Vibe-Skills/tree/main/bundled/skills/scientific-data-preprocessing

SKILL.md

Scientific Data Preprocessing Skill

⚠️ CRITICAL: USER'S HARD-WON EXPERIENCE - MANDATORY CONSULTATION ⚠️

This skill encapsulates painful lessons learned from real preprocessing disasters (88.9% error rate documented). ALWAYS use this skill for planning, reflection, and validation when ANY data preprocessing is involved.

Why this skill is mandatory:

  • Based on actual project failures (V1.0, V2.0 case studies)
  • Prevents data leakage that causes production disasters
  • Catches semantic errors AI agents commonly make
  • Saves weeks of debugging and model retraining

When to invoke (DO NOT SKIP):

  • ✅ Before starting ANY data preprocessing task
  • ✅ During preprocessing for reflection and validation
  • ✅ After preprocessing for comprehensive audit
  • ✅ When reviewing AI-generated preprocessing code

Core Mission

Prevent catastrophic preprocessing errors in grouped time-series data by applying multi-level feature analysis and respecting data structure boundaries.

When to Use This Skill

MANDATORY consultation - trigger immediately when:

Data Preprocessing Tasks (ALWAYS)

  • Any data cleaning, transformation, or preparation work
  • Loading and preparing data for modeling
  • Creating training/test splits
  • Handling missing values (imputation, deletion)
  • Feature scaling/normalization/standardization
  • Encoding categorical variables
  • Feature engineering or construction
  • Feature selection or dimensionality reduction

Data Structure Types (ALWAYS)

  • Preprocessing time-series data with natural groupings (matches, sessions, patients, experiments)
  • Sports analytics (tennis, basketball, etc.)
  • Medical/clinical data with patient groupings
  • Panel data or longitudinal studies
  • Any grouped/hierarchical data structure

Quality Assurance (ALWAYS)

  • Auditing existing preprocessing for data leakage or semantic errors
  • Reviewing AI-generated preprocessing code for common pitfalls
  • Validating preprocessing before model training
  • Debugging unexpected model performance

Critical Checkpoints (NEVER SKIP)

  • BEFORE: Planning preprocessing strategy
  • DURING: Reflecting on decisions and checking for errors
  • AFTER: Comprehensive validation and audit

Trigger keywords that MUST invoke this skill:

  • "preprocess", "preprocessing", "data cleaning", "data preparation"
  • "standardize", "normalize", "scale", "transform"
  • "impute", "fill missing", "handle NaN"
  • "encode", "one-hot", "categorical"
  • "feature engineering", "feature selection", "feature construction"
  • "train test split", "cross validation split"
  • "interpolate", "smooth", "aggregate"

Not For / Boundaries

This skill does NOT:

  • Handle purely cross-sectional data (ungrouped, single timepoint)
  • Make domain-specific feature engineering decisions (you decide business logic)
  • Choose ML models (focuses on preprocessing only)
  • Handle distributed/big data infrastructure (assumes data fits in memory)

Required inputs before proceeding:

  1. Confirmation that data has groups (e.g., match_id, patient_id, session_id)
  2. Understanding of whether goal is within-group (relative) or cross-group (absolute) comparison
  3. Domain constraints on data ranges/units
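
The three required inputs above can be checked programmatically before any transformation runs. The following is a minimal sketch, not part of the original skill: the helper name `check_required_inputs` and its parameters are hypothetical, and the column names are borrowed from the tennis examples in this document.

```python
import pandas as pd

def check_required_inputs(df, group_col, goal, valid_ranges):
    """Hypothetical helper: verify the three required inputs before preprocessing.

    group_col    -- name of the grouping column (e.g. 'match_id')
    goal         -- 'relative' (within-group) or 'absolute' (cross-group)
    valid_ranges -- dict mapping column name -> (min, max) domain constraint
    """
    problems = []

    # 1. The data must actually have groups
    if group_col not in df.columns:
        problems.append(f"missing group column: {group_col}")
    elif df[group_col].nunique() < 2:
        problems.append(f"{group_col} has fewer than 2 groups")

    # 2. The comparison goal must be stated explicitly
    if goal not in ("relative", "absolute"):
        problems.append(f"goal must be 'relative' or 'absolute', got {goal!r}")

    # 3. Domain constraints must refer to real columns and hold for the data
    for col, (lo, hi) in valid_ranges.items():
        if col not in df.columns:
            problems.append(f"range given for unknown column: {col}")
        elif not df[col].dropna().between(lo, hi).all():
            problems.append(f"{col} has values outside [{lo}, {hi}]")

    return problems


demo = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "speed_mph": [95.0, 110.0, None, 250.0],  # 250 violates the range below
})
problems = check_required_inputs(
    demo, "match_id", "relative", {"speed_mph": (50, 165)}
)
print(problems)
```

Returning a list of problems instead of raising on the first one lets a caller see every unmet requirement at once.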

Quick Reference

Multi-Level Feature Analysis Framework

Level 1: Data Type

python
# Check data types
df.dtypes  # int64, float64, object, etc.

Level 2: Feature Type Classification

python
# Binary (0/1)
binary_features = [col for col in df.columns if df[col].nunique() == 2]

# Categorical (finite discrete values)
categorical_features = [col for col in df.select_dtypes(include='object').columns]

# Continuous (infinite possible values)
continuous_features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns
                       if df[col].nunique() > 10]

Level 3: Data Structure

python
# Check for grouping
print(f"Number of groups: {df['group_id'].nunique()}")
print(f"Avg points per group: {df.groupby('group_id').size().mean():.1f}")

# Check for time-series
df_sorted = df.sort_values(['group_id', 'timestamp'])

Level 4: Physical Meaning

python
# Validate physical ranges
assert df['speed_mph'].max() < 200, "Speed exceeds physical limit"
assert df['distance_meters'].min() >= 0, "Negative distance impossible"

Critical Processing Decision Tree

python
# Decision: Within-group or global processing?
def choose_processing_scope(data, feature, goal):
    """
    goal = 'relative' → within-group (e.g., "this point was intense FOR THIS MATCH")
    goal = 'absolute' → global (e.g., "this was an intense point OVERALL")
    """
    if goal == 'relative':
        return 'within_group'
    elif goal == 'absolute':
        return 'global'
    else:
        raise ValueError("Goal must be 'relative' or 'absolute'")

Pattern 1: Within-Group Interpolation (CORRECT)

python
from scipy.interpolate import CubicSpline
import numpy as np

# ✅ CORRECT: Interpolate within each group
for group_id in df['match_id'].unique():
    mask = df['match_id'] == group_id
    group_data = df.loc[mask, 'speed_mph']

    # Positions (0..n-1 within the group) of valid and missing points
    valid = group_data.notna().to_numpy()
    valid_positions = np.flatnonzero(valid)
    missing_positions = np.flatnonzero(~valid)

    # A cubic spline needs several valid points; skip sparse or complete groups
    if len(valid_positions) >= 4 and len(missing_positions) > 0:
        cs = CubicSpline(valid_positions, group_data.to_numpy()[valid])
        # Assign by row label (assumes a unique index) so the fill stays in this group.
        # Note: CubicSpline extrapolates by default for gaps at the group edges.
        df.loc[group_data.index[~valid], 'speed_mph'] = cs(missing_positions)

Pattern 2: Global Interpolation (WRONG - Don't Do This)

python
# ❌ WRONG: Cross-group interpolation
# This interpolates between match A's last point and match B's first point!
cs = CubicSpline(
    np.where(df['speed_mph'].notna())[0],  # ❌ All indices globally
    df['speed_mph'].dropna().values
)
df.loc[df['speed_mph'].isna(), 'speed_mph'] = cs(
    np.where(df['speed_mph'].isna())[0]
)
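
The contamination described in Pattern 2 is easy to reproduce on a toy frame. The sketch below is an illustration only: it uses pandas' linear interpolation instead of `CubicSpline` to keep the example small, and the two three-point matches are made-up data.

```python
import numpy as np
import pandas as pd

# Two tiny matches; match B starts with a missing value
df_demo = pd.DataFrame({
    "match_id": ["A", "A", "A", "B", "B", "B"],
    "speed_mph": [100.0, 110.0, 120.0, np.nan, 80.0, 90.0],
})

# Global linear interpolation: the gap at the start of match B is filled
# from match A's last point (120) and match B's next point (80)
global_fill = df_demo["speed_mph"].interpolate(method="linear")

# Within-group interpolation: match B has no earlier point, so the leading
# gap stays NaN instead of borrowing from match A
within_fill = df_demo.groupby("match_id")["speed_mph"].transform(
    lambda s: s.interpolate(method="linear")
)

print("global:", global_fill.iloc[3])
print("within:", within_fill.iloc[3])
```

The global version silently invents a value of 100.0 that depends on a different match, while the within-group version leaves the gap visible so it can be handled by an explicit decision.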

Pattern 3: Within-Group Standardization (for Relative Analysis)

python
from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: Standardize within each match
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    scaler = StandardScaler()

    df.loc[mask, 'distance_run_std_within'] = scaler.fit_transform(
        df.loc[mask, ['distance_run']]
    ).ravel()  # flatten the (n, 1) result to match the 1-D column selection

# Interpretation: z=+2 means "2 std above average FOR THIS MATCH"
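
The same within-group z-score can also be written without an explicit loop. This is a sketch of an alternative, not the original author's code: the helper `zscore` and the toy data are assumptions, and it relies only on pandas' `groupby(...).transform`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2, 2],
    "distance_run": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

def zscore(s):
    # ddof=0 matches StandardScaler, which divides by the population std
    std = s.std(ddof=0)
    # Guard against constant groups, where the std is 0
    return (s - s.mean()) / std if std > 0 else s * 0.0

df_demo["distance_run_std_within"] = (
    df_demo.groupby("match_id")["distance_run"].transform(zscore)
)

print(df_demo)
```

Although match 2 covers ten times the distance of match 1, both matches receive identical z-scores here, which is exactly the "relative to this match" interpretation.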

Pattern 4: Global Standardization (for Absolute Comparison)

python
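from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: Global standardization (when appropriate)
# If a train/test split is planned, fit the scaler on the training rows only.
scaler = StandardScaler()
df['distance_run_std_global'] = scaler.fit_transform(df[['distance_run']]).ravel()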

# Interpretation: z=+2 means "2 std above average ACROSS ALL MATCHES"

Pattern 5: Feature Type Processing Rules

python
# Binary variables (0/1) - KEEP AS-IS
binary_cols = ['is_ace', 'is_winner', 'is_error']
# ❌ NEVER standardize these! They have semantic meaning as 0/1

# Categorical variables - ONE-HOT ENCODE
df_encoded = pd.get_dummies(df, columns=['server', 'serve_number'], dtype=int)

# Continuous variables - STANDARDIZE (within-group or global)
continuous_cols = ['distance_run', 'rally_count', 'speed_mph']
# ✅ Apply pattern 3 or 4 based on goal

Pattern 6: Sliding Window Features (for Momentum)

python
# ✅ CORRECT: Sliding window for momentum analysis
window = 10

df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# ❌ WRONG: Cumulative features (loses temporal locality)
df['cumulative_points_won'] = df.groupby('match_id')['point_won'].cumsum()
# This just increases monotonically and correlates with point_number

Pattern 7: Data Quality Validation

python
def validate_data_quality(df, feature, expected_range):
    """Validate before processing"""
    # Check range
    assert df[feature].min() >= expected_range[0], f"{feature} below minimum"
    assert df[feature].max() <= expected_range[1], f"{feature} above maximum"

    # Check for anomalies
    mean = df[feature].mean()
    std = df[feature].std()

    if std > mean:
        print(f"⚠️ WARNING: {feature} has std > mean (highly skewed or errors)")

    # Check missing pattern: fraction of missing values within each group
    missing_by_group = df.groupby('match_id')[feature].apply(lambda x: x.isna().mean())
    if missing_by_group.max() > 0.5:
        print(f"⚠️ WARNING: {feature} has >50% missing in some groups")

# Example
validate_data_quality(df, 'speed_mph', expected_range=(50, 165))

Pattern 8: Detect Processing Scope Automatically

python
def detect_processing_scope(df, group_col, feature_col):
    """
    Recommend within-group vs global based on variance structure
    """
    # Calculate variance components
    within_group_var = df.groupby(group_col)[feature_col].var().mean()
    global_var = df[feature_col].var()

    # Intraclass correlation
    between_group_var = global_var - within_group_var
    icc = between_group_var / global_var

    if icc > 0.5:
        return 'within_group', f"High between-group variance (ICC={icc:.2f})"
    else:
        return 'global', f"Low between-group variance (ICC={icc:.2f})"

scope, reason = detect_processing_scope(df, 'match_id', 'distance_run')
print(f"Recommended: {scope} - {reason}")
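
The ICC in Pattern 8 is an approximation: subtracting the unweighted mean of per-group sample variances from the global variance only roughly isolates the between-group share, and the result can go negative. A sketch of an exact decomposition follows, based on the law of total variance with population variances (`ddof=0`) and group-size weights. The function name `variance_components` and the toy data are assumptions, not part of the original skill.

```python
import pandas as pd

def variance_components(df, group_col, feature_col):
    """Split the total (population) variance into within- and between-group parts."""
    grouped = df.groupby(group_col)[feature_col]

    # Weight each group by its share of the non-missing observations
    sizes = grouped.count()
    weights = sizes / sizes.sum()

    # Within: size-weighted average of each group's own variance
    within = (grouped.var(ddof=0) * weights).sum()

    # Between: size-weighted variance of the group means around the grand mean
    grand_mean = df[feature_col].mean()
    between = (weights * (grouped.mean() - grand_mean) ** 2).sum()

    return within, between


demo = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2, 2],
    "distance_run": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
within, between = variance_components(demo, "match_id", "distance_run")
icc = between / (within + between)
print(f"within={within:.1f}, between={between:.1f}, share between groups={icc:.2f}")
```

With this decomposition the two parts sum exactly to the total variance, so the between-group share always lies between 0 and 1. The 0.5 threshold used in Pattern 8 remains a heuristic either way.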

Pattern 9: Data Leakage Detection

python
def detect_data_leakage(df, target_col, feature_cols, id_cols):
    """
    Critical checks for data leakage and AI common pitfalls
    """
    issues = []

    # 1. ID Leakage: High cardinality variables as features
    for col in feature_cols:
        if col in id_cols:
            issues.append(f"❌ FATAL: {col} is an ID - NEVER use as feature")
            continue

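        # Check if looks like ID (>50% unique). Float columns are skipped:
        # continuous measurements are near-unique without being identifiers.
        uniqueness = df[col].nunique() / len(df)
        if uniqueness > 0.5 and df[col].dtype.kind != 'f':
            issues.append(f"⚠️ {col}: {uniqueness*100:.1f}% unique - possible ID leakage")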

    # 2. Causal Inversion: Perfect correlation with target
    for col in feature_cols:
        if col == target_col:
            continue
        if df[col].dtype in ['int64', 'float64']:
            corr = abs(df[[col, target_col]].corr().iloc[0, 1])
            if corr > 0.95:
                issues.append(f"❌ FATAL: {col} correlation={corr:.3f} - likely consequence of target!")

    # 3. Meaningless Numeric: Codes treated as numbers
    for col in feature_cols:
        if df[col].dtype in ['int64', 'float64']:
            # Pattern: High values, many uniques, looks like code
            if df[col].min() > 1000 and df[col].nunique() > 100:
                issues.append(f"⚠️ {col}: Looks like code (zipcode/ID) - should be categorical")

    # 4. Time Travel: Check if standardization used global statistics
    # (Requires knowing if train/test split was done first)

    # Print report
    if issues:
        print("="*60)
        print("DATA LEAKAGE AUDIT")
        print("="*60)
        for issue in issues:
            print(issue)
        print("="*60)
    else:
        print("✅ No obvious leakage detected")

    return issues

# Example usage
issues = detect_data_leakage(
    df,
    target_col='point_won',
    feature_cols=['speed_mph', 'user_id', 'distance_run'],
    id_cols=['match_id', 'user_id']
)
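
Check 4 (time travel) is left as a comment in the function above because it cannot be detected from the data alone. What avoiding it might look like is sketched below, assuming global scaling and a split by whole groups. The split itself (`train_ids`) and the synthetic data are hypothetical; only pandas and NumPy are used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "match_id": np.repeat([1, 2, 3, 4], 5),
    "distance_run": rng.normal(loc=20.0, scale=5.0, size=20),
})

# Split by whole groups first, so no match appears on both sides
train_ids = [1, 2, 3]  # hypothetical split; choose yours explicitly
is_train = df_demo["match_id"].isin(train_ids)

# Fit the scaling statistics on the training rows only
train_mean = df_demo.loc[is_train, "distance_run"].mean()
train_std = df_demo.loc[is_train, "distance_run"].std(ddof=0)

# Apply the same training statistics to every row, including test rows
df_demo["distance_run_scaled"] = (
    df_demo["distance_run"] - train_mean
) / train_std

train_scaled = df_demo.loc[is_train, "distance_run_scaled"]
test_scaled = df_demo.loc[~is_train, "distance_run_scaled"]
print(f"train: mean={train_scaled.mean():.3f}, std={train_scaled.std(ddof=0):.3f}")
print(f"test:  mean={test_scaled.mean():.3f}, std={test_scaled.std(ddof=0):.3f}")
```

The test rows will generally not have mean 0 and std 1 after scaling. That is expected: they were scaled with statistics they did not contribute to.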

Pattern 10: Distribution-Aware Scaling

python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import StandardScaler, RobustScaler

def smart_scaler_selection(df, col):
    """
    Choose scaler based on distribution characteristics
    """
    data = df[col].dropna()

    # Check distribution (scipy's kurtosis() returns excess kurtosis: ~0 for normal data)
    skewness = skew(data)
    kurt = kurtosis(data)

    print(f"{col}: skewness={skewness:.2f}, kurtosis={kurt:.2f}")

    if abs(skewness) < 0.5 and abs(kurt) < 3:
        # Roughly normal
        print("  → StandardScaler (data is roughly normal)")
        return StandardScaler(), None

    elif skewness > 1:
        # Right-skewed (long tail)
        print("  → Log transform + StandardScaler (right-skewed)")
        return StandardScaler(), 'log'

    else:
        # Heavy outliers or non-normal
        print("  → RobustScaler (heavy outliers)")
        return RobustScaler(), None

# Example usage
for col in continuous_features:
    scaler, transform = smart_scaler_selection(df, col)

    if transform == 'log':
        df[f'{col}_log'] = np.log1p(df[col])
        df[f'{col}_scaled'] = scaler.fit_transform(df[[f'{col}_log']])
    else:
        df[f'{col}_scaled'] = scaler.fit_transform(df[[col]])

Examples

Example 1: Tennis Match Preprocessing (Complete Pipeline)

Input:

  • CSV with 7,284 rows, 31 matches
  • Features: speed_mph, distance_run, rally_count, is_ace, server
  • Goal: Analyze momentum (relative intensity within each match)

Steps:

python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load and inspect
df = pd.read_csv('tennis_data.csv')
print(f"Matches: {df['match_id'].nunique()}")
print(f"Features: {df.dtypes}")

# 2. Classify features
binary_features = ['is_ace', 'is_winner', 'is_break_point']
categorical_features = ['server', 'serve_number']
continuous_features = ['distance_run', 'speed_mph', 'rally_count']

# 3. Validate data quality
for feat in continuous_features:
    print(f"\n{feat}:")
    print(df[feat].describe())
    # Check for impossible values
    if feat == 'speed_mph':
        assert df[feat].max() < 170, "Speed exceeds world record!"

# 4. Handle missing values (within-group)
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    for feat in continuous_features:
        if df.loc[mask, feat].isna().any():
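            # Simple linear interpolation within match; limit_direction='both'
            # also fills gaps at the start of a match with the nearest valid value
            df.loc[mask, feat] = df.loc[mask, feat].interpolate(
                method='linear', limit_direction='both'
            )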

# 5. One-hot encode categorical
df = pd.get_dummies(df, columns=categorical_features, dtype=int)

# 6. Standardize continuous features WITHIN each match
for feat in continuous_features:
    df[f'{feat}_std'] = np.nan
    for match_id in df['match_id'].unique():
        mask = df['match_id'] == match_id
        scaler = StandardScaler()
        df.loc[mask, f'{feat}_std'] = scaler.fit_transform(
            df.loc[mask, [feat]]
        ).ravel()  # flatten the (n, 1) result to match the 1-D column selection

# 7. Create sliding window features
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# 8. KEEP binary features as 0/1 (don't transform!)
# binary_features are already correct

print("\n✅ Preprocessing complete!")
print(f"Final shape: {df.shape}")
print(f"Standardized features: {[f for f in df.columns if f.endswith('_std')]}")

Expected output:

  • Binary features remain 0/1
  • Categorical features one-hot encoded (e.g., server_1, server_2)
  • Continuous features have both original and _std versions
  • _std features have mean≈0, std≈1 WITHIN each match
  • Sliding window features capture local momentum
  • No missing values
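
The expected-output bullets above can be turned into an automated audit. The following is a minimal sketch under stated assumptions: the function name `audit_expected_output`, its parameters, and the tolerance are hypothetical, and the toy frame stands in for the real preprocessed data.

```python
import pandas as pd

def audit_expected_output(df, group_col, binary_cols, std_cols, atol=1e-6):
    """Hypothetical audit of the expected-output bullets; returns failure messages."""
    failures = []

    # Binary features must still contain only 0 and 1
    for col in binary_cols:
        if not df[col].dropna().isin([0, 1]).all():
            failures.append(f"{col} is no longer 0/1")

    for col in std_cols:
        # No missing values may remain
        if df[col].isna().any():
            failures.append(f"{col} still has missing values")
        # The mean should be ~0 within every group
        group_means = df.groupby(group_col)[col].mean()
        if (group_means.abs() > atol).any():
            failures.append(f"{col} is not centered within each group")

    return failures


demo = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "is_ace": [0, 1, 1, 0],
    "speed_mph_std": [-1.0, 1.0, -1.0, 1.0],     # centered per match
    "distance_run_std": [-1.5, -0.5, 0.5, 1.5],  # centered globally only
})
failures = audit_expected_output(
    demo,
    "match_id",
    binary_cols=["is_ace"],
    std_cols=["speed_mph_std", "distance_run_std"],
)
print(failures)
```

In the toy data `distance_run_std` has a global mean of 0 but per-match means of -1 and +1, so the audit flags it as the kind of cross-group standardization this skill warns about.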

Example 2: Detecting Cross-Group Contamination

Input:

  • Preprocessed data where you suspect cross-group standardization

Steps:

python
# Check if standardization was done correctly
def check_within_group_standardization(df, group_col, feature_std_col):
    """
    Verify that standardized feature has mean≈0, std≈1 within each group
    """
    results = df.groupby(group_col)[feature_std_col].agg(['mean', 'std'])

    # Within-group standardization: each group should have mean≈0, std≈1
    if (results['mean'].abs() < 0.1).all() and (results['std'].between(0.9, 1.1)).all():
        print("✅ CORRECT: Within-group standardization detected")
        return True

    # Global standardization: groups will have varying means and stds
    else:
        print("❌ WRONG: Global standardization detected!")
        print("Group means:", results['mean'].values[:5])
        print("Group stds:", results['std'].values[:5])
        return False

check_within_group_standardization(df, 'match_id', 'distance_run_std')

Expected output:

  • CORRECT: All group means ≈ 0, all group stds ≈ 1
  • WRONG: Group means vary widely, indicating global standardization

Example 3: Fixing Cumulative Feature Error

Input:

  • Existing pipeline using cumulative sums for momentum

Steps:

python
# ❌ WRONG approach (existing code)
df['cumulative_wins'] = df.groupby('match_id')['point_won'].cumsum()

# Problem: This just counts total wins up to this point
# Doesn't capture recent momentum!

# ✅ CORRECT approach (fix)
# Replace cumulative with sliding window
window = 10
df['recent_win_rate'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# Compare
print("Cumulative (wrong):", df['cumulative_wins'].values[50:60])
print("Sliding window (correct):", df['recent_win_rate'].values[50:60])

# Cumulative: [25, 26, 26, 27, 28, ...] - monotonic
# Sliding window: [0.6, 0.7, 0.5, 0.6, ...] - fluctuates with momentum

Expected output:

  • Cumulative features removed
  • Sliding window features show local variations
  • Momentum analysis now captures short-term trends

References

  • references/index.md: Navigation and overview
  • references/error-case-studies.md: Real-world preprocessing disasters from tennis data
  • references/decision-trees.md: Full decision trees for all preprocessing choices
  • references/validation-checklist.md: Pre-processing validation checklist
  • references/ai-common-pitfalls.md: AI-specific errors (data leakage, semantic fallacies, distribution blindness)

Maintenance

⚠️ CRITICAL NOTICE: USER'S PERSONAL EXPERIENCE-BASED SKILL ⚠️

This skill is NOT theoretical - it's based on real project failures:

  • V1.0 disaster: 88.9% error rate, weeks of wasted work
  • V2.0 issues: Cross-group contamination, unreliable results
  • V3.0 success: All errors fixed, production-ready

Why this matters to you (Claude):

  • These are the EXACT errors AI agents commonly make
  • User has already paid the price for these mistakes
  • Ignoring this skill = repeating documented failures
  • Following this skill = learning from experience without pain

Authority level: HIGHEST

  • Based on user's hard-won lessons from actual project
  • Validated through multiple iterations (V1.0 → V2.0 → V3.0)
  • Every error documented with impact metrics
  • Every fix validated with comprehensive testing

Sources:

  • Primary: User's personal project (2024 MCM Problem C - Tennis Momentum Analysis)
  • Secondary: Statistical best practices for grouped data
  • Tertiary: Common AI preprocessing errors observed across domains

Mandatory consultation:

  • ⚠️ ALWAYS consult before, during, and after any data preprocessing
  • ⚠️ NEVER skip validation steps outlined in this skill
  • ⚠️ When in doubt, err on the side of caution (use this skill)

Last updated: 2026-01-18 (V1.1)

Known limits:

  • Assumes data fits in memory (not for big data infrastructure)
  • Focused on numeric/categorical features (text/image preprocessing partially covered)
  • Does not prescribe domain-specific feature engineering (user decides business logic)
  • Requires basic understanding of statistics (mean, std, correlation)
