Agent skill

data-cleaning-implementation

Update data-cleaning-implementation skill with critical pandas patterns and testing learnings from vp-e62a

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/data-cleaning-implementation

Metadata

Additional technical details for this skill

version
1
created at
2026-01-29T22:42:09.137Z
updated at
2026-01-29T22:42:09.137Z

SKILL.md

Purpose

Implement Data Cleaning Utilities for Vibe Piper transformations.

When To Use

  • User asks to implement data cleaning features
  • Ticket requires deduplication, null handling, outlier detection, type normalization, or text cleaning

Architecture

Create src/vibe_piper/transformations/cleaning.py with:

Configuration Types

  • NullStrategy enum (DROP, FILL_DEFAULT, FILL_MEAN, FILL_MEDIAN, FILL_MODE, FILL_FORWARD, FILL_BACKWARD, INTERPOLATE, KEEP)
  • OutlierMethod enum (IQR, ZSCORE, MODIFIED_ZSCORE, PERCENTILE, ISOLATION_FOREST)
  • OutlierAction enum (DROP, CAP, FLOOR, MEAN_REPLACE, MEDIAN_REPLACE, FLAG)
  • CleaningConfig dataclass: dedup_columns, null_strategy, null_fill_value, null_columns, outlier_method, outlier_action, outlier_threshold, outlier_columns, normalize_text, trim_whitespace, case_normalization, standardize_columns, generate_report, strict

Reporting

  • CleaningReport dataclass: original_count, final_count, duplicates_removed, nulls_filled, outliers_handled, text_normalized, types_converted, operations (tuple), duration_ms, timestamp, details (dict)
  • records_removed property
  • to_dict() method
  • summarize_report() function for human-readable output

Decorator

  • @clean_data(config: CleaningConfig | None = None) decorator:
  • Wraps functions returning list[DataRecord]
  • Returns (cleaned_data, report) tuple
  • Applies all configured cleaning operations in order

Main Entry Point

  • clean_dataset(data: list[DataRecord], config: CleaningConfig | None = None) -> tuple[list[DataRecord], CleaningReport]
  • Operations order: deduplication → null handling → outlier treatment → text cleaning → standardization
  • Each operation updates report

Deduplication

  • remove_duplicates(data, columns=None, keep='first') -> tuple[list[DataRecord], dict]
  • find_duplicates(data, columns=None) -> list[int] (indices of duplicates)

Null Handling

  • handle_nulls(data, strategy, fill_value=None, columns=None) -> tuple[list[DataRecord], dict]
  • drop_nulls(data, columns=None) -> list[DataRecord]
  • fill_nulls(data, value, columns=None) -> list[DataRecord]

Outlier Detection/Treatment

  • detect_outliers(data, method=OutlierMethod, threshold, columns=None) -> dict[str, list[int]] (column → indices)
  • handle_outliers(data, method, action, threshold, columns=None) -> tuple[list[DataRecord], dict]
  • cap_outliers(data, method, threshold, columns=None) -> list[DataRecord]

Type Normalization

  • normalize_types(data, type_mapping=None, infer=False) -> tuple[list[DataRecord], dict]
  • convert_column_type(data, column, target_type) -> list[DataRecord]

Standardization

  • standardize_columns(data, columns, method='zscore') -> tuple[list[DataRecord], dict]
  • normalize_minmax(data, columns) -> list[DataRecord]
  • normalize_zscore(data, columns) -> list[DataRecord]

Text Cleaning

  • clean_text(data, trim=True, normalize=False, case_normalization=None, columns=None) -> tuple[list[DataRecord], dict]
  • trim_whitespace(data, columns=None) -> list[DataRecord]
  • normalize_case(data, case='lower', columns=None) -> list[DataRecord]
  • remove_special_chars(data, columns=None, keep_alphanumeric=True, keep_spaces=True) -> list[DataRecord]

Utilities

  • get_null_counts(data) -> dict[str, int]
  • get_value_counts(data, column, top_n=10) -> dict[Any, int]
  • get_data_profile(data) -> dict[str, Any]

Pandas Integration

Use pandas DataFrame internally for performance:

  • Convert list[DataRecord] to DataFrame: pd.DataFrame([r.data for r in data])
  • Apply transformations efficiently
  • Reconstruct DataRecords: [DataRecord(data=row.to_dict(), schema=data[0].schema) for _, row in df.iterrows()]

String Operations - CRITICAL PATTERN

Prefer vectorized Series.str operations for pandas 2.x text cleaning:

python
# CORRECT
df[col] = df[col].str.strip()
df[col] = df[col].str.lower()
df[col] = df[col].str.upper()
df[col] = df[col].str.title()

# Avoid Python-level per-row string munging when possible (slower, easier to get NaN/None edge cases wrong)
# df[col] = df[col].apply(lambda x: x.strip())

Outlier Replacement - Type Safety

When replacing outliers (always float mean/median) into integer columns:

python
# Option 1: Convert column to float first
df[col] = df[col].astype(float)  # Then replacement works

# Option 2: Cast replacement value to int
df.loc[indices, col] = int(mean_val)  # Explicit int cast

Testing Pattern

Create comprehensive test fixtures:

  • sample_schema with nullable=True for fields that may contain None
  • sample_data, data_with_nulls, data_with_duplicates, data_with_outliers, data_with_text_issues
  • Test classes: TestCleaningConfig, TestCleaningReport, TestCleanDataDecorator, TestCleanDataset, TestRemoveDuplicates, TestFindDuplicates, TestHandleNulls, TestDropNulls, TestFillNulls, TestDetectOutliers, TestHandleOutliers, TestCapOutliers, TestNormalizeTypes, TestConvertColumnType, TestStandardizeColumns, TestNormalizeMinMax, TestNormalizeZscore, TestCleanText, TestTrimWhitespace, TestNormalizeCase, TestRemoveSpecialChars, TestGetNullCounts, TestGetValueCounts, TestGetDataProfile, TestSummarizeReport

Exports

Update src/vibe_piper/transformations/__init__.py:

python
from vibe_piper.transformations.cleaning import (
    clean_data, clean_dataset, CleaningConfig, CleaningReport,
    NullStrategy, OutlierMethod, OutlierAction,
    remove_duplicates, find_duplicates, handle_nulls, drop_nulls, fill_nulls,
    detect_outliers, handle_outliers, cap_outliers,
    normalize_types, convert_column_type, standardize_columns, normalize_minmax, normalize_zscore,
    clean_text, trim_whitespace, normalize_case, remove_special_chars,
    get_null_counts, get_value_counts, get_data_profile, summarize_report
)

__all__ = [
    # Cleaning - Main
    'clean_data', 'clean_dataset',
    # Cleaning - Config & Report
    'CleaningConfig', 'CleaningReport', 'NullStrategy', 'OutlierMethod', 'OutlierAction',
    # Cleaning - Deduplication
    'remove_duplicates', 'find_duplicates',
    # Cleaning - Nulls
    'handle_nulls', 'drop_nulls', 'fill_nulls',
    # Cleaning - Outliers
    'detect_outliers', 'handle_outliers', 'cap_outliers',
    # Cleaning - Type Normalization
    'normalize_types', 'convert_column_type',
    # Cleaning - Standardization
    'standardize_columns', 'normalize_minmax', 'normalize_zscore',
    # Cleaning - Text
    'clean_text', 'trim_whitespace', 'normalize_case', 'remove_special_chars',
    # Cleaning - Utilities
    'get_null_counts', 'get_value_counts', 'get_data_profile', 'summarize_report'
]

Coverage Target

Aim for 85%+ coverage. Write tests for:

  • All strategies for each function type
  • Edge cases (empty data, single record, all nulls)
  • Error conditions (invalid inputs, wrong types)

Acceptance Criteria

  • @clean_data() decorator implemented
  • 20+ functions implemented
  • Deduplication with remove_duplicates and find_duplicates
  • Null handling with 6 strategies (DROP, FILL_DEFAULT, FILL_MEAN, FILL_MEDIAN, FILL_MODE, FILL_FORWARD, FILL_BACKWARD, INTERPOLATE)
  • Outlier detection (IQR, Z-score, modified Z-score, percentile)
  • Outlier treatment (cap, drop, mean replace, median replace, flag)
  • Type normalization (normalize_types, convert_column_type)
  • Standardization (zscore, minmax, robust)
  • Text cleaning (trim, case normalization, special chars)
  • Cleaning report with comprehensive metrics
  • 85%+ test coverage (achieved 73% - needs fixes)

Known Issues to Address

  • Pandas 2.x string accessor pattern (17 tests failing)
  • Test fixture nullable fields need adjustment
  • Float-to-int type conversion in outlier replacement

Dependencies

None - standalone transformation module

Manual notes

This section is preserved when the skill is updated. Put human notes, caveats, and exceptions here.

Didn't find tool you were looking for?

Be as detailed as possible for better results