Agent skills
senior-data-scientist

Agent skill

senior-data-scientist

View SKILL.md on GitHub Repository

Stars 71

Forks 21

Install this agent skill to your Project

npx add-skill https://github.com/borghei/Claude-Skills/tree/main/engineering/senior-data-scientist

Metadata

Additional technical details for this skill

tags: data-science ml statistics experimentation python mlops
author: borghei
domain: data-science
updated: 1774915200
version: 1.0.0
category: engineering

SKILL.md

Senior Data Scientist

Expert data science for statistical modeling, experimentation, ML deployment, and data-driven decision making.

Keywords

data-science, machine-learning, statistics, a-b-testing, causal-inference, feature-engineering, mlops, experiment-design, model-deployment, python, scikit-learn, pytorch, tensorflow, spark, airflow

Quick Start

bash

# Design an experiment with power analysis
python scripts/experiment_designer.py --input data/ --output results/

# Run feature engineering pipeline
python scripts/feature_engineering_pipeline.py --target project/ --analyze

# Evaluate model performance
python scripts/model_evaluation_suite.py --config config.yaml --deploy

# Statistical analysis
python scripts/statistical_analyzer.py --data input.csv --test ttest --output report.json

Tools

Script	Purpose
`scripts/experiment_designer.py`	A/B test design, power analysis, sample size calculation
`scripts/feature_engineering_pipeline.py`	Automated feature generation, correlation analysis, feature selection
`scripts/statistical_analyzer.py`	Hypothesis testing, causal inference, regression analysis
`scripts/model_evaluation_suite.py`	Model comparison, cross-validation, deployment readiness checks

Tech Stack

Category	Tools
Languages	Python, SQL, R, Scala
ML Frameworks	PyTorch, TensorFlow, Scikit-learn, XGBoost
Data Processing	Spark, Airflow, dbt, Kafka, Databricks
Deployment	Docker, Kubernetes, AWS SageMaker, GCP Vertex AI
Experiment Tracking	MLflow, Weights & Biases
Databases	PostgreSQL, BigQuery, Snowflake, Pinecone

Workflow 1: Design and Analyze an A/B Test

Define hypothesis -- State the null and alternative hypotheses. Identify the primary metric (e.g., conversion rate, revenue per user).
Calculate sample size -- python scripts/experiment_designer.py --input data/ --output results/
- Specify minimum detectable effect (MDE), significance level (alpha=0.05), and power (0.80).
- Example: For baseline conversion 5%, MDE 10% relative lift, need ~31,000 users per variant.
Randomize assignment -- Use hash-based assignment on user ID for deterministic, reproducible splits.
Run experiment -- Monitor for sample ratio mismatch (SRM) daily. Flag if observed ratio deviates >1% from expected.

Analyze results:

python

from scipy import stats

# Two-proportion z-test for conversion rates
control_conv = control_successes / control_total
treatment_conv = treatment_successes / treatment_total
z_stat, p_value = stats.proportions_ztest(
    [treatment_successes, control_successes],
    [treatment_total, control_total],
    alternative='two-sided'
)
# Reject H0 if p_value < 0.05

Validate -- Check for novelty effects, Simpson's paradox across segments, and pre-experiment balance on covariates.

Workflow 2: Build a Feature Engineering Pipeline

Profile raw data -- python scripts/feature_engineering_pipeline.py --target project/ --analyze
- Identify null rates, cardinality, distributions, and data types.
Generate candidate features:
- Temporal: day-of-week, hour, recency, frequency, monetary (RFM)
- Aggregation: rolling means/sums over 7d/30d/90d windows
- Interaction: ratio features, polynomial combinations
- Text: TF-IDF, embedding vectors
Select features -- Remove features with >95% null rate, near-zero variance, or >0.95 pairwise correlation. Use recursive feature elimination or SHAP importance.
Validate -- Confirm no target leakage (no features derived from post-outcome data). Check train/test distribution alignment.
Register -- Store features in feature store with versioning and lineage metadata.

Workflow 3: Train and Evaluate a Model

Split data -- Stratified train/validation/test split (70/15/15). For time series, use temporal split (no future leakage).
Train baseline -- Start with a simple model (logistic regression, gradient boosted trees) to establish a benchmark.
Tune hyperparameters -- Use Optuna or cross-validated grid search. Log all runs to MLflow.

Evaluate on held-out test set:

python

from sklearn.metrics import classification_report, roc_auc_score

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")

Validate -- Check calibration (predicted probabilities match observed rates). Evaluate fairness metrics across protected groups. Confirm no overfitting (train vs test gap <5%).

Workflow 4: Deploy a Model to Production

Containerize -- Package model with inference dependencies in Docker:
bash
```
docker build -t model-service:v1 .
```
Set up serving -- Deploy behind a REST API with health check, input validation, and structured error responses.
Configure monitoring:
- Input drift: compare incoming feature distributions to training baseline (KS test, PSI)
- Output drift: monitor prediction distribution shifts
- Performance: track latency P50/P95/P99 targets (<50ms / <100ms / <200ms)
Enable canary deployment -- Route 5% traffic to new model, compare metrics against baseline for 24-48 hours.
Validate -- python scripts/model_evaluation_suite.py --config config.yaml --deploy confirms serving latency, error rate <0.1%, and model outputs match offline evaluation.

Workflow 5: Perform Causal Inference

Assess assignment mechanism -- Determine if treatment was randomized (use experiment analysis) or observational (use causal methods below).
Select method based on data structure:
- Propensity Score Matching: when treatment is binary, many covariates available
- Difference-in-Differences: when pre/post data available for treatment and control groups
- Regression Discontinuity: when treatment assigned by threshold on running variable
- Instrumental Variables: when unobserved confounding present but valid instrument exists
Check assumptions -- Parallel trends (DiD), overlap/positivity (PSM), continuity (RDD).
Estimate treatment effect and compute confidence intervals.
Validate -- Run placebo tests (apply method to pre-treatment period, expect null effect). Sensitivity analysis for unobserved confounding.

Performance Targets

Metric	Target
P50 latency	< 50ms
P95 latency	< 100ms
P99 latency	< 200ms
Throughput	> 1,000 req/s
Availability	99.9%
Error rate	< 0.1%

Common Commands

bash

# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py

Reference Documentation

Document	Path
Statistical Methods	references/statistical_methods_advanced.md
Experiment Design Frameworks	references/experiment_design_frameworks.md
Feature Engineering Patterns	references/feature_engineering_patterns.md
Automation Scripts	`scripts/` directory

Troubleshooting

Problem	Cause	Solution
Sample size calculation returns unreasonably large numbers	Minimum detectable effect (MDE) is set too small relative to baseline variance	Increase MDE to a practically meaningful threshold or accept longer experiment duration
Feature pipeline reports high null rates across all generated features	Source data contains upstream ingestion gaps or schema drift	Validate raw data completeness before running the pipeline; check ETL logs for failed loads
Model AUC drops significantly on validation vs. training set	Overfitting due to high-cardinality features or insufficient regularization	Apply stronger regularization, reduce feature set, or increase training data volume
Experiment shows significant results but large confidence intervals	Insufficient sample size or high metric variance	Extend experiment runtime, increase traffic allocation, or switch to a variance-reduction technique (CUPED)
Deployed model latency exceeds P95 targets	Model complexity too high for serving infrastructure or missing batching	Quantize the model, reduce input feature count, or enable request batching on the serving layer
Feature importance scores are unstable across cross-validation folds	Correlated features cause importance to shift between redundant predictors	Remove highly correlated features (>0.95) before training or use permutation importance with repeated runs
Causal inference estimates show implausible treatment effects	Violation of parallel trends assumption (DiD) or poor covariate overlap (PSM)	Run diagnostic tests (placebo checks, overlap histograms) and consider alternative identification strategies

Success Criteria

Model discrimination: AUC-ROC above 0.85 on held-out test set for classification tasks
Calibration quality: Brier score below 0.15; predicted probabilities within 5% of observed rates across decile bins
Feature coverage: Feature importance analysis accounts for at least 90% of cumulative model importance
Experiment power: All A/B tests designed with statistical power of 0.80 or higher at the specified MDE
Deployment readiness: Model serving latency under 100ms at P95; error rate below 0.1%
Reproducibility: All experiments and training runs logged with full parameter tracking; results reproducible from logged artifacts
Data quality: Input feature pipelines maintain less than 2% null rate on critical features after imputation

Scope & Limitations

This skill covers:

End-to-end experiment design including power analysis, randomization, and post-hoc analysis
Feature engineering pipelines with profiling, generation, selection, and validation
Model training evaluation including cross-validation, calibration, and fairness checks
Production model deployment with monitoring, drift detection, and canary rollouts

This skill does NOT cover:

Data engineering infrastructure (ETL orchestration, pipeline scheduling, data lake management) -- see senior-data-engineer
Deep learning model architecture design and training at scale (distributed GPU training, custom layers) -- see senior-ml-engineer
Prompt engineering, RAG systems, and LLM fine-tuning workflows -- see senior-prompt-engineer
Computer vision pipelines (object detection, segmentation, video processing) -- see senior-computer-vision

Integration Points

Skill	Integration	Data Flow
`senior-data-engineer`	Feature pipeline ingests data from ETL outputs; shares data quality validation patterns	Raw data stores --> feature engineering pipeline --> feature store
`senior-ml-engineer`	Trained models handed off for MLOps deployment; shares model registry and serving configs	Evaluated model artifacts --> deployment pipeline --> production serving
`senior-prompt-engineer`	Embedding features from LLMs feed into ML pipelines; experiment frameworks apply to prompt A/B tests	LLM embeddings --> feature vectors; experiment designs --> prompt evaluation
`senior-architect`	Model serving architecture reviewed for scalability; data platform design aligned with training infrastructure	Architecture specs --> deployment topology --> monitoring dashboards
`senior-backend`	Model inference endpoints integrated into backend services; API contracts defined for prediction requests	REST/gRPC model API --> backend service layer --> client applications
`senior-devops`	CI/CD pipelines extended for model retraining triggers; containerized model images deployed via infrastructure-as-code	Docker images --> Kubernetes manifests --> production clusters

Tool Reference

experiment_designer.py

Purpose: A/B test design, statistical power analysis, and sample size calculation. Validates experiment configuration and produces structured results with timestamps.

Usage:

bash

python scripts/experiment_designer.py --input data/ --output results/

Flags/Parameters:

Flag	Short	Required	Description
`--input`	`-i`	Yes	Input path (directory or file containing experiment data)
`--output`	`-o`	Yes	Output path (directory for results)
`--config`	`-c`	No	Path to configuration file (YAML or JSON)
`--verbose`	`-v`	No	Enable verbose (DEBUG-level) logging output

Example:

bash

python scripts/experiment_designer.py -i data/experiment_config/ -o results/power_analysis/ -c config.yaml -v

Output Format: JSON to stdout with the following structure:

json

{
  "status": "completed",
  "start_time": "2026-03-21T10:00:00.000000",
  "processed_items": 0,
  "end_time": "2026-03-21T10:00:01.000000"
}

feature_engineering_pipeline.py

Purpose: Automated feature generation, correlation analysis, and feature selection. Profiles raw data, generates candidate features, and validates for target leakage.

Usage:

bash

python scripts/feature_engineering_pipeline.py --input data/ --output features/

Flags/Parameters:

Flag	Short	Required	Description
`--input`	`-i`	Yes	Input path (directory or file containing raw data)
`--output`	`-o`	Yes	Output path (directory for generated features)
`--config`	`-c`	No	Path to configuration file (YAML or JSON)
`--verbose`	`-v`	No	Enable verbose (DEBUG-level) logging output

Example:

bash

python scripts/feature_engineering_pipeline.py -i data/raw/ -o features/v2/ -v

Output Format: JSON to stdout with the following structure:

json

{
  "status": "completed",
  "start_time": "2026-03-21T10:00:00.000000",
  "processed_items": 0,
  "end_time": "2026-03-21T10:00:01.000000"
}

model_evaluation_suite.py

Purpose: Model comparison, cross-validation, and deployment readiness checks. Validates serving latency, error rates, and confirms model outputs match offline evaluation.

Usage:

bash

python scripts/model_evaluation_suite.py --input models/ --output evaluation/

Flags/Parameters:

Flag	Short	Required	Description
`--input`	`-i`	Yes	Input path (directory or file containing model artifacts)
`--output`	`-o`	Yes	Output path (directory for evaluation results)
`--config`	`-c`	No	Path to configuration file (YAML or JSON)
`--verbose`	`-v`	No	Enable verbose (DEBUG-level) logging output

Example:

bash

python scripts/model_evaluation_suite.py -i models/xgb_v3/ -o evaluation/report/ -c prod_config.yaml

Output Format: JSON to stdout with the following structure:

json

{
  "status": "completed",
  "start_time": "2026-03-21T10:00:00.000000",
  "processed_items": 0,
  "end_time": "2026-03-21T10:00:01.000000"
}

Note: The Tools table references scripts/statistical_analyzer.py but this script does not yet exist in the repository. Statistical analysis workflows described in the SKILL.md can be performed using inline Python (scipy, statsmodels) as shown in Workflow 1 and Workflow 5.

Maintainer

borghei Core maintainer

Source details

Full Name: borghei/Claude-Skills
Branch: main
Path in repo: engineering/senior-data-scientist
License: Other
Topics: claude-code automation ai-agents cursor developer-tools agentic-coding github-copilot prompt-engineering llm python ai-coding-assistant ai-skills windsurf openai-codex compliance-automation eu-ai-act gdpr-compliance iso-27001 role-based-agents soc2

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

borghei/Claude-Skills

churn-prevention

SaaS churn reduction covering cancel flow design, dynamic save offers, exit survey architecture, dunning sequences, payment recovery, win-back campaigns, and churn impact modeling.

71 21

Explore

borghei/Claude-Skills

popup-cro

Popup and modal optimization for conversion. Covers exit-intent, slide-ins, banners, timing optimization, frequency capping, audience targeting, compliance, and A/B testing frameworks for lead capture, promotions, and announcements.

71 21

Explore

borghei/Claude-Skills

competitor-alternatives

Competitor comparison and alternative page creation for SEO and sales enablement. Covers 4 page formats (singular alternative, plural alternatives, vs pages, competitor vs competitor), content architecture, research methodology, and centralized competitor data management.

71 21

Explore

borghei/Claude-Skills

contract-and-proposal-writer

Generate production-ready business documents including freelance contracts, project proposals, SOWs, NDAs, and MSAs with jurisdiction-aware clauses. Covers US (Delaware), EU (GDPR), UK, and DACH (German law) legal frameworks. Includes contract templates, clause libraries, and DOCX conversion. Use when starting client engagements, writing proposals, drafting partnership agreements, or needing GDPR-compliant data processing addenda.

71 21

Explore

borghei/Claude-Skills

pricing-strategy

SaaS pricing design and optimization covering value metric selection, tier architecture, price point research, pricing page design, price increase execution, and competitive pricing analysis.

71 21

Explore

borghei/Claude-Skills

referral-program

Referral and affiliate program design covering referral loop architecture, incentive design, trigger moment optimization, viral coefficient modeling, affiliate program structure, and optimization playbook.

71 21

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

Metadata

SKILL.md

Senior Data Scientist

Keywords

Quick Start

Tools

Tech Stack

Workflow 1: Design and Analyze an A/B Test

Workflow 2: Build a Feature Engineering Pipeline

Workflow 3: Train and Evaluate a Model

Workflow 4: Deploy a Model to Production

Workflow 5: Perform Causal Inference

Performance Targets

Common Commands

Reference Documentation

Troubleshooting

Success Criteria

Scope & Limitations

Integration Points

Tool Reference

experiment_designer.py

feature_engineering_pipeline.py

model_evaluation_suite.py

Recommended Agent Skills

churn-prevention

popup-cro

competitor-alternatives

contract-and-proposal-writer

pricing-strategy

referral-program