Agent skill
data-validation-join-explosion
Sub-skill of data-validation: Join Explosion (+6).
Install this agent skill to your Project
npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/_archive/data/analytics/data-validation/join-explosion
SKILL.md
Join Explosion (+6)
Join Explosion
The problem: A many-to-many join silently multiplies rows, inflating counts and sums.
How to detect:
-- Check row count before and after join
SELECT COUNT(*) FROM table_a; -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id; -- 3,500 (uh oh)
How to prevent:
- Always check row counts after joins
- If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
- Use
COUNT(DISTINCT a.id)instead ofCOUNT(*)when counting entities through joins
Survivorship Bias
The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
- Analyzing user behavior of "current users" misses churned users
- Looking at "companies using our product" ignores those who evaluated and left
- Studying properties of "successful" outcomes without "unsuccessful" ones
How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.
Incomplete Period Comparison
The problem: Comparing a partial period to a full period.
Examples:
- "January revenue is $500K vs. December's $800K" -- but January isn't over yet
- "This week's signups are down" -- checked on Wednesday, comparing to a full prior week
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
Denominator Shifting
The problem: The denominator changes between periods, making rates incomparable.
Examples:
- Conversion rate improves because you changed how you count "eligible" users
- Churn rate changes because the definition of "active" was updated
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.
Average of Averages
The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
- Group A: 100 users, average revenue $50
- Group B: 10 users, average revenue $200
- Wrong: Average of averages = ($50 + $200) / 2 = $125
- Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
Timezone Mismatches
The problem: Different data sources use different timezones, causing misalignment.
Examples:
- Event timestamps in UTC vs. user-facing dates in local time
- Daily rollups that use different cutoff times
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
Selection Bias in Segmentation
The problem: Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
- "Users who completed onboarding have higher retention" -- obviously, they self-selected
- "Power users generate more revenue" -- they became power users BY generating revenue
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
gsd-complete-milestone
Archive completed milestone and prepare for next version
gsd-reapply-patches
Reapply local modifications after a GSD update
gsd-verify-work
Validate built features through conversational UAT
gsd-thread
Manage persistent context threads for cross-session work
clinical-trial-protocol
Generate clinical trial protocols for medical devices or drugs through a modular, waypoint-based architecture with research-only and full protocol modes.
single-cell-rna-qc
Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations.
Didn't find tool you were looking for?