Agent skill

data-validation-join-explosion

Sub-skill of data-validation: Join Explosion (+6).

Stars 4
Forks 4

Install this agent skill to your Project

npx add-skill https://github.com/vamseeachanta/workspace-hub/tree/main/.claude/skills/_archive/data/analytics/data-validation/join-explosion

SKILL.md

Join Explosion (+6)

Join Explosion

The problem: A many-to-many join silently multiplies rows, inflating counts and sums.

How to detect:

sql
-- Check row count before and after join
SELECT COUNT(*) FROM table_a;  -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id;  -- 3,500 (uh oh)

How to prevent:

  • Always check row counts after joins
  • If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
  • Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins

Survivorship Bias

The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.

Examples:

  • Analyzing user behavior of "current users" misses churned users
  • Looking at "companies using our product" ignores those who evaluated and left
  • Studying properties of "successful" outcomes without "unsuccessful" ones

How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.

Incomplete Period Comparison

The problem: Comparing a partial period to a full period.

Examples:

  • "January revenue is $500K vs. December's $800K" -- but January isn't over yet
  • "This week's signups are down" -- checked on Wednesday, comparing to a full prior week

How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.

Denominator Shifting

The problem: The denominator changes between periods, making rates incomparable.

Examples:

  • Conversion rate improves because you changed how you count "eligible" users
  • Churn rate changes because the definition of "active" was updated

How to prevent: Use consistent definitions across all compared periods. Note any definition changes.

Average of Averages

The problem: Averaging pre-computed averages gives wrong results when group sizes differ.

Example:

  • Group A: 100 users, average revenue $50
  • Group B: 10 users, average revenue $200
  • Wrong: Average of averages = ($50 + $200) / 2 = $125
  • Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64

How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.

Timezone Mismatches

The problem: Different data sources use different timezones, causing misalignment.

Examples:

  • Event timestamps in UTC vs. user-facing dates in local time
  • Daily rollups that use different cutoff times

How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.

Selection Bias in Segmentation

The problem: Segments are defined by the outcome you're measuring, creating circular logic.

Examples:

  • "Users who completed onboarding have higher retention" -- obviously, they self-selected
  • "Power users generate more revenue" -- they became power users BY generating revenue

How to prevent: Define segments based on pre-treatment characteristics, not outcomes.

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results