Agent skills
root-cause-diagnosis

Agent skill

root-cause-diagnosis

Use when metrics drop or spike unexpectedly - systematically investigates using 4-dimension segmentation (People, Geography, Technology, Time), intrinsic vs extrinsic factors, and hypothesis table method to identify root cause

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/root-cause-diagnosis

SKILL.md

Root Cause Diagnosis

Purpose

Systematically narrow down the root cause of unexpected metric changes by segmenting data, distinguishing internal vs. external factors, and testing hypotheses against observed patterns. Prevents jumping to conclusions and ensures evidence-based decision-making.

When to Use This Skill

Activate automatically when:

Key metrics drop or spike unexpectedly
A/B test results show unexpected patterns
User behavior changes suddenly
Post-mortem analysis needed after incident
Leadership asks "what happened?"
metric-diagnosis workflow investigates metric changes
Need to validate suspected causes

When NOT to use:

Change was expected (planned feature release, known seasonality)
Metric is within normal variance
Root cause is already confirmed
Data quality issue is obvious (reporting broken)

The Iron Law

ASK CLARIFYING QUESTIONS FIRST - DON'T GUESS

Most metric problems are more narrow than initially stated. The problem might only affect:

Specific user segments
Certain geographies
Particular platforms
Limited time windows

Narrow the scope before hypothesizing causes.

The 4 Critical Dimensions

Always segment data across these dimensions:

Dimension 1: People (User Segments)

Questions to ask:

Does it affect all users equally?
Specific age demographics?
New vs. returning users?
Free vs. paid users?
Business vs. consumer users?
Power users vs. casual users?
Specific personas or cohorts?

Example: "Snapchat usage down 10%" → Actually: "Down among kids in Australia, not adults or other regions"

Dimension 2: Geography

Questions to ask:

Specific countries affected?
Certain cities or regions?
Urban vs. rural areas?
Climate zones?
Time zones?
Regulatory regions (GDPR, etc.)?

Example: "Square usage drops steadily" → Actually: "Only in Boston in fall, not Atlanta" (weather: farmers markets close)

Dimension 3: Technology (Platform)

Questions to ask:

iOS vs. Android?
Web vs. mobile app?
Desktop vs. mobile web?
Specific browser versions?
Operating system versions?
Device types (phone, tablet)?
Network types (WiFi, cellular)?

Example: "Microsoft Teams downloads down 50%" → Actually: "Web and desktop, not mobile; all browsers equally"

Dimension 4: Time

Questions to ask:

When did it start exactly?
Is it constant or fluctuating?
Day of week patterns?
Time of day patterns?
Seasonal effects?
Holiday impacts?
Gradual decline or sudden drop?

Example: "Tinder usage dropped" → Actually: "Among professionals in metro areas overnight" (competing app launched)

Intrinsic vs. Extrinsic Factors

After narrowing scope, brainstorm potential causes in both categories:

Intrinsic Factors (Internal)

Definition: Changes you or your company made

Examples:

Bugs introduced
New features deployed
UI/UX changes
Algorithm adjustments
A/B experiments running
Policy changes
Pricing changes
Server performance issues
Cross-team feature launches

Investigation approach:

Review recent releases (past week/month)
Check A/B test logs
Talk to engineering about deployments
Review other team launches
Check server monitoring

Extrinsic Factors (External)

Definition: Changes in the world outside your control

Examples:

Competitor actions: New app launched, pricing change, marketing campaign
Economic factors: Recession, inflation, job market
Seasonal patterns: Weather, holidays, school schedules
External events: News events, cultural moments, pandemics
Market shifts: User behavior trends, platform changes
Regulatory changes: New laws, compliance requirements

Investigation approach:

Monitor competitor announcements
Review industry news
Check economic indicators
Consider calendar/seasonal effects
Talk to sales and customer success teams

The Hypothesis Table Method

Purpose: Test multiple hypotheses against data systematically

Step 1: Create Table Structure

Columns: Data points you observed

"Kids usage DOWN"
"Adults usage UNCHANGED"
"US affected"
"Europe unaffected"
"iOS affected"
"Android unaffected"

Rows: Potential causes (2-3 intrinsic, 2-3 extrinsic)

Step 2: Fill in Predictions

For each cause-data intersection:

↑ = This cause would increase this metric
↓ = This cause would decrease this metric
→ = This cause wouldn't affect this metric
? = Uncertain prediction

Step 3: Match Patterns

Look for rows where ALL predictions match observed data:

If predictions align with observations → Possible cause
If predictions contradict observations → Rule out

Step 4: Request Differentiating Data

If multiple causes fit:

Ask for additional data points that differ between hypotheses
Narrow to single cause through additional segmentation

Example Table: Snapchat Usage Down

Cause	Kids ↓	Adults →	US ↓	EU →	iOS ↓	Android →
iOS bug in latest update	✓ ?	✓ ?	✗ All	✗ All	✓ YES	✓ NO
School started (US)	✓ YES	✓ NO	✓ YES	✓ NO	✗ All	✗ All
TikTok marketing to teens	✓ YES	✓ NO	✗ Global	✗ Global	✗ All	✗ All

Pattern match: "School started (US)" fits all observations → Likely cause

Systematic Investigation Sequence

Follow this order to maximize efficiency:

Phase 1: Data Quality Verification (5 minutes)

Before investigating product issues, confirm data is real:

Check reporting systems working
Compare multiple data sources
Verify tracking not broken
Confirm sample sizes sufficient

Red flags:

Metric went to exactly zero (likely tracking broken)
Change happened at exact midnight (likely pipeline issue)
Only affects one data source

Phase 2: Narrow the Scope (15 minutes)

Ask 4-dimension clarifying questions:

People: Which user segments?
Geography: Which locations?
Technology: Which platforms?
Time: When exactly, what pattern?

Output: Narrow problem statement (not "usage down" but "iOS teen usage in US down 10% starting Sept 1")

Phase 3: Internal Investigation (30 minutes)

Check intrinsic factors first (easier to fix):

Recent releases by your team (past week)
A/B experiments running
Configuration changes
Cross-team launches affecting you
Infrastructure/performance issues

Stakeholders to consult:

Engineering: Recent deploys, bugs reported
Data: Tracking changes, pipeline issues
Product: Other team launches

Phase 4: External Investigation (30 minutes)

Check extrinsic factors if internal ruled out:

Competitor news and launches
Industry events and trends
Seasonal/calendar effects
Economic indicators
External PR or news

Stakeholders to consult:

Sales: Customer feedback, competitive intel
Marketing: Campaign timing, market trends
Customer Success: Support tickets, user complaints

Phase 5: Hypothesis Testing (Ongoing)

Build and test hypotheses:

List 4-6 potential causes (mix intrinsic/extrinsic)
Create hypothesis table
Predict impact on each data segment
Match predictions to observations
Request additional data to differentiate
Narrow to 1-2 most likely causes

Common Diagnostic Patterns

Pattern 1: Gradual Decline Over Weeks

Likely causes:

Competitor gaining market share
Degrading product experience (tech debt, performance)
Seasonal trend
User behavior shift

Investigation:

Compare to competitor growth rates
Check performance metrics trends
Review year-over-year seasonality
Survey users about satisfaction

Pattern 2: Sudden Overnight Drop

Likely causes:

Bug introduced in recent release
Breaking change in API/integration
Competitor launched major feature
External event (news, regulation)

Investigation:

Review deployments in past 24-48 hours
Check error logs and monitoring
Scan news and competitor announcements
Verify tracking still functional

Pattern 3: Recurring Pattern (Weekly, Daily)

Likely causes:

Usage behavior patterns (weekday vs. weekend)
Scheduled batch processes
Time zone effects
Cultural/lifestyle patterns

Investigation:

Compare to historical patterns (is this new?)
Check across multiple weeks
Segment by geography and user type

Pattern 4: Segment-Specific Issue

Likely causes:

Platform-specific bug
Demographic-targeted competitor
Localized event or regulation
A/B test gone wrong

Investigation:

Reproduce issue on affected segment
Check A/B test assignments
Review segment-specific deployments
Talk to affected users directly

Stakeholder Consultation Strategy

Purpose: Gather insights from teams with different data access

When to Talk to Sales

Best for:

Enterprise customer behavior changes
Competitive intelligence
Renewal and churn patterns
Feature request trends

Questions to ask:

"Any customers mentioning this issue?"
"Have you heard about competitor changes?"
"Any recent losses to specific competitors?"

When to Talk to Customer Success

Best for:

Support ticket trends
User sentiment and feedback
Onboarding friction points
Feature adoption patterns

Questions to ask:

"Any spike in support tickets?"
"What are users saying about [feature]?"
"Any common complaints this week?"

When to Talk to Marketing

Best for:

Campaign timing and impact
Promotional effects
Brand perception changes
Acquisition channel performance

Questions to ask:

"Any campaigns end recently?"
"Changes in acquisition costs?"
"New channels tested?"

When to Talk to Engineering

Best for:

Recent deployments and changes
Performance and reliability issues
A/B test implementations
Infrastructure events

Questions to ask:

"Any deploys in past 48 hours?"
"Performance regressions?"
"Error rate changes?"

Common Mistakes

Mistake	Fix
Jumping to conclusions without data	Ask 4-dimension questions first
Only considering intrinsic factors	Always brainstorm extrinsic factors too
Accepting first hypothesis that fits	Test multiple hypotheses systematically
Not validating data quality	Check reporting systems before investigating
Ignoring stakeholder insights	Consult sales, CS, engineering teams
Binary thinking (ship/rollback only)	Explore mitigation strategies first

Anti-Rationalization Blocks

Rationalization	Reality
"It's obviously a bug"	Could be competitor, seasonality, user behavior
"Must be the release yesterday"	Ask segmentation questions before assuming
"This only happens to us"	Check if competitors experiencing same
"Too complicated to investigate"	Systematic process makes it manageable
"Let's just rollback everything"	Understand cause before reversing changes

Success Criteria

Root cause diagnosis succeeds when:

Problem narrowed to specific segments (4 dimensions)
4-6 hypotheses brainstormed (intrinsic + extrinsic)
Hypothesis table created with predictions
Patterns matched against observed data
Most likely cause identified (or top 2-3 candidates)
Data quality verified
Stakeholders consulted for insights
Recommended next steps clear (fix, monitor, mitigate)

Real-World Examples

Example 1: Microsoft Teams Downloads Down 50%

Initial statement: "New app downloads down 50% from last week"

Phase 1: Narrow the scope

Geography: Predominantly US and Europe
Users: Enterprise customers, not consumers
Platform: Web and desktop primarily, some mobile
Time: Started last week, consistent since
Other metrics: Usage unchanged (existing users still active)

Phase 2: Hypothesis brainstorming

Intrinsic factors:

Bug in download flow (dismissed - all browsers affected equally)
Recent release broke something (dismissed - no recent release)
A/B test reducing visibility (dismissed - not running experiments)

Extrinsic factors:

Marketing promotion ended (MATCH!)
Competitor launch (dismissed - no major launches)
Economic downturn (dismissed - usage unchanged)

Phase 3: Investigation

Talk to marketing: "We were running 50% off promotion for enterprise customers switching from Slack"
Promotion ended last weekend
Downloads during promo = artificially inflated baseline
Current downloads = normal baseline

Root cause: End of promotional campaign created false "decline" (actually return to normal)

Action: None needed - metric will stabilize. Use normal baseline for future comparisons.

Example 2: Snapchat Usage Drop

Initial statement: "Snapchat usage down 10%"

Phase 1: Narrow the scope

People: Kids affected, adults unchanged
Geography: US affected, Europe unaffected
Platform: Both iOS and Android
Time: Started September 1st

Phase 2: Hypothesis table

Cause	Kids ↓	Adults →	US ↓	EU →
iOS bug	✗ Both	✗ Both	✗ All	✗ All
School started	✓ YES	✓ NO	✓ YES	✗ Sept starts later
Competitor	✗ Both	✗ Both	✗ Global	✗ Global

Root cause: US school year started September 1st

Kids have less free time during school
Adults unaffected (not in school)
Europe schools start later (unaffected in early Sept)

Action: Monitor through September to confirm seasonal pattern. Compare to previous years. No product changes needed - expected seasonal variation.

Example 3: Tinder Usage Drop Among Professionals

Initial statement: "Tinder usage declined"

Phase 1: Narrow the scope

People: College-educated professionals
Geography: Major metro areas (NYC, SF, Chicago)
Platform: All platforms
Time: Overnight drop

Phase 2: Hypothesis brainstorming

Intrinsic: Nothing deployed, no bugs, no pricing changes

Extrinsic:

Competitor launch targeting professionals
Negative news coverage
Regulatory change

Phase 3: Investigation

Industry research: Competing dating app (Hinge, The League) launched marketing campaign targeting professionals
Ad campaign: "Dating app for professionals, not hookups"
Positioning directly competitive to Tinder

Root cause: Competitor marketing campaign successfully attracted target demographic

Action:

Counter-positioning campaign
Feature development for relationship-seeking users
Monitor competitor strategy evolution

Example 4: Airbnb Booking Value Drop in LA Summer

Initial statement: "Average booking value down in Los Angeles"

Phase 1: Narrow the scope

Geography: Los Angeles specifically
Time: Summer months
Users: Younger travelers (18-25)
Booking type: Shorter stays (2-3 days)

Phase 2: Hypothesis brainstorming

Intrinsic:

Algorithm showing cheaper properties first (check - no change)
New low-cost inventory added (check - yes, some increase)

Extrinsic:

Budget airlines offering discount LA flights (MATCH!)
Economic downturn (dismissed - other cities unaffected)
Competitor taking high-value bookings (dismissed - market share stable)

Root cause: Multiple discount airlines launched summer routes to LA, attracting younger budget travelers who book cheaper accommodations

Action: None needed - increased volume of lower-value bookings is net positive for revenue. Adjust forecasting models to account for seasonal booking value variation.

Related Skills

north-star-alignment: Assesses whether metric change impacts top-line goals (Step 4 of diagnosis)
funnel-metric-mapping: Uses funnel location to narrow problem scope
tradeoff-evaluation: Evaluates whether to rollback, keep, or mitigate after diagnosis
metric-diagnosis (workflow): Orchestrates root cause investigation systematically
meeting-synthesis: Gathers customer evidence during investigation

Integration Points

Called by workflows:

metric-diagnosis - Steps 1-3: Complete investigation process
tradeoff-decision - May use to understand why metrics changed

May call:

meeting-synthesis to find customer evidence of issues
north-star-alignment to assess strategic impact
funnel-metric-mapping to identify which stage is affected

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/root-cause-diagnosis
License: MIT License

Featured Tools

Join Our Newsletter

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Root Cause Diagnosis

Purpose

When to Use This Skill

The Iron Law

The 4 Critical Dimensions

Dimension 1: People (User Segments)

Dimension 2: Geography

Dimension 3: Technology (Platform)

Dimension 4: Time

Intrinsic vs. Extrinsic Factors

Intrinsic Factors (Internal)

Extrinsic Factors (External)

The Hypothesis Table Method

Step 1: Create Table Structure

Step 2: Fill in Predictions

Step 3: Match Patterns

Step 4: Request Differentiating Data

Example Table: Snapchat Usage Down

Systematic Investigation Sequence

Phase 1: Data Quality Verification (5 minutes)

Phase 2: Narrow the Scope (15 minutes)

Phase 3: Internal Investigation (30 minutes)

Phase 4: External Investigation (30 minutes)

Phase 5: Hypothesis Testing (Ongoing)

Common Diagnostic Patterns

Pattern 1: Gradual Decline Over Weeks

Pattern 2: Sudden Overnight Drop

Pattern 3: Recurring Pattern (Weekly, Daily)

Pattern 4: Segment-Specific Issue

Stakeholder Consultation Strategy

When to Talk to Sales

When to Talk to Customer Success

When to Talk to Marketing

When to Talk to Engineering

Common Mistakes

Anti-Rationalization Blocks

Success Criteria

Real-World Examples

Example 1: Microsoft Teams Downloads Down 50%

Example 2: Snapchat Usage Drop

Example 3: Tinder Usage Drop Among Professionals

Example 4: Airbnb Booking Value Drop in LA Summer

Related Skills

Integration Points

Recommended Agent Skills

agent-ops-spec

agent-ops-state

agent-ops-spec

agent-ops-testing

agent-ops-testing

agent-ops-state