Agent skill
research
Systematic investigation and root cause analysis. Use when debugging persistent issues, understanding complex systems, or before making architectural decisions.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/research-awannaphasch2016-jousef-landing
SKILL.md
Research Skill
Tech Stack: AWS CLI, Git, ripgrep, jq, curl, browser DevTools
Source: Extracted from CLAUDE.md investigation principles and debugging patterns.
When to Use This Skill
Use the research skill when:
- ✓ Same bug persists after 2+ fix attempts
- ✓ Need to understand unfamiliar codebase
- ✓ Investigating production incidents
- ✓ Making architectural decisions
- ✓ Debugging complex system interactions
- ✓ Learning new technology/library
DO NOT use this skill for:
- ✗ First fix attempt (try the obvious solution first)
- ✗ Well-understood problems (just fix it)
- ✗ Time-critical incidents (fix first, investigate later)
Quick Research Decision Tree
What's the problem?
├─ First time seeing this issue?
│ ├─ YES → Try obvious fix (iteration)
│ └─ NO → Same bug after 2 attempts? → RESEARCH
│
├─ Production incident?
│ ├─ Affecting users NOW? → Rollback/hotfix first, research later
│ └─ Post-incident analysis? → Deep research
│
├─ Need to understand codebase?
│ ├─ Specific function/module? → Read code + tests
│ ├─ System architecture? → Trace request flow
│ └─ Historical context? → Git blame + commit history
│
├─ Technology decision?
│ ├─ Read official docs (not blog posts)
│ ├─ Compare alternatives (trade-offs)
│ ├─ Prototype with real use case
│ └─ Document decision (ADR)
│
└─ API/Library integration?
├─ Read type requirements (don't assume)
├─ Check version compatibility
├─ Test minimal example
└─ Verify with actual data
Loop Pattern: Meta-Loop → Initial-Sensitive
Escalation Trigger:
/reflectreveals: "I've tried 3 fixes, all failed with same error"/traceoutput identical across attempts- Pattern: Stuck in retrying loop (execution changes, outcome doesn't)
- Action: Use
/hypothesisto question assumptions (switch to initial-sensitive)
Tools Used:
/observe- Notice system behavior (what's failing)/hypothesis- Generate alternative explanations (why might it fail differently than I think?)/research- Test hypotheses systematically/validate- Check if new understanding correct/reflect- Synthesize learnings after investigation
Why This Works: Research skill naturally fits initial-sensitive loop—you're questioning assumptions, not just fixing execution.
See Thinking Process Architecture - Feedback Loops for structural overview.
Core Research Principles
Principle 1: Research Before Iteration
From CLAUDE.md:
"When same bug persists after 2 fix attempts, STOP iterating and START researching. Invest 30-60 minutes understanding root cause instead of deploying more guesses."
Why This Matters:
- ❌ Iteration: Fast feedback, but wasteful after 2+ failed attempts
- ✅ Research: Upfront cost, prevents 3+ failed deployment cycles
Pattern:
Attempt 1: Hypothesis + Deploy (8 minutes)
↓ Failed
Attempt 2: Different hypothesis + Deploy (8 minutes)
↓ Failed
Attempt 3: ❌ STOP - Don't deploy another guess
SWITCH TO RESEARCH MODE:
- Read specs/docs (15 minutes)
- Inspect real data (10 minutes)
- Reproduce locally (20 minutes)
- Identify root cause (15 minutes)
↓ Total: 60 minutes research
Attempt 3 (with root cause): Deploy fix (8 minutes)
✅ Success
Total time: 76 minutes (vs 120+ minutes with blind iteration)
Principle 2: Read Primary Sources
Hierarchy of Information Quality:
| Source Type | Reliability | When to Use |
|---|---|---|
| Official Docs | Highest | First stop for API behavior, type requirements |
| Source Code | Very High | When docs are unclear or incomplete |
| GitHub Issues | High | For known bugs, edge cases, workarounds |
| Stack Overflow | Medium | For common problems, after reading docs |
| Blog Posts | Low | For general concepts, not implementation details |
| ChatGPT/AI | Lowest | Starting point only, always verify |
Example: Type System Integration
# ❌ DON'T: Ask ChatGPT "Does PyMySQL accept dicts for JSON columns?"
# ChatGPT might hallucinate or give outdated answer
# ✅ DO: Read PyMySQL documentation
open https://pymysql.readthedocs.io/en/latest/
# ✅ DO: Test with minimal example
python3 << 'EOF'
import pymysql
import json
# Test: Does PyMySQL accept dict for JSON column?
data = {'key': 'value'}
try:
cursor.execute("INSERT INTO test (json_col) VALUES (%s)", (data,))
print("✅ Dict accepted")
except TypeError as e:
print(f"❌ Dict rejected: {e}")
# Try JSON string instead
cursor.execute("INSERT INTO test (json_col) VALUES (%s)", (json.dumps(data),))
print("✅ JSON string accepted")
EOF
Principle 3: Reproduce Locally
Before deploying fixes, reproduce the issue locally.
Benefits:
- ✅ Fast iteration (seconds vs minutes)
- ✅ Can debug with breakpoints
- ✅ No cloud costs
- ✅ Can test edge cases easily
Pattern:
# Step 1: Extract minimal failing case
# Instead of testing entire Lambda function...
# ❌ BAD: Deploy to AWS, check logs, repeat
aws lambda invoke --function-name worker --payload '{}' /tmp/response.json
# Wait 30 seconds for CloudWatch logs...
# ✅ GOOD: Reproduce locally
python3 << 'EOF'
from src.data.news_fetcher import NewsFetcher
# Exact same code as Lambda
fetcher = NewsFetcher()
result = fetcher.fetch_news('NVDA19')
print(result) # Immediate feedback
EOF
Principle 4: Inspect Real Data
Don't assume data shape—verify with actual examples.
Common Mistakes:
- Assuming API returns dict (might return list)
- Assuming field exists (might be null/missing)
- Assuming type (might be string not int)
Pattern:
# ❌ DON'T: Assume
# "The API returns a list of tickers"
# ✅ DO: Verify
curl -s https://api.example.com/tickers | jq . > sample_response.json
cat sample_response.json
# Inspect structure:
# - Is it a list or dict?
# - What fields are present?
# - What types are they?
# - Any null values?
# - Any nested structures?
Example: Aurora Schema Investigation
# Connect to Aurora via SSM tunnel
aws ssm start-session \
--target i-1234567890abcdef0 \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["aurora-endpoint"],"portNumber":["3306"],"localPortNumber":["3307"]}'
# Inspect actual schema (don't assume)
mysql -h 127.0.0.1 -P 3307 -u admin -p << 'SQL'
DESCRIBE precomputed_reports;
-- Check actual data types
SELECT
symbol,
report_json,
typeof(report_json) as json_type
FROM precomputed_reports
LIMIT 1 \G
-- Check for null values
SELECT COUNT(*) as total,
SUM(CASE WHEN report_json IS NULL THEN 1 ELSE 0 END) as null_count
FROM precomputed_reports;
SQL
Research Workflow
See WORKFLOW.md for detailed step-by-step research process.
Investigation Checklist
See INVESTIGATION-CHECKLIST.md for systematic debugging checklist.
Boundary Verification
When: Investigating distributed systems (Lambda, Aurora, S3, SQS, Step Functions)
Problem: Code looks correct but fails in production due to unverified execution boundaries
Critical questions:
- WHERE does this code run? (Lambda, EC2, local?)
- WHAT environment does it require? (env vars, network, permissions?)
- WHAT external systems does it call? (Aurora schema, S3 bucket, API format?)
- WHAT are entity properties? (Lambda timeout/memory, Aurora connection limits, intended usage)
- HOW do I verify the contract? (Terraform config, SHOW COLUMNS, test access?)
Five layers of correctness:
- Syntactic: Code compiles (Python syntax valid)
- Semantic: Code does what it claims (logic correct)
- Boundary: Code can reach what it needs (network, permissions)
- Configuration: Entity config matches code requirements (timeout, memory)
- Intentional: Usage matches designed purpose (sync Lambda not for async work)
When to apply:
- "Code looks correct but doesn't work" bugs
- Multi-service workflows (Lambda → Aurora → S3)
- After 2 failed deployment attempts (infrastructure issues)
- Before concluding "code is correct" (verify execution context)
Verification workflow:
1. Identify execution boundaries (code → runtime, code → database, service → service)
2. Identify physical entities (WHICH Lambda, WHICH Aurora, WHICH S3 bucket)
3. Verify configuration matches requirements (timeout, memory, concurrency)
4. Verify intention matches usage (async Lambda not for sync API)
5. Progress through evidence layers (code → config → runtime → ground truth)
Integration with research workflow:
- Phase 1 (Observe): Notice boundary-related failure (timeout, permission denied, schema mismatch)
- Phase 2 (Hypothesize): Identify which boundary might be violated
- Phase 3 (Research): Apply boundary verification checklist systematically
- Phase 4 (Validate): Verify contract through ground truth (test actual execution)
See: Execution Boundary Checklist for comprehensive verification workflow
Related principles:
- Principle #20 (Execution Boundary Discipline) - CLAUDE.md
- Principle #2 (Progressive Evidence Strengthening) - Verify through all layers
- Principle #15 (Infrastructure-Application Contract) - Sync code and infrastructure
Architectural Investigations
When: Choosing between technologies, patterns, or architectural approaches
Problem: "X vs Y" comparisons get vague "it depends" answers without structured analysis
Solution: Apply OWL-based relationship analysis framework for systematic comparison
Steps:
-
Define concepts being compared
- What is X? (definition, location, purpose, examples)
- What is Y? (definition, location, purpose, examples)
-
Apply 4 relationship types:
- Part-Whole: Is one part of the other, or are they peers?
- Complement: Do they handle non-overlapping concerns that work together?
- Substitution: Can one replace the other? Under what conditions?
- Composition: Can they be layered/composed into a multi-tier system?
-
Document with concrete examples
- Provide real scenarios, not abstract theory
- Include trade-off analysis (what you gain vs lose)
-
Make recommendation based on analysis
- Grounded in relationship analysis, not intuition
- Include anti-patterns to avoid
See: Relationship Analysis Guide for comprehensive methodology
Example Use Cases:
- "Should I use Redis or DynamoDB for caching?" → Apply substitution & composition analysis
- "How do CDN and application caching relate?" → Apply complement & composition analysis
- "Microservices vs Monolith?" → Apply part-whole & trade-off analysis
Common Research Scenarios
Scenario 1: Type System Mismatch
Symptom: Same error after 2 attempts, silent failure
Research Steps:
-
Read Target System Documentation
bash# PyMySQL docs open https://pymysql.readthedocs.io/en/latest/modules/cursors.html # Look for: What types does execute() accept? -
Test Minimal Example
pythonimport pymysql import json # Test hypothesis: Dict vs JSON string data = {'key': 'value'} # Test 1: Dict try: cursor.execute("INSERT INTO test (json_col) VALUES (%s)", (data,)) except TypeError as e: print(f"Dict failed: {e}") # Test 2: JSON string cursor.execute("INSERT INTO test (json_col) VALUES (%s)", (json.dumps(data),)) print("JSON string succeeded") -
Apply Fix with Confidence
python# Now we KNOW json.dumps() is required def store_report(symbol: str, report_json: dict): cursor.execute( "INSERT INTO reports (symbol, report_json) VALUES (%s, %s)", (symbol, json.dumps(report_json)) # Convert to string )
Scenario 2: Production Incident
Symptom: Users reporting errors, dashboard shows spike
Research Steps:
-
Triage (< 5 minutes)
bash# Check error count aws logs filter-log-events \ --log-group-name /aws/lambda/worker \ --start-time $(($(date +%s) - 600))000 \ --filter-pattern "ERROR" \ --query 'length(events)' # Check when it started aws cloudwatch get-metric-statistics \ --namespace AWS/Lambda \ --metric-name Errors \ --dimensions Name=FunctionName,Value=worker \ --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ --period 60 \ --statistics Sum -
Immediate Mitigation (< 10 minutes)
bash# Rollback to previous version CURRENT=$(aws lambda get-alias --function-name worker --name live \ --query 'FunctionVersion' --output text) PREVIOUS=$((CURRENT - 1)) aws lambda update-alias \ --function-name worker \ --name live \ --function-version $PREVIOUS echo "Rolled back from v$CURRENT to v$PREVIOUS" -
Post-Incident Analysis (30-60 minutes)
bash# Collect evidence mkdir incident-$(date +%Y%m%d-%H%M%S) cd incident-* # Export error logs aws logs filter-log-events \ --log-group-name /aws/lambda/worker \ --start-time $INCIDENT_START \ --end-time $INCIDENT_END \ --filter-pattern "ERROR" > errors.json # Export metrics aws cloudwatch get-metric-statistics ... > metrics.json # Get deployment diff git diff v$PREVIOUS v$CURRENT > deployment.diff # Analyze (root cause analysis) # - What changed? # - What failed? # - Why did it fail? # - How to prevent?
Scenario 3: Unfamiliar Codebase
Task: Add feature to module you've never seen
Research Steps:
-
Find Entry Point
bash# Search for main function rg "def main" --type py # Search for Lambda handler rg "def lambda_handler" --type py # Search for CLI entry point rg "click.command" --type py -
Trace Request Flow
bash# Example: How does a ticker report get generated? # Step 1: Find API endpoint rg "\/api\/reports" src/ # Found: src/telegram/api/routes/reports.py # Step 2: Read route handler cat src/telegram/api/routes/reports.py # Calls: ReportService.generate_report() # Step 3: Find service rg "class ReportService" src/ # Found: src/telegram/services/report_service.py # Step 4: Read service cat src/telegram/services/report_service.py # Calls: workflow.run() with AgentState # Step 5: Find workflow rg "def run" src/workflow/ # Found: src/workflow/graph.py -
Read Tests for Examples
bash# Tests show how to use the code rg "test.*generate_report" tests/ --type py cat tests/telegram/services/test_report_service.py # Shows: # - How to mock dependencies # - Expected input format # - Expected output structure -
Check Git History for Context
bash# Why was this code written? git log --oneline src/telegram/services/report_service.py # Read relevant commits git show abc123 # Find related PRs gh pr list --search "report service" --state closed
Anti-Patterns to Avoid
Anti-Pattern 1: Iteration Without Research
# ❌ BAD: Blind iteration
git commit -m "Try fix 1"
git push
# Wait 8 minutes...
# Failed
git commit -m "Try fix 2"
git push
# Wait 8 minutes...
# Failed
git commit -m "Try fix 3"
git push
# Wait 8 minutes...
# Failed
# Total: 24+ minutes wasted
Solution: After 2 attempts, research the root cause.
Anti-Pattern 2: Trusting Secondary Sources
# ❌ BAD: Trust blog post
# Blog: "PyMySQL accepts dicts for JSON columns"
cursor.execute("INSERT INTO tbl (json_col) VALUES (%s)", ({'key': 'val'},))
# TypeError: not all arguments converted
# ✅ GOOD: Read official docs
open https://pymysql.readthedocs.io/
# Docs: "Parameters are passed as tuples"
# Must convert dict to JSON string first
cursor.execute("INSERT INTO tbl (json_col) VALUES (%s)", (json.dumps({'key': 'val'}),))
Anti-Pattern 3: Assuming Instead of Verifying
# ❌ BAD: Assume
# "The API probably returns a list of dicts"
for ticker in response: # Crashes if response is dict, not list
process(ticker)
# ✅ GOOD: Verify
curl -s https://api.example.com/tickers | jq . > sample.json
cat sample.json
# Oh, it's actually a dict with a 'tickers' key!
for ticker in response['tickers']: # Correct
process(ticker)
Quick Reference
Research Triggers
| Situation | Action |
|---|---|
| First attempt failed | Try one more fix |
| Second attempt failed | START RESEARCH |
| Production incident | Mitigate first, research later |
| Unfamiliar codebase | Trace request flow, read tests |
| API integration | Read official docs, test minimal example |
| Type mismatch | Inspect real data, verify assumptions |
Time Investment
| Research Type | Time Budget | Expected Outcome |
|---|---|---|
| Quick lookup | 5-10 minutes | Confirm type, read docs |
| Root cause analysis | 30-60 minutes | Understand why bug persists |
| Codebase exploration | 1-2 hours | Understand module/system |
| Technology evaluation | 4-8 hours | Compare alternatives, prototype |
File Organization
.claude/skills/research/
├── SKILL.md # This file (entry point)
├── WORKFLOW.md # Step-by-step research process
└── INVESTIGATION-CHECKLIST.md # Systematic debugging checklist
Next Steps
- For research workflow: See WORKFLOW.md
- For debugging checklist: See INVESTIGATION-CHECKLIST.md
References
- How to Debug - Systematic debugging by John Regehr
- The Scientific Method of Debugging - Brendan Gregg
- Debugging: The 9 Indispensable Rules
- Systems Performance - Investigation methodologies
Didn't find tool you were looking for?