Agent skill
troubleshooting-health-checks
Debugs and troubleshoots Mission Control health checks by analyzing check configurations, reviewing failure patterns, and identifying root causes. Use when users ask about failing health checks, mention specific health check names or IDs, inquire why a health check is failing or unhealthy, or need help understanding health check errors and timeouts.
Install this agent skill to your Project
npx add-skill https://github.com/flanksource/claude-code-plugin/tree/main/skills/troubleshooting-health-checks
SKILL.md
Health Check Troubleshooting Skill
Core Purpose
This skill enables Claude to troubleshoot Mission Control health checks by analyzing check configurations, diagnosing failure patterns, identifying timeout and error root causes, and recommending configuration adjustments to improve reliability.
Note: Read @skills/troubleshooting-health-checks/references/query-syntax.md for query syntax
Health check troubleshooting workflow
Copy this checklist and track your progress:
Troubleshooting Progress:
- [ ] Step 1: Gather health check information
- [ ] Step 2: Analyze failure patterns
- [ ] Step 3: Cross-reference configuration issues
- [ ] Step 4: Create diagnostic summary
- [ ] Step 5: Verify remediation steps
Gather health check information
Start by getting the ID of the check in question.
Use search_health_checks with query syntax to find checks; see @skills/troubleshooting-health-checks/references/query-syntax.md for the syntax.
If you cannot resolve the health check ID from the name the user provided, use list_all_checks to get complete metadata for all health checks.
Then, follow this procedure:
- Historical context: Use get_check_status to retrieve execution history.
- Investigate the check specification: Understand the intention of the check.
- Investigate the changes to the canary: Use search_catalog_changes(<canary_uuid>) to get the changes on the canary. Look at the change details to see any new changes to the specification.
Analyze failure patterns
Examine the historical data to identify patterns. Look for:
- Intermittent failures: Passes sometimes, fails others
  - Suggests: Network instability, load-related issues, race conditions
- Consistent failures: Always failing
  - Suggests: Configuration error, endpoint down, authentication issue
- Recent pattern changes: Was passing, now failing
  - Suggests: Recent deployment, config change, infrastructure change
- Timeout patterns: Fails with timeout errors
  - Suggests: Performance degradation, insufficient timeout value
- Time-based patterns: Fails at specific times
  - Suggests: Scheduled jobs, traffic patterns, resource contention
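The pattern categories above can be sketched as a small classifier over execution history. This is a minimal illustration, not part of Mission Control: the Execution record shape, the classify_failures helper, and the thresholds are all assumptions, and the real get_check_status output may be structured differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Execution:
    # Hypothetical record shape; the real get_check_status output may differ.
    time: datetime
    passed: bool
    error: Optional[str] = None


def classify_failures(history: List[Execution]) -> str:
    """Return a coarse label for the failure pattern in `history` (oldest first)."""
    if not history:
        return "no-data"
    failures = [e for e in history if not e.passed]
    if not failures:
        return "healthy"
    # Timeout pattern: more than half of the failures mention a timeout.
    timeouts = [e for e in failures if e.error and "timeout" in e.error.lower()]
    if len(timeouts) * 2 > len(failures):
        return "timeout"
    # Consistent: every execution in the window failed.
    if len(failures) == len(history):
        return "consistent"
    # Recent change: passes first, then nothing but failures.
    first_fail = next(i for i, e in enumerate(history) if not e.passed)
    if all(not e.passed for e in history[first_fail:]):
        return "recent-change"
    # Time-based: at least three failures, all in the same hour of day.
    if len(failures) >= 3 and len({e.time.hour for e in failures}) == 1:
        return "time-based"
    return "intermittent"
```

The label is only a starting point for the investigation; a "recent-change" result, for example, is the cue to compare the first failure time against the canary's change history.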
Duration analysis:
- Increasing duration → Performance degrading (may lead to timeouts)
- Spiky duration → Intermittent load or resource contention
- Consistent slow duration → Timeout threshold too aggressive
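The three duration signals can be approximated with simple statistics. The sketch below is an assumption-laden illustration: analyze_durations is a hypothetical helper, the input is assumed to be a list of durations in milliseconds ordered oldest first, and the cutoffs (3x the median, 25% growth, 80% of the timeout) are arbitrary values to tune per check.

```python
from statistics import mean, median
from typing import List


def analyze_durations(durations_ms: List[float], timeout_ms: float) -> str:
    """Label a series of execution durations (oldest first)."""
    if len(durations_ms) < 4:
        return "insufficient-data"
    half = len(durations_ms) // 2
    older, newer = durations_ms[:half], durations_ms[half:]
    med = median(durations_ms)
    # Spiky: an outlier well above the typical duration (3x is an arbitrary cutoff).
    if max(durations_ms) > 3 * med:
        return "spiky"
    # Increasing: the newer half averages at least 25% slower than the older half.
    if mean(newer) > 1.25 * mean(older):
        return "increasing"
    # Consistently slow: the typical duration sits within 20% of the timeout.
    if med > 0.8 * timeout_ms:
        return "consistently-slow"
    return "stable"
```

For instance, a series that climbs from 3000 ms to 5000 ms against a 5000 ms timeout is labeled "increasing", which points at backend degradation rather than at the timeout value alone.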
Create diagnostic summary
Organize findings systematically. Include:
- Primary diagnosis
  - Root cause identification with supporting evidence
  - Quote specific error messages from last_result
  - Reference historical pattern statistics
  - Cite configuration values that contribute to the issue
- Contributing factors
  - Secondary issues that may worsen the problem
  - Environmental factors (network, infrastructure)
  - Configuration mismatches
- Impact assessment
  - How long has the issue persisted
  - Frequency and severity of failures
  - Potential downstream effects
Example diagnostic format:
The health check "api-status" (ID: check-123) is failing based on get_check_status history showing error "timeout exceeded" in recent executions. Historical data shows duration increasing from 3s to 5s over 6 hours. This indicates backend performance degradation requiring investigation and potential timeout adjustment.
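If the summary is assembled programmatically, one way to keep the three parts (primary diagnosis, contributing factors, impact) together is a small record with a renderer. The Diagnosis class and render function below are hypothetical names used only for illustration; the skill does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Diagnosis:
    # Hypothetical container for the findings listed above.
    check_name: str
    check_id: str
    root_cause: str
    evidence: List[str]
    contributing_factors: List[str] = field(default_factory=list)
    impact: str = "unknown"


def render(d: Diagnosis) -> str:
    """Format a diagnosis as plain text, one finding per line."""
    lines = [
        f'Health check "{d.check_name}" (ID: {d.check_id})',
        f"Primary diagnosis: {d.root_cause}",
    ]
    lines += [f"  - {e}" for e in d.evidence]
    if d.contributing_factors:
        lines.append("Contributing factors:")
        lines += [f"  - {c}" for c in d.contributing_factors]
    lines.append(f"Impact: {d.impact}")
    return "\n".join(lines)
```

Keeping evidence as a separate list makes it harder to state a root cause without quoting the error messages and statistics that support it.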
Verify remediation steps
Provide and validate specific fixes. For each recommendation:
- Use run_health_check to test fixes immediately
- Verify check passes after configuration changes
- Monitor execution duration and response
Success criteria checklist
Before completing troubleshooting:
- Health check configuration fully analyzed
- Failure pattern clearly identified with evidence
- Root cause diagnosed with supporting data
- Specific remediation steps provided
- Configuration adjustments justified
- Validation approach included