troubleshooting-config-item
Troubleshoots infrastructure and application configuration items in Mission Control by diagnosing health issues, analyzing recent changes, and investigating resource relationships. Use when users ask about unhealthy or failing resources, mention specific config items by name or ID, inquire about Kubernetes pods/deployments/services, AWS EC2 instances/volumes, Azure VMs, or other infrastructure components. Also use when investigating why a resource is down, stopped, degraded, or showing errors, or when analyzing what changed that caused an issue.
Install this agent skill to your Project
npx add-skill https://github.com/flanksource/claude-code-plugin/tree/main/skills/troubleshooting-config-item
SKILL.md
Config Item Troubleshooting Skill
Core Purpose
This skill enables Claude to troubleshoot infrastructure and application configuration items in Mission Control, diagnose health issues, analyze changes, and identify root causes through systematic investigation of config relationships and history.
Understanding Config Items
A ConfigItem represents a discoverable infrastructure or application configuration (Kubernetes Pods, AWS EC2 instances, Azure VMs, database instances, etc.). Each config item contains:
- health: Overall health status ("healthy", "unhealthy", "warning", "unknown")
- status: Operational state (e.g., "Running", "Stopped", "Pending")
- description: Human-readable description (often contains error messages when unhealthy)
- .config: The actual JSON specification/manifest (e.g., Kubernetes Pod spec, AWS instance details)
- type: The kind of resource (e.g., "Kubernetes::Pod", "AWS::EC2::Instance")
- tags: Metadata for filtering and organization
- parent_id/path: Hierarchical relationships to other configs
- external_id: External system identifier
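As a rough illustration of the fields listed above, a config item could be modelled as a small Python dataclass. This is a sketch under assumptions, not the Mission Control schema: the `id` and `name` fields are taken from the search attributes mentioned later, the representation of `tags` as a string-to-string mapping is a guess, and `needs_attention` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ConfigItem:
    """Illustrative shape only; not the actual Mission Control schema."""
    id: str
    name: str
    type: str                          # e.g. "Kubernetes::Pod", "AWS::EC2::Instance"
    health: str = "unknown"            # "healthy", "unhealthy", "warning", "unknown"
    status: Optional[str] = None       # e.g. "Running", "Stopped", "Pending"
    description: Optional[str] = None  # often carries the error message
    config: dict[str, Any] = field(default_factory=dict)  # the .config manifest
    tags: dict[str, str] = field(default_factory=dict)    # assumed key/value shape
    parent_id: Optional[str] = None
    path: Optional[str] = None
    external_id: Optional[str] = None

    def needs_attention(self) -> bool:
        # Hypothetical helper: treat anything unhealthy or warning as worth a look.
        return self.health in ("unhealthy", "warning")


# Made-up example values, for illustration only.
pod = ConfigItem(
    id="hypothetical-id-1",
    name="api-7c9d",
    type="Kubernetes::Pod",
    health="unhealthy",
    status="Pending",
    description="Back-off pulling image",
)
print(pod.needs_attention())
```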
Key Workflows
Initial Investigation
1. Search and Identify the Config
Use the MCP search_catalog tool to find the config item:
- Search by id, name, type, tags, or other attributes
- Narrow down to the specific config experiencing issues
2. Get Complete Config Details
Use the MCP describe_catalog tool to retrieve full config information:
- Review the health field for overall status
- Check the status field for operational state
- Read the description field carefully - this often contains error messages or status information
- Examine the .config JSON field - this contains the full specification/manifest
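The review order in step 2 can be sketched as a small function that reads the fields in turn and collects findings. It assumes the result of describe_catalog has already been loaded into a plain dictionary with these keys; the actual response shape of the tool is not specified here, and `assess` is a hypothetical name.

```python
def assess(config_item: dict) -> list[str]:
    """Collect first-pass findings in the order: health, status, description."""
    findings = []

    health = config_item.get("health", "unknown")
    if health != "healthy":
        findings.append(f"health is {health}")

    status = config_item.get("status")
    if status:
        findings.append(f"status is {status}")

    # The description often carries the error message itself, so quote it verbatim.
    description = config_item.get("description")
    if description:
        findings.append(f"description says: {description}")

    return findings


# Made-up item mirroring an EC2 instance that failed to start.
instance = {
    "type": "AWS::EC2::Instance",
    "health": "unhealthy",
    "status": "stopped",
    "description": "InsufficientInstanceCapacity",
}
for finding in assess(instance):
    print(finding)
```

If these findings do not explain the problem, the next place to look is the .config JSON and then the change history.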
Change Analysis
3. Review Recent Changes
If the issue isn't immediately apparent, use the MCP search_catalog_changes tool:
- Get changes for the specific config item
- Look for recent modifications to the specification
- Check change_type (created, updated, deleted)
- Review severity (critical, high, medium, low, info)
- Examine patches and diff fields to see what changed
- Check source to understand where the change originated
- Note the created_at timestamp to correlate with when issues started
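One way to correlate changes with the start of an issue is to keep only the changes made in a window before the issue began and rank them by severity. The sketch below assumes each change is a dictionary with `severity` and an ISO 8601 `created_at` string; the 24-hour window, the `suspect_changes` name, and the sample `source` values are illustrative assumptions rather than anything defined by search_catalog_changes.

```python
from datetime import datetime, timedelta, timezone

# Lower rank sorts first; values follow the severity list above.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}


def suspect_changes(changes, issue_started, window=timedelta(hours=24)):
    """Return changes made within `window` before the issue, most severe first."""
    earliest = issue_started - window

    def created(change):
        return datetime.fromisoformat(change["created_at"])

    candidates = [c for c in changes if earliest <= created(c) <= issue_started]
    # Sort by severity, then by most recent first within the same severity.
    return sorted(
        candidates,
        key=lambda c: (
            SEVERITY_RANK.get(c.get("severity"), len(SEVERITY_RANK)),
            -created(c).timestamp(),
        ),
    )


# Made-up change records for illustration.
changes = [
    {"change_type": "updated", "severity": "info",
     "created_at": "2024-05-01T09:00:00+00:00", "source": "kubernetes"},
    {"change_type": "updated", "severity": "high",
     "created_at": "2024-05-01T11:30:00+00:00", "source": "gitops"},
    {"change_type": "created", "severity": "critical",
     "created_at": "2024-04-20T08:00:00+00:00", "source": "kubernetes"},
]
issue_started = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
result = suspect_changes(changes, issue_started)
print([c["severity"] for c in result])
```

The change from 20 April falls outside the window and is dropped, even though it has the highest severity.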
Relationship Navigation
4. Investigate Related Configs
Use the MCP get_related_configs tool to navigate the config hierarchy:
- Children: Resources created/managed by this config
- Example: A Kubernetes Deployment → ReplicaSets → Pods
- Example: An AWS Auto Scaling Group → EC2 Instances
- Parents: Resources that manage this config
- Example: A Pod → ReplicaSet → Deployment
- Dependencies: Resources this config depends on
- Example: A Pod → ConfigMaps, Secrets, PersistentVolumeClaims
Troubleshooting Pattern: When a parent resource is unhealthy, investigate its children to find the actual failing component. When a child is unhealthy, check the parent for misconfigurations.
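The parent-to-child half of this pattern can be sketched as a traversal that follows unhealthy children until it reaches items with no unhealthy children of their own. The example works on in-memory dictionaries standing in for data gathered through get_related_configs; the function name, the data layout, and the item names are hypothetical.

```python
def find_failing_leaves(root_id, items, children):
    """Follow unhealthy children from root_id down to the deepest unhealthy items.

    items:    maps config id -> {"health": ...}
    children: maps config id -> list of child config ids
    """
    failing = []
    seen = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        unhealthy_children = [
            child for child in children.get(current, [])
            if items[child]["health"] != "healthy"
        ]
        # An unhealthy item with no unhealthy children is a likely failing component.
        if items[current]["health"] != "healthy" and not unhealthy_children:
            failing.append(current)
        stack.extend(unhealthy_children)
    return failing


# Made-up hierarchy: Deployment -> ReplicaSets -> Pods.
items = {
    "deploy/api": {"health": "unhealthy"},
    "rs/api-old": {"health": "healthy"},
    "rs/api-new": {"health": "unhealthy"},
    "pod/api-new-a": {"health": "unhealthy"},
    "pod/api-new-b": {"health": "healthy"},
}
children = {
    "deploy/api": ["rs/api-old", "rs/api-new"],
    "rs/api-new": ["pod/api-new-a", "pod/api-new-b"],
}
print(find_failing_leaves("deploy/api", items, children))
```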
Critical Requirements
Hierarchical Thinking:
- Kubernetes: Namespace → Deployment → ReplicaSet → Pod → Container
- AWS: VPC → Subnet → EC2 Instance → Volume
- Azure: Resource Group → VM → Disk
Change Impact Analysis:
- Compare current config with previous working state
- Identify what changed and when
- Correlate timing of changes with health degradation
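Comparing the current config with a previous working state can be sketched as a recursive dictionary diff that reports each changed path. This is a generic technique, not how Mission Control computes its diff or patches fields; `diff_config` and the sample manifests are hypothetical.

```python
def diff_config(previous, current, prefix=""):
    """Return (path, old, new) tuples for every value that differs.

    Nested dictionaries are compared key by key; lists and scalars are
    compared as whole values. A missing key is reported as None.
    """
    changes = []
    for key in sorted(set(previous) | set(current)):
        path = f"{prefix}.{key}" if prefix else str(key)
        old, new = previous.get(key), current.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            changes.extend(diff_config(old, new, path))
        elif old != new:
            changes.append((path, old, new))
    return changes


# Made-up manifests: the last known-good spec and the current one.
previous = {"spec": {"replicas": 3, "image": "registry.example/api:1.4.2"}}
current = {"spec": {"replicas": 3, "image": "registry.example/api:1.4.3-typo"}}
for path, old, new in diff_config(previous, current):
    print(f"{path}: {old} -> {new}")
```

Pairing each reported path with the created_at timestamp of the corresponding change gives the evidence needed to tie a modification to the onset of the health degradation.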
Evidence-Based Diagnosis:
- Support conclusions with specific evidence from the config data
- Quote relevant error messages from description fields
- Reference specific fields in the .config JSON
- Cite change diffs and timestamps
Diagnosis Workflow
Follow this systematic approach:
- Identify - Find the config item
- Assess - Review health, status, description, and .config spec
- Analyze Changes - Check recent modifications and events
- Navigate Relationships - Investigate parent/child/dependency configs
- Review Analysis - Check automated findings
- Synthesize - Determine root cause from all evidence
- Recommend - Provide specific remediation steps
Example Troubleshooting Scenarios
Scenario 1: Unhealthy Kubernetes Deployment
- Get Deployment details → health: unhealthy
- Get related configs (children) → ReplicaSets → Pods
- Find Pod in ImagePullBackOff
- Check Pod .config → image pull error
- Check changes → recent image tag update
- Root cause: Invalid image tag deployed
- Recommendation: Rollback to previous image or fix image tag
Scenario 2: AWS EC2 Instance Issues
- Get Instance details → status: stopped, health: unhealthy
- Check description → "InsufficientInstanceCapacity"
- Review changes → instance type changed to unavailable type
- Get related configs → Security Groups, Volumes
- Root cause: Requested instance type not available in AZ
- Recommendation: Change to available instance type or different AZ