Agent skill
promotion-eval-mission-control
Evaluates a Mission Control environment's platform health for release or promotion readiness. Checks health check pipelines, config scrapers, background jobs, notifications, event queues, and MC infrastructure. Use for pre-release checks, environment promotion, or environment status. Triggers: "check environment health", "is it ready for release", "pre-release health check", "evaluate environment", "promotion readiness", "environment status"
Install this agent skill to your Project
npx add-skill https://github.com/flanksource/claude-code-plugin/tree/main/skills/promotion-eval-mission-control
SKILL.md
Mission Control Promotion Evaluation Skill
Core Purpose
Systematically evaluate the health of a Mission Control environment as a platform (not demo workloads) to support release and promotion readiness decisions. This skill queries the live environment using MCP tools and produces a structured diagnostic report.
Important Distinction
This evaluation focuses on Mission Control platform health — the components that make MC work (canary-checker, config-db, mission-control deployments, core health checks, scrapers, jobs). It does NOT evaluate demo workloads or user-created resources unless they indicate a platform problem.
Known expected-fail checks: Some health checks have the label Expected-Fail=true. These are intentional test checks and should be excluded from failure counts and findings.
Parameters
When invoked, check if the user specified:
- time_window: Lookback period (default: 24h)
- target: Environment to evaluate (ask user if not specified)
Evaluation Procedure
Execute these phases sequentially. After each phase, record component status and findings.
Initialize a running JSON result conforming to @skills/promotion-eval-mission-control/schema.json with:
{
  "verdict": "READY",
  "evaluated_at": "<current ISO timestamp>",
  "time_window": "<window>",
  "target": "<target environment>",
  "components": {},
  "findings": [],
  "recommendations": []
}
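As a rough illustration, the running result could be built and updated in Python as follows. The helper names `new_result` and `record_component`, the example target `staging`, and the per-component layout (`status`, `metrics`) are assumptions made for this sketch; the authoritative shape is whatever schema.json defines.

```python
from datetime import datetime, timezone


def new_result(target: str, time_window: str = "24h") -> dict:
    """Build the initial result; the verdict starts as READY and is revised later."""
    return {
        "verdict": "READY",
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
        "time_window": time_window,
        "target": target,
        "components": {},
        "findings": [],
        "recommendations": [],
    }


def record_component(result: dict, name: str, status: str, metrics: dict) -> None:
    """Store one phase outcome (hypothetical layout; check schema.json)."""
    result["components"][name] = {"status": status, "metrics": metrics}


result = new_result(target="staging")
record_component(result, "health_checks", "PASS", {"health_rate": 100.0})
```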
Catalog Type Reference
These are the confirmed MissionControl catalog types:
- MissionControl::ScrapeConfig: config scrapers
- MissionControl::Playbook: playbook definitions
- MissionControl::Notification: notification rules
- MissionControl::Job: background jobs
- MissionControl::Canary: canary check definitions
- MissionControl::Connection: external connections
- MissionControl::Topology: topology definitions
Phase 1: Health Check Pipelines
Goal: Determine if health checks are running and passing.
- Get failing checks directly: view_failing-health-checks_mission-control with withRows=true and select=["id","name","type","status","severity","last_transition_time","description"]
- Get total check count: list_all_checks for baseline metrics
- Filter out expected failures: Exclude checks with label Expected-Fail=true from failure counts
- Drill into real failures: For each genuinely unhealthy check (not expected-fail), call get_check_status(id, limit=10) to retrieve recent execution history. Classify as:
  - Transient: Occasional failures mixed with passes
  - Persistent: Consistently failing across recent executions
- Assess staleness: From the check list, identify checks where updated_at is older than the time window
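The filtering and classification steps above might look like this in Python. This is a sketch under stated assumptions: each check is a dict with an optional `labels` mapping, and the execution history has already been reduced to a list of booleans (True = passed). Neither shape is confirmed by the tool output, and reading "consistently failing" as "every recent execution failed" is one interpretation.

```python
def is_expected_fail(check: dict) -> bool:
    """True when the check carries the Expected-Fail=true label."""
    labels = check.get("labels") or {}  # assumed field name
    return str(labels.get("Expected-Fail", "")).lower() == "true"


def classify_failure(recent_passed: list[bool]) -> str:
    """Classify an unhealthy check from its recent executions (True = passed)."""
    if not recent_passed:
        return "unknown"      # no history to judge from
    if not any(recent_passed):
        return "persistent"   # every recent execution failed
    return "transient"        # failures mixed with passes
```

Checks for which `is_expected_fail` returns True would be counted in expected_fail_count and left out of the failure counts.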
Metrics to record:
- total_checks: Total number of health checks
- healthy_count: Number currently healthy
- unhealthy_count: Number currently unhealthy (excluding expected-fail)
- expected_fail_count: Checks labeled Expected-Fail
- persistent_failures: Number failing consistently
- stale_count: Number not updated within time window
- health_rate: Percentage healthy (excluding expected-fail from denominator)
Verdict logic:
- PASS: No persistent failures, stale_count == 0, health_rate > 95%
- WARN: Some transient failures OR 1-2 stale checks OR health_rate 80-95%
- FAIL: Any persistent failures OR stale_count > 2 OR health_rate < 80%
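One way to express the health rate and the Phase 1 verdict thresholds is sketched below. It assumes healthy_count does not include expected-fail checks, and it treats the boundary values (exactly 80% and exactly 95%) as WARN, which is how the ranges above read. The function names and the zero-denominator handling are choices made for this example.

```python
def health_rate(total_checks: int, healthy_count: int, expected_fail_count: int) -> float:
    """Percentage healthy, with expected-fail checks removed from the denominator."""
    evaluable = total_checks - expected_fail_count
    if evaluable <= 0:
        return 0.0  # assumption: no evaluable checks means no evidence of health
    return 100.0 * healthy_count / evaluable


def health_check_verdict(persistent_failures: int, transient_failures: int,
                         stale_count: int, rate: float) -> str:
    """Apply the FAIL conditions first, then WARN, otherwise PASS."""
    if persistent_failures > 0 or stale_count > 2 or rate < 80:
        return "FAIL"
    if transient_failures > 0 or stale_count > 0 or rate <= 95:
        return "WARN"
    return "PASS"
```

Checking FAIL before WARN means a single persistent failure cannot be masked by an otherwise high health rate.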
Phase 2: Config Scrapers
Goal: Verify config scrapers are active and producing fresh data.
- Find scraper configs: search_catalog with type=MissionControl::ScrapeConfig and select=["id","name","health","status","updated_at"]
- Check freshness: For each scraper, check the updated_at timestamp. Flag any not updated within the expected schedule (typically 1h)
- Check scraper errors: view_mission-control-system_mission-control with withPanels=true; this returns scraper error counts and a list of scrapers with errors
- Review recent changes: search_catalog_changes with type=MissionControl::ScrapeConfig created_at>now-{window} for config changes
Metrics to record:
- total_scrapers: Number of scraper configs found
- active_count: Scrapers updated within expected window
- stale_count: Scrapers not recently updated
- error_count: From system view scraper errors panel
Verdict logic:
- PASS: All scrapers active, no errors
- WARN: 1-2 scrapers slightly stale OR minor errors
- FAIL: Any scraper missing updates for > 2h OR significant errors
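The freshness check and scraper verdict could be sketched as follows. The 1h and 2h thresholds come from the text above. Everything else is an assumption of this example: the state names (`active`, `stale`, `missing`), the `errors_significant` flag (the procedure does not define "significant"), and treating three or more stale scrapers as FAIL, a case the verdict logic leaves open.

```python
from datetime import datetime, timedelta, timezone

EXPECTED_INTERVAL = timedelta(hours=1)  # "typically 1h" in the procedure
FAIL_AFTER = timedelta(hours=2)         # FAIL threshold from the verdict logic


def freshness(updated_at: datetime, now: datetime) -> str:
    """Bucket a scraper by the age of its last update."""
    age = now - updated_at
    if age > FAIL_AFTER:
        return "missing"
    if age > EXPECTED_INTERVAL:
        return "stale"
    return "active"


def scraper_verdict(states: list[str], error_count: int,
                    errors_significant: bool = False) -> str:
    """Combine per-scraper freshness states and error counts into one status."""
    stale = states.count("stale")
    # Assumption: more than two stale scrapers is treated as FAIL.
    if "missing" in states or errors_significant or stale > 2:
        return "FAIL"
    if stale > 0 or error_count > 0:
        return "WARN"
    return "PASS"
```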
Phase 3: Background Jobs & Playbooks
Goal: Check for failed playbook runs and job errors.
- Get failed job history: view_jobhistory_mission-control with withRows=true and select=["name","status","duration","error","timestamp"] limit=20
- Get failed playbook runs: get_playbook_failed_runs(limit=10) for recent failures
- Get recent playbook runs: get_playbook_recent_runs(limit=20) to calculate success rate
- Drill into failures: For any failed playbook runs, call get_playbook_run_steps(run_id) to understand the failure cause
- Check playbook catalog health: search_catalog with type=MissionControl::Playbook health=unhealthy to find unhealthy playbook definitions
Metrics to record:
- total_recent_runs: Total playbook runs in window
- failed_runs: Number of failed runs
- success_rate: Percentage of successful runs
- job_errors: Count of job errors from job history view
Verdict logic:
- PASS: success_rate > 95%, no recurring job errors
- WARN: success_rate 80-95% OR some job errors
- FAIL: success_rate < 80% OR critical/recurring job failures
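A minimal sketch of the success-rate arithmetic and the Phase 3 thresholds follows. Two choices here are assumptions: a window with no playbook runs yields no rate (None) and is not counted against the environment, and "critical/recurring job failures" is passed in as a flag because the procedure does not say how to detect it.

```python
from typing import Optional


def success_rate(total_recent_runs: int, failed_runs: int) -> Optional[float]:
    """Percentage of successful runs, or None when nothing ran in the window."""
    if total_recent_runs == 0:
        return None
    return 100.0 * (total_recent_runs - failed_runs) / total_recent_runs


def jobs_verdict(rate: Optional[float], job_errors: int,
                 recurring_job_failures: bool = False) -> str:
    """FAIL below 80% or on recurring failures; WARN at 80-95% or on any job error."""
    if recurring_job_failures or (rate is not None and rate < 80):
        return "FAIL"
    if job_errors > 0 or (rate is not None and rate <= 95):
        return "WARN"
    return "PASS"
```

For example, 1 failure in 20 runs gives a 95% success rate, which lands in the WARN band.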
Phase 4: Notification Delivery
Goal: Verify the notification pipeline is functioning.
- Get notification send history: view_notification-send-history_mission-control with withRows=true and select=["id","age","resource_name","resource_current_health","title","notification"] limit=20
- Get notification stats from system view: The view_mission-control-system_mission-control panel (already fetched in Phase 2) includes notification counts by status (SENT, SILENCED, REPEAT-INTERVAL, etc.)
- Find notification configs: search_catalog with type=MissionControl::Notification and select=["id","name","health","status"]
- Check for error notifications: For each notification config, call get_notifications_for_resource(resource_id, status=error, since=now-{window})
Metrics to record:
- total_notification_configs: Number of notification rules
- sent_count: From system view
- silenced_count: From system view
- error_count: Notifications with error status
- delivery_rate: sent / (sent + error) percentage
Verdict logic:
- PASS: No delivery errors, system view shows sends happening
- WARN: Some errors but delivery_rate > 95%
- FAIL: delivery_rate < 95% or notification system appears down
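The delivery-rate formula and verdict can be sketched as below. The thresholds above leave two cases open, and this example fills them with assumptions: a rate of exactly 95% is treated as FAIL, and a window with no sends and no errors is treated as WARN because sends cannot be confirmed. The `system_down` flag stands in for "notification system appears down", which the procedure does not define precisely.

```python
from typing import Optional


def delivery_rate(sent_count: int, error_count: int) -> Optional[float]:
    """sent / (sent + error) as a percentage, or None when there was no traffic."""
    attempts = sent_count + error_count
    if attempts == 0:
        return None
    return 100.0 * sent_count / attempts


def notification_verdict(sent_count: int, error_count: int,
                         system_down: bool = False) -> str:
    rate = delivery_rate(sent_count, error_count)
    if system_down or (rate is not None and rate <= 95):
        return "FAIL"   # assumption: exactly 95% counts as FAIL
    if rate is None:
        return "WARN"   # assumption: no traffic, so sends cannot be confirmed
    if error_count > 0:
        return "WARN"   # some errors, but delivery_rate is above 95%
    return "PASS"
```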
Phase 5: System & Event Queue
Goal: Check overall system health indicators, database, and event queue.
- System overview: view_mission-control-system_mission-control with withPanels=true (reuse from Phase 2 if already fetched)
  - Check scraper errors, notification stats, agent resource counts
- Database health: view_mission-control-database_mission-control with withPanels=true
  - Check DB size, active users, DB connections
- Connection health: list_connections to verify external integrations are configured
Metrics to record:
- db_size_bytes: Database size
- db_connections: Active connections
- active_users: User count
- total_connections: Number of configured connections
Verdict logic:
- PASS: DB healthy, connections configured, no concerning metrics
- WARN: High DB connections or large DB size growth
- FAIL: Database unreachable or critical system errors
Phase 6: MC Infrastructure Health
Goal: Verify Mission Control's own Kubernetes resources are healthy.
- Get MC pods directly: view_mission-control-pods_mission-control with withRows=true and select=["name","namespace","status","health","updated"]; this returns all MC-related pods
- Find MC deployments: search_catalog with type=Kubernetes::Deployment and name patterns:
  - name=mission-control*
  - name=canary-checker*
  - name=config-db*
  Use select=["id","name","health","status","updated_at"] for each.
- Describe unhealthy resources: For any unhealthy MC deployment or pod, call describe_catalog(id) to get full details including error messages
- Check recent changes: search_catalog_changes with type=Kubernetes::Deployment name=mission-control* created_at>now-{window} (and similar for canary-checker, config-db)
- Check related configs: For unhealthy resources, use get_related_configs to trace Deployment → ReplicaSet → Pod
Metrics to record:
- total_mc_pods: MC pods found
- healthy_pods: Healthy MC pods
- unhealthy_pods: Unhealthy MC pods
- total_mc_deployments: MC deployments found
- healthy_deployments: Healthy MC deployments
- recent_changes: Changes to MC components in time window
Verdict logic:
- PASS: All MC resources healthy, no concerning recent changes
- WARN: Recent deployments/changes but all healthy, OR minor pod restarts
- FAIL: Any unhealthy MC deployment or persistent pod failures
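As an illustration, the name-pattern matching and the Phase 6 verdict might be written like this. The three glob patterns are taken from the procedure. The row shape (dicts with `name` and `health`), the rule that anything other than `healthy` counts as unhealthy, and the function names are assumptions of this sketch. Pod restarts are not modeled.

```python
from fnmatch import fnmatch

MC_NAME_PATTERNS = ("mission-control*", "canary-checker*", "config-db*")


def is_mc_resource(name: str) -> bool:
    """True when the resource name matches one of the MC component patterns."""
    return any(fnmatch(name, pattern) for pattern in MC_NAME_PATTERNS)


def infra_metrics(resources: list[dict]) -> dict:
    """Count healthy and unhealthy MC resources among the returned rows."""
    mc = [r for r in resources if is_mc_resource(r["name"])]
    # Assumption: any health value other than "healthy" counts as unhealthy.
    unhealthy = [r for r in mc if r.get("health") != "healthy"]
    return {"total": len(mc), "healthy": len(mc) - len(unhealthy),
            "unhealthy": len(unhealthy)}


def infra_verdict(unhealthy: int, recent_changes: int) -> str:
    if unhealthy > 0:
        return "FAIL"
    if recent_changes > 0:
        return "WARN"   # recent changes, but everything is healthy
    return "PASS"
```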
Report Generation
After all phases complete, produce the final report in two parts:
Part 1: Markdown Report
# Promotion Evaluation Report
**Target**: <target environment>
**Evaluated at**: <timestamp>
**Time window**: <window>
**Verdict**: **<READY|CAUTION|NOT_READY>**
## Summary
| Component | Status | Key Metrics |
|-----------|--------|-------------|
| Health Checks | <PASS/WARN/FAIL> | <health_rate>% healthy, <persistent_failures> persistent failures |
| Config Scrapers | <PASS/WARN/FAIL> | <active_count>/<total_scrapers> active, <error_count> errors |
| Jobs & Playbooks | <PASS/WARN/FAIL> | <success_rate>% success rate, <failed_runs> failures |
| Notifications | <PASS/WARN/FAIL/SKIP> | <sent_count> sent, <error_count> errors |
| System & DB | <PASS/WARN/FAIL> | DB <db_size>MB, <db_connections> connections |
| MC Infrastructure | <PASS/WARN/FAIL> | <healthy_pods>/<total_mc_pods> pods healthy |
## Findings
<For each finding, sorted by severity (critical first)>
### [severity] [component]: [message]
- **Resource**: [name] ([type], ID: [id])
- **Evidence**: [evidence]
## Recent Changes (Risk Factors)
<List recent changes to MC infrastructure that could affect stability>
## Recommendations
<Numbered list of actionable recommendations>
Part 2: Structured JSON
Output the completed JSON object conforming to the schema. Wrap in a code block with language json.
Overall Verdict Logic
Derive the top-level verdict from component statuses:
- READY: All components PASS (or SKIP for non-critical ones)
- CAUTION: Any component is WARN, but none are FAIL
- NOT_READY: Any component is FAIL
Components that must not FAIL for READY: health_checks, config_scrapers, mc_infrastructure
Components that may SKIP without affecting verdict: notifications
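The derivation above can be sketched as a small function over component statuses. The component keys and the `MAY_SKIP` set follow the names in this section. How to treat a SKIP on a component that is not allowed to skip is not specified, so returning CAUTION in that case is an assumption of this example.

```python
MAY_SKIP = {"notifications"}  # components that may SKIP without affecting the verdict


def overall_verdict(statuses: dict[str, str]) -> str:
    """Map component statuses (PASS/WARN/FAIL/SKIP) to READY/CAUTION/NOT_READY."""
    if "FAIL" in statuses.values():
        return "NOT_READY"
    if "WARN" in statuses.values():
        return "CAUTION"
    critical_skips = [name for name, status in statuses.items()
                      if status == "SKIP" and name not in MAY_SKIP]
    if critical_skips:
        return "CAUTION"  # assumption: an unevaluated critical component blocks READY
    return "READY"
```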
Error Handling
- If an MCP tool call fails or returns unexpected data, record the component as SKIP with a note
- Do not let one phase failure block subsequent phases — evaluate all phases independently
- Reuse data across phases when the same tool was already called (e.g., system view data)
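These three rules could be combined in a small runner, sketched below. It assumes each phase is a Python callable that receives a shared cache dict and returns its component record; that calling convention, the helper names, and the `note` field are hypothetical and not part of the skill.

```python
from typing import Any, Callable


def cached_call(cache: dict, key: str, fetch: Callable[[], Any]) -> Any:
    """Call fetch() once per key and reuse the result in later phases."""
    if key not in cache:
        cache[key] = fetch()
    return cache[key]


def run_phases(phases: dict[str, Callable[[dict], dict]]) -> dict:
    """Run every phase independently; a raised error becomes a SKIP entry."""
    cache: dict = {}
    components = {}
    for name, phase in phases.items():
        try:
            components[name] = phase(cache)
        except Exception as exc:  # one failing phase must not block the rest
            components[name] = {"status": "SKIP",
                                "note": f"{type(exc).__name__}: {exc}"}
    return components
```

A phase that needs the system view would wrap its tool call in `cached_call(cache, "system_view", ...)`, so Phases 2, 4, and 5 share one fetch.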