promotion-eval-mission-control

Evaluates a Mission Control environment's platform health for release or promotion readiness. Checks health check pipelines, config scrapers, background jobs, notifications, event queues, and MC infrastructure. Use for pre-release checks, environment promotion, or environment status. Triggers: "check environment health", "is it ready for release", "pre-release health check", "evaluate environment", "promotion readiness", "environment status"

Install this agent skill to your Project

npx add-skill https://github.com/flanksource/claude-code-plugin/tree/main/skills/promotion-eval-mission-control

SKILL.md

Mission Control Promotion Evaluation Skill

Core Purpose

Systematically evaluate the health of a Mission Control environment as a platform (not demo workloads) to support release and promotion readiness decisions. This skill queries the live environment using MCP tools and produces a structured diagnostic report.

Important Distinction

This evaluation focuses on Mission Control platform health — the components that make MC work (canary-checker, config-db, mission-control deployments, core health checks, scrapers, jobs). It does NOT evaluate demo workloads or user-created resources unless they indicate a platform problem.

Known expected-fail checks: Some health checks have the label Expected-Fail=true. These are intentional test checks and should be excluded from failure counts and findings.

Parameters

When invoked, check if the user specified:

time_window: Lookback period (default: 24h)
target: Environment to evaluate (ask user if not specified)

Evaluation Procedure

Execute these phases sequentially. After each phase, record component status and findings.

Initialize a running JSON result conforming to @skills/promotion-eval-mission-control/schema.json with:

```json
{
  "verdict": "READY",
  "evaluated_at": "<current ISO timestamp>",
  "time_window": "<window>",
  "target": "<target environment>",
  "components": {},
  "findings": [],
  "recommendations": []
}
```
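For illustration, the skeleton above could be built programmatically. This is a minimal sketch, not part of the skill: the helper name `init_result` is hypothetical, and it assumes the 24h default from the Parameters section and a UTC ISO 8601 timestamp.

```python
from datetime import datetime, timezone


def init_result(target: str, time_window: str = "24h") -> dict:
    """Build the empty running result (hypothetical helper).

    Field names mirror the JSON skeleton; the verdict starts as READY
    and is expected to be overwritten once all phases have run.
    """
    return {
        "verdict": "READY",
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
        "time_window": time_window,
        "target": target,
        "components": {},
        "findings": [],
        "recommendations": [],
    }
```

Each phase would then add its status and metrics under `components` and append to `findings` as it goes.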

Catalog Type Reference

These are the confirmed MissionControl catalog types:

MissionControl::ScrapeConfig — config scrapers
MissionControl::Playbook — playbook definitions
MissionControl::Notification — notification rules
MissionControl::Job — background jobs
MissionControl::Canary — canary check definitions
MissionControl::Connection — external connections
MissionControl::Topology — topology definitions

Phase 1: Health Check Pipelines

Goal: Determine if health checks are running and passing.

Get failing checks directly: view_failing-health-checks_mission-control with withRows=true and select=["id","name","type","status","severity","last_transition_time","description"]
Get total check count: list_all_checks for baseline metrics
Filter out expected failures: Exclude checks with label Expected-Fail=true from failure counts
Drill into real failures: For each genuinely unhealthy check (not expected-fail), call get_check_status(id, limit=10) to retrieve recent execution history. Classify as:
- Transient: Occasional failures mixed with passes
- Persistent: Consistently failing across recent executions
Assess staleness: From the check list, identify checks where updated_at is older than the time window

Metrics to record:

total_checks: Total number of health checks
healthy_count: Number currently healthy
unhealthy_count: Number currently unhealthy (excluding expected-fail)
expected_fail_count: Checks labeled Expected-Fail
persistent_failures: Number failing consistently
stale_count: Number not updated within time window
health_rate: Percentage healthy (excluding expected-fail from denominator)

Verdict logic:

PASS: No persistent failures, stale_count == 0, health_rate > 95%
WARN: Some transient failures OR 1-2 stale checks OR health_rate 80-95%
FAIL: Any persistent failures OR stale_count > 2 OR health_rate < 80%
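The classification and verdict rules above can be sketched as follows. The function names and input shapes are assumptions made for illustration: `statuses` stands for the recent execution history of one check (True meaning a pass), and the counts are the Phase 1 metrics. The sketch also assumes expected-fail checks are counted in `total` but never as healthy.

```python
def classify_failure(statuses: list[bool]) -> str:
    """Classify one unhealthy check from its recent executions.

    statuses: True for a passing execution, False for a failing one
    (hypothetical shape; the real tool output may differ).
    """
    if not statuses:
        return "unknown"  # no history to judge from
    # Any pass mixed in means the failure is intermittent.
    return "transient" if any(statuses) else "persistent"


def health_check_verdict(total, unhealthy, expected_fail,
                         persistent, transient, stale):
    """Apply the Phase 1 thresholds; returns (status, health_rate)."""
    denominator = total - expected_fail  # expected-fail checks are excluded
    healthy = denominator - unhealthy
    # Assumption: with no checks at all the rate is reported as 0.
    health_rate = 100.0 * healthy / denominator if denominator else 0.0
    if persistent > 0 or stale > 2 or health_rate < 80:
        return "FAIL", health_rate
    if transient > 0 or stale > 0 or health_rate <= 95:
        return "WARN", health_rate
    return "PASS", health_rate
```

For example, 100 checks with 10 transient failures and nothing stale gives a 90% health rate and a WARN status.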

Phase 2: Config Scrapers

Goal: Verify config scrapers are active and producing fresh data.

Find scraper configs: search_catalog with type=MissionControl::ScrapeConfig and select=["id","name","health","status","updated_at"]
Check freshness: For each scraper, check updated_at timestamp. Flag any not updated within expected schedule (typically 1h)
Check scraper errors: view_mission-control-system_mission-control with withPanels=true — this returns scraper error counts and a list of scrapers with errors
Review recent changes: search_catalog_changes with type=MissionControl::ScrapeConfig created_at>now-{window} for config changes

Metrics to record:

total_scrapers: Number of scraper configs found
active_count: Scrapers updated within expected window
stale_count: Scrapers not recently updated
error_count: From system view scraper errors panel

Verdict logic:

PASS: All scrapers active, no errors
WARN: 1-2 scrapers slightly stale OR minor errors
FAIL: Any scraper missing updates for > 2h OR significant errors
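The freshness part of this phase could look like the sketch below. It assumes the typical 1h schedule and the 2h hard limit named above, treats "slightly stale" as anything between the two, and maps any non-zero error count to at least WARN. Deciding whether errors are "minor" or "significant" is left to the evaluator, since the text does not quantify it. The function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta


def scraper_verdict(updated_at: list[datetime], error_count: int,
                    now: datetime,
                    expected: timedelta = timedelta(hours=1),
                    hard_limit: timedelta = timedelta(hours=2)) -> dict:
    """Classify scrapers by the age of their last update."""
    ages = [now - ts for ts in updated_at]
    missing = sum(1 for age in ages if age > hard_limit)
    slightly_stale = sum(1 for age in ages if expected < age <= hard_limit)
    if missing > 0:
        status = "FAIL"
    elif slightly_stale > 0 or error_count > 0:
        # Assumption: more than two slightly stale scrapers is still WARN;
        # the text only defines the 1-2 case.
        status = "WARN"
    else:
        status = "PASS"
    return {
        "status": status,
        "total_scrapers": len(ages),
        "active_count": len(ages) - missing - slightly_stale,
        "stale_count": missing + slightly_stale,
        "error_count": error_count,
    }
```

Passing `now` in explicitly keeps the function deterministic, which makes it easy to replay an evaluation against recorded timestamps.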

Phase 3: Background Jobs & Playbooks

Goal: Check for failed playbook runs and job errors.

Get failed job history: view_jobhistory_mission-control with withRows=true and select=["name","status","duration","error","timestamp"] limit=20
Get failed playbook runs: get_playbook_failed_runs(limit=10) for recent failures
Get recent playbook runs: get_playbook_recent_runs(limit=20) to calculate success rate
Drill into failures: For any failed playbook runs, call get_playbook_run_steps(run_id) to understand the failure cause
Check playbook catalog health: search_catalog with type=MissionControl::Playbook health=unhealthy to find unhealthy playbook definitions

Metrics to record:

total_recent_runs: Total playbook runs in window
failed_runs: Number of failed runs
success_rate: Percentage of successful runs
job_errors: Count of job errors from job history view

Verdict logic:

PASS: success_rate > 95%, no recurring job errors
WARN: success_rate 80-95% OR some job errors
FAIL: success_rate < 80% OR critical/recurring job failures
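One way to make "recurring job errors" concrete is to count how often the same job name appears in the failed job history. The sketch below does that; the status string `"failed"`, the threshold of two occurrences, and the function name are all assumptions, not values taken from Mission Control.

```python
from collections import Counter


def jobs_playbooks_verdict(run_statuses: list[str],
                           failed_job_names: list[str],
                           recurring_threshold: int = 2) -> dict:
    """Apply the Phase 3 thresholds to playbook runs and job errors."""
    total = len(run_statuses)
    failed = sum(1 for status in run_statuses if status == "failed")
    # Assumption: no runs in the window counts as a 100% success rate.
    success_rate = 100.0 * (total - failed) / total if total else 100.0
    # A job is "recurring" if it failed at least recurring_threshold times.
    recurring = sorted(name for name, count in Counter(failed_job_names).items()
                       if count >= recurring_threshold)
    if success_rate < 80 or recurring:
        status = "FAIL"
    elif success_rate <= 95 or failed_job_names:
        status = "WARN"
    else:
        status = "PASS"
    return {
        "status": status,
        "total_recent_runs": total,
        "failed_runs": failed,
        "success_rate": success_rate,
        "job_errors": len(failed_job_names),
        "recurring_jobs": recurring,
    }
```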

Phase 4: Notification Delivery

Goal: Verify the notification pipeline is functioning.

Get notification send history: view_notification-send-history_mission-control with withRows=true and select=["id","age","resource_name","resource_current_health","title","notification"] limit=20
Get notification stats from system view: The view_mission-control-system_mission-control panel (already fetched in Phase 2) includes notification counts by status (SENT, SILENCED, REPEAT-INTERVAL, etc.)
Find notification configs: search_catalog with type=MissionControl::Notification and select=["id","name","health","status"]
Check for error notifications: For each notification config, call get_notifications_for_resource(resource_id, status=error, since=now-{window})

Metrics to record:

total_notification_configs: Number of notification rules
sent_count: From system view
silenced_count: From system view
error_count: Notifications with error status
delivery_rate: sent / (sent + error) percentage

Verdict logic:

PASS: No delivery errors, system view shows sends happening
WARN: Some errors but delivery_rate >= 95%
FAIL: delivery_rate < 95% or notification system appears down
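The delivery rate formula and thresholds can be sketched as below. Two points are assumptions rather than rules from the skill: a rate of exactly 95% is treated as WARN, and an environment with no sends and no errors returns SKIP, because the counts alone cannot distinguish a quiet system from one that is down.

```python
def notification_verdict(sent: int, errors: int):
    """Return (status, delivery_rate) from the sent and error counts."""
    attempted = sent + errors
    if attempted == 0:
        # Assumption: no activity is inconclusive, so report SKIP.
        return "SKIP", None
    delivery_rate = 100.0 * sent / attempted  # sent / (sent + error)
    if errors == 0:
        return "PASS", delivery_rate
    if delivery_rate >= 95:
        return "WARN", delivery_rate
    return "FAIL", delivery_rate
```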

Phase 5: System & Event Queue

Goal: Check overall system health indicators, database, and event queue.

System overview: view_mission-control-system_mission-control with withPanels=true (reuse from Phase 2 if already fetched)
- Check scraper errors, notification stats, agent resource counts
Database health: view_mission-control-database_mission-control with withPanels=true
- Check DB size, active users, DB connections
Connection health: list_connections to verify external integrations are configured

Metrics to record:

db_size_bytes: Database size
db_connections: Active connections
active_users: User count
total_connections: Number of configured connections

Verdict logic:

PASS: DB healthy, connections configured, no concerning metrics
WARN: High DB connections or large DB size growth
FAIL: Database unreachable or critical system errors

Phase 6: MC Infrastructure Health

Goal: Verify Mission Control's own Kubernetes resources are healthy.

Get MC pods directly: view_mission-control-pods_mission-control with withRows=true and select=["name","namespace","status","health","updated"] — this returns all MC-related pods
Find MC deployments: search_catalog with type=Kubernetes::Deployment and name patterns:
- name=mission-control*
- name=canary-checker*
- name=config-db*
Use select=["id","name","health","status","updated_at"] for each.
Describe unhealthy resources: For any unhealthy MC deployment or pod, call describe_catalog(id) to get full details including error messages
Check recent changes: search_catalog_changes with type=Kubernetes::Deployment name=mission-control* created_at>now-{window} (and similar for canary-checker, config-db)
Check related configs: For unhealthy resources, use get_related_configs to trace Deployment → ReplicaSet → Pod

Metrics to record:

total_mc_pods: MC pods found
healthy_pods: Healthy MC pods
unhealthy_pods: Unhealthy MC pods
total_mc_deployments: MC deployments found
healthy_deployments: Healthy MC deployments
recent_changes: Changes to MC components in time window

Verdict logic:

PASS: All MC resources healthy, no concerning recent changes
WARN: Recent deployments/changes but all healthy, OR minor pod restarts
FAIL: Any unhealthy MC deployment or persistent pod failures
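If catalog results have to be filtered client-side, the three name patterns can be applied with shell-style matching. This is a sketch under assumptions: each deployment is a dict with the `name` and `health` fields from the select list, the health value `"healthy"` is assumed, and pod-level signals such as restarts are not modelled.

```python
from fnmatch import fnmatchcase

# Name patterns for Mission Control's own deployments, as listed above.
MC_PATTERNS = ("mission-control*", "canary-checker*", "config-db*")


def is_mc_component(name: str) -> bool:
    """True if the name matches one of the MC deployment patterns."""
    return any(fnmatchcase(name, pattern) for pattern in MC_PATTERNS)


def infra_verdict(deployments: list[dict], recent_changes: int) -> dict:
    """Apply the deployment part of the Phase 6 verdict logic."""
    mc = [d for d in deployments if is_mc_component(d["name"])]
    unhealthy = [d["name"] for d in mc if d.get("health") != "healthy"]
    if unhealthy:
        status = "FAIL"
    elif recent_changes > 0:
        status = "WARN"  # changed recently but currently healthy
    else:
        status = "PASS"
    return {
        "status": status,
        "total_mc_deployments": len(mc),
        "healthy_deployments": len(mc) - len(unhealthy),
        "unhealthy": unhealthy,
        "recent_changes": recent_changes,
    }
```

Deployments that do not match any pattern are ignored, which keeps demo workloads out of the platform verdict.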

Report Generation

After all phases complete, produce the final report in two parts:

Part 1: Markdown Report

# Promotion Evaluation Report

**Target**: <target environment>
**Evaluated at**: <timestamp>
**Time window**: <window>
**Verdict**: **<READY|CAUTION|NOT_READY>**

## Summary

| Component | Status | Key Metrics |
|-----------|--------|-------------|
| Health Checks | <PASS/WARN/FAIL> | <health_rate>% healthy, <persistent_failures> persistent failures |
| Config Scrapers | <PASS/WARN/FAIL> | <active_count>/<total_scrapers> active, <error_count> errors |
| Jobs & Playbooks | <PASS/WARN/FAIL> | <success_rate>% success rate, <failed_runs> failures |
| Notifications | <PASS/WARN/FAIL/SKIP> | <sent_count> sent, <error_count> errors |
| System & DB | <PASS/WARN/FAIL> | DB <db_size>MB, <db_connections> connections |
| MC Infrastructure | <PASS/WARN/FAIL> | <healthy_pods>/<total_mc_pods> pods healthy |

## Findings

<For each finding, sorted by severity (critical first)>
### [severity] [component]: [message]
- **Resource**: [name] ([type], ID: [id])
- **Evidence**: [evidence]

## Recent Changes (Risk Factors)

<List recent changes to MC infrastructure that could affect stability>

## Recommendations

<Numbered list of actionable recommendations>

Part 2: Structured JSON

Output the completed JSON object conforming to the schema. Wrap in a code block with language json.

Overall Verdict Logic

Derive the top-level verdict from component statuses:

READY: All components PASS (or SKIP for non-critical ones)
CAUTION: Any component is WARN, but none are FAIL
NOT_READY: Any component is FAIL

Components that must not FAIL for READY: health_checks, config_scrapers, mc_infrastructure
Components that may SKIP without affecting verdict: notifications
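The roll-up from component statuses to the top-level verdict might be sketched as follows. The component keys come from the lists above; mapping a SKIP on any component other than notifications to CAUTION is an assumption, since the text does not say what such a SKIP should produce.

```python
# Components that may be skipped without affecting the verdict.
SKIPPABLE = {"notifications"}


def overall_verdict(components: dict[str, str]) -> str:
    """Derive READY / CAUTION / NOT_READY from per-component statuses."""
    statuses = components.values()
    if "FAIL" in statuses:
        return "NOT_READY"
    if "WARN" in statuses:
        return "CAUTION"
    # Assumption: a skipped critical component cannot be vouched for.
    if any(status == "SKIP" and name not in SKIPPABLE
           for name, status in components.items()):
        return "CAUTION"
    return "READY"
```

FAIL is checked first so that a single failing component always wins over any number of warnings.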

Error Handling

If an MCP tool call fails or returns unexpected data, record the component as SKIP with a note
Do not let one phase failure block subsequent phases — evaluate all phases independently
Reuse data across phases when the same tool was already called (e.g., system view data)
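These three rules can be sketched as a small driver. Everything here is hypothetical scaffolding: each phase is modelled as a callable that receives a shared cache and returns its component result, and `cached_call` stands in for whatever mechanism actually invokes an MCP tool.

```python
from typing import Callable


def cached_call(cache: dict, tool_name: str, fetch: Callable[[], object]):
    """Invoke fetch() once per tool name and reuse the result afterwards."""
    if tool_name not in cache:
        cache[tool_name] = fetch()
    return cache[tool_name]


def run_phases(phases: dict[str, Callable[[dict], dict]]) -> dict:
    """Run every phase independently, recording SKIP when one raises."""
    cache: dict = {}
    components: dict = {}
    for name, phase in phases.items():
        try:
            components[name] = phase(cache)
        except Exception as exc:
            # One failing phase must not block the others.
            components[name] = {
                "status": "SKIP",
                "note": f"{type(exc).__name__}: {exc}",
            }
    return components
```

Because the cache is shared across phases, a system view fetched in Phase 2 would be served from memory when Phases 4 and 5 ask for it again.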

Maintainer

flanksource Core maintainer

Source details

Full Name: flanksource/claude-code-plugin
Branch: main
Path in repo: skills/promotion-eval-mission-control


Recommended Agent Skills

write-canary-transformations

troubleshooting-health-checks

write-canary-tests

troubleshooting-notifications

troubleshooting-config-item

promotion-eval-create