Agent skill

ops-monitor

Monitor deployed infrastructure health and performance: check resource status, query CloudWatch metrics (CPU, memory, requests, errors), analyze performance trends, track SLI/SLO metrics, detect anomalies, generate health reports with resource status summaries, identify degraded services, and provide performance optimization recommendations.


Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/ops-monitor-fractary-claude-plugins-98311c9c

SKILL.md

Operations Monitoring Skill

<CRITICAL_RULES> IMPORTANT: Monitoring and health check rules

  • Always check resource registry to know what resources exist
  • Query CloudWatch for actual runtime status and metrics
  • Report both healthy and unhealthy resources
  • Provide clear status summaries (healthy/degraded/unhealthy)
  • Include actionable recommendations for issues found
  • Track metrics over time to identify trends
  • Never assume health - always verify via AWS APIs </CRITICAL_RULES>

EXECUTE STEPS:

Step 1: Load Configuration and Registry

  • Read: .fractary/plugins/faber-cloud/devops.json
  • Read: .fractary/plugins/faber-cloud/deployments/${environment}/registry.json
  • Extract: List of deployed resources to monitor
  • Output: "✓ Found ${resource_count} resources to monitor"
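
Step 1 could be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the function name `load_registry`, the `root` parameter, and the top-level `"resources"` key in `registry.json` are assumptions, since the registry schema is not shown here.

```python
import json
from pathlib import Path

# Base directory taken from the paths listed in Step 1.
PLUGIN_DIR = Path(".fractary/plugins/faber-cloud")


def load_registry(environment: str, root: Path = Path(".")) -> list:
    """Return the deployed resources recorded for one environment."""
    registry_path = root / PLUGIN_DIR / "deployments" / environment / "registry.json"
    if not registry_path.exists():
        # Corresponds to the "Resource registry not found" failure condition.
        raise FileNotFoundError(f"Resource registry not found: {registry_path}")
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    # Assumption: resources live under a top-level "resources" list.
    return registry.get("resources", [])
```

The length of the returned list would supply `${resource_count}` in the Step 1 output line.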

Step 2: Determine Operation

  • If operation == "health-check":
    • Read: workflow/health-check.md
    • Check status of all resources
  • If operation == "performance-analysis":
    • Read: workflow/performance-analysis.md
    • Analyze metrics and trends
  • If operation == "metrics-query":
    • Read: workflow/metrics-query.md
    • Query specific metrics
  • Output: "✓ Operation determined: ${operation}"
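
The operation dispatch in Step 2 amounts to a lookup table. The sketch below assumes a hypothetical helper named `resolve_workflow`; only the three operation names and workflow paths come from the step itself.

```python
# Workflow file for each supported operation, as listed in Step 2.
WORKFLOWS = {
    "health-check": "workflow/health-check.md",
    "performance-analysis": "workflow/performance-analysis.md",
    "metrics-query": "workflow/metrics-query.md",
}


def resolve_workflow(operation: str) -> str:
    """Return the workflow file for an operation, rejecting unknown names."""
    try:
        return WORKFLOWS[operation]
    except KeyError:
        supported = ", ".join(sorted(WORKFLOWS))
        raise ValueError(
            f"Unknown operation {operation!r}; expected one of: {supported}"
        ) from None
```

Failing loudly on an unknown operation keeps a typo from silently skipping the monitoring run.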

Step 3: Execute Monitoring

  • For each resource in scope:
    • Query resource status via handler
    • Query CloudWatch metrics
    • Analyze current state
    • Compare against thresholds
  • Collect results for all resources
  • Output: "✓ Monitoring completed for ${resource_count} resources"

Step 4: Analyze Results

  • Read: workflow/analyze-health.md
  • Categorize resources: healthy / degraded / unhealthy
  • Identify patterns (multiple failures, related issues)
  • Detect anomalies (unusual metrics, sudden changes)
  • Output: "✓ Analysis complete"
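
The skill does not say how anomalies are detected. One simple option, shown here purely as an assumption, is a z-score test of the current value against recent history; `is_anomalous` and its default threshold of 3 standard deviations are hypothetical.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it lies more than `z_threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # too little history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

A sudden jump in invocations from roughly 100 per interval to 250 would be flagged, while ordinary jitter around 100 would not.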

Step 5: Generate Report

  • Create monitoring report with:
    • Overall health status
    • Resource-by-resource status
    • Metrics summary
    • Issues found
    • Recommendations
  • Save to: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
  • Output: "✓ Report generated: ${report_path}"
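
Saving the report to the path pattern above could look like the following sketch. The helper name `save_report`, the `root` and `now` parameters, and the timestamp format are assumptions; the skill only fixes the `${timestamp}-${operation}.json` pattern.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_report(report, environment, operation, root=Path("."), now=None):
    """Write the report as JSON and return the path it was saved to."""
    now = now or datetime.now(timezone.utc)
    # Assumption: a filesystem-safe UTC timestamp; the skill does not fix a format.
    timestamp = now.strftime("%Y-%m-%dT%H-%M-%SZ")
    report_dir = root / ".fractary/plugins/faber-cloud/monitoring" / environment
    report_dir.mkdir(parents=True, exist_ok=True)
    report_path = report_dir / f"{timestamp}-{operation}.json"
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report_path
```

The returned path is what would be substituted for `${report_path}` in the output line.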

Step 6: Check Thresholds

  • Compare metrics against configured thresholds
  • Identify threshold violations
  • Prioritize by severity
  • Output: "✓ Threshold check complete"
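
A threshold check with severity ordering might be sketched as below. The severity bands (2x the limit for HIGH, 1.25x for MEDIUM) and the names `find_violations` and `SEVERITY_ORDER` are illustrative assumptions; the skill reads its real thresholds from configuration.

```python
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def find_violations(metrics, thresholds):
    """Return threshold violations, most severe first.

    Assumes every threshold is a positive upper limit.
    """
    violations = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None or value <= limit:
            continue  # metric missing or within its limit
        ratio = value / limit
        # Assumption: severity grows with how far the value overshoots the limit.
        if ratio >= 2:
            severity = "HIGH"
        elif ratio >= 1.25:
            severity = "MEDIUM"
        else:
            severity = "LOW"
        violations.append(
            {"metric": name, "current_value": value, "threshold": limit, "severity": severity}
        )
    return sorted(violations, key=lambda v: SEVERITY_ORDER[v["severity"]])
```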

OUTPUT COMPLETION MESSAGE:

✅ COMPLETED: Operations Monitoring
Status: ${overall_health}
Resources Checked: ${total_count}
Healthy: ${healthy_count}
Degraded: ${degraded_count}
Unhealthy: ${unhealthy_count}

${issues_summary}

Report: ${report_path}
───────────────────────────────────────
${recommendations_summary}
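
The `${name}` placeholders in these messages match the syntax of Python's `string.Template`, so rendering can be sketched as below. Only the first lines of the completion message are reproduced, and the helper name `render_completion` is hypothetical.

```python
from string import Template

# First lines of the completion message, using the same ${...} placeholders.
COMPLETION_TEMPLATE = Template(
    "✅ COMPLETED: Operations Monitoring\n"
    "Status: ${overall_health}\n"
    "Resources Checked: ${total_count}\n"
    "Healthy: ${healthy_count}\n"
    "Degraded: ${degraded_count}\n"
    "Unhealthy: ${unhealthy_count}\n"
)


def render_completion(summary):
    """Fill in the template; substitute() raises KeyError if a field is missing."""
    return COMPLETION_TEMPLATE.substitute(summary)
```

Using `substitute` rather than `safe_substitute` means an incomplete summary fails loudly instead of printing a raw placeholder.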

IF ISSUES FOUND:

⚠️  COMPLETED: Operations Monitoring (Issues Found)
Status: DEGRADED
Resources Checked: ${total_count}
Unhealthy: ${unhealthy_count}

Issues:
${issue_list}

Recommendations:
${recommendations}
───────────────────────────────────────
Next: Investigate issues with ops-investigator

IF FAILURE:

❌ FAILED: Operations Monitoring
Step: ${failed_step}
Error: ${error_message}
───────────────────────────────────────
Resolution: ${resolution_steps}

<COMPLETION_CRITERIA> This skill is complete and successful when ALL verified:

1. Resources Identified

  • Resource registry loaded
  • All resources in scope identified
  • Resource types determined

2. Status Checked

  • Resource status queried from AWS
  • CloudWatch metrics collected
  • Current state determined

3. Health Analyzed

  • Resources categorized by health
  • Issues identified and prioritized
  • Patterns and anomalies detected

4. Report Generated

  • Monitoring report created
  • All findings documented
  • Recommendations provided

5. Thresholds Evaluated

  • Metrics compared to thresholds
  • Violations identified
  • Severity assessed

FAILURE CONDITIONS - Stop and report if:

  • ❌ Cannot access CloudWatch (check AWS permissions)
  • ❌ Resource registry not found (no deployments in environment)
  • ❌ CloudWatch logs/metrics not available (check resource configuration)

PARTIAL COMPLETION - Not acceptable:

  • ⚠️ Some resources not checked → Return to Step 3
  • ⚠️ Report not generated → Return to Step 5 </COMPLETION_CRITERIA>

  1. Monitoring Report

    • Location: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
    • Format: JSON with detailed findings
    • Contains: Health status, metrics, issues, recommendations
  2. Health Summary

    • Overall status: HEALTHY / DEGRADED / UNHEALTHY
    • Resource counts by status
    • Critical issues list
    • Priority recommendations
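
The health summary can be derived from per-resource statuses. The sketch below assumes a "worst status wins" rule for the overall status, which the skill does not state explicitly; `summarize` and its input shape (resource name mapped to status string) are hypothetical.

```python
from collections import Counter


def summarize(statuses):
    """Build overall status and per-status counts from {resource: status}."""
    counts = Counter(status.lower() for status in statuses.values())
    # Assumption: the overall status is the worst status of any resource.
    if counts["unhealthy"]:
        overall = "UNHEALTHY"
    elif counts["degraded"]:
        overall = "DEGRADED"
    else:
        overall = "HEALTHY"
    return {
        "overall_health": overall,
        "resources": {
            "total": len(statuses),
            "healthy": counts["healthy"],
            "degraded": counts["degraded"],
            "unhealthy": counts["unhealthy"],
        },
    }
```

Resources with any other status, such as unknown, count toward `total` only in this sketch.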

Return to agent:

```json
{
  "overall_health": "HEALTHY|DEGRADED|UNHEALTHY",
  "environment": "${environment}",
  "timestamp": "2025-10-28T...",

  "resources": {
    "total": 10,
    "healthy": 8,
    "degraded": 1,
    "unhealthy": 1
  },

  "issues": [
    {
      "severity": "HIGH",
      "resource": "api-lambda",
      "issue": "Error rate above threshold (5.2% > 1%)",
      "metric": "Errors",
      "current_value": "5.2%",
      "threshold": "1%"
    }
  ],

  "metrics_summary": {
    "api-lambda": {
      "invocations": 1250,
      "errors": 65,
      "error_rate": "5.2%",
      "duration_avg": "245ms",
      "throttles": 0
    }
  },

  "recommendations": [
    "Investigate api-lambda errors (5.2% error rate)",
    "Consider increasing Lambda memory (avg duration 245ms)",
    "Review database connection pooling"
  ],

  "report_path": ".fractary/plugins/faber-cloud/monitoring/test/2025-10-28-health-check.json"
}
```
**USE SKILL: handler-hosting-${hosting_handler}**
Operation: get-resource-status | query-metrics
Arguments: ${resource_id} ${metric_name} ${timeframe}

Reports are stored in:

  • .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
  • Historical trends in monitoring-history.json </DOCUMENTATION>

<ERROR_HANDLING> <CLOUDWATCH_ACCESS_ERROR> Pattern: AccessDenied for CloudWatch operations

Action:

  1. Check if CloudWatch permissions granted
  2. Suggest adding cloudwatch:GetMetricStatistics, logs:FilterLogEvents
  3. Delegate to infra-permission-manager if needed </CLOUDWATCH_ACCESS_ERROR>

<RESOURCE_NOT_FOUND> Pattern: Resource doesn't exist in AWS

Action:

  1. Check if resource listed in registry but deleted
  2. Warn about registry drift
  3. Suggest verifying deployment </RESOURCE_NOT_FOUND>

<METRICS_NOT_AVAILABLE> Pattern: No metrics data for resource

Action:

  1. Check if resource recently created (metrics may lag)
  2. Verify CloudWatch logging enabled
  3. Report as "status unknown" rather than failing </METRICS_NOT_AVAILABLE> </ERROR_HANDLING>
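
Routing a failure to one of the three error patterns might look like the sketch below. Matching error codes by substring is an assumption made for brevity, since real AWS error codes vary by service; `classify_error` and its parameters are hypothetical.

```python
def classify_error(error_code=None, datapoint_count=None):
    """Map a failure to one of the ERROR_HANDLING pattern names, or None."""
    code = error_code or ""
    # Assumption: substring matching on the error code is good enough here.
    if "AccessDenied" in code:
        return "CLOUDWATCH_ACCESS_ERROR"
    if "NotFound" in code:
        return "RESOURCE_NOT_FOUND"
    if datapoint_count == 0:
        # The query succeeded but returned nothing: report "status unknown".
        return "METRICS_NOT_AVAILABLE"
    return None
```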

<HEALTH_STATUS_CRITERIA> Resources are classified as:

HEALTHY:

  • Resource exists and is running
  • All metrics within thresholds
  • No errors or minimal error rate (<0.1%)
  • Performance acceptable

DEGRADED:

  • Resource exists and is running
  • Some metrics approaching thresholds (above 80% of the threshold value)
  • Elevated error rate (0.1% - 1%)
  • Performance slightly degraded

UNHEALTHY:

  • Resource doesn't exist or is stopped
  • Metrics exceed thresholds
  • High error rate (>1%)
  • Performance severely degraded
  • Resource in failed state

UNKNOWN:

  • Cannot determine status
  • Metrics not available
  • CloudWatch access issues </HEALTH_STATUS_CRITERIA>
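
The error-rate bands above translate directly into a small classifier. This sketch covers only the error-rate criterion, treats a missing rate as UNKNOWN, and assigns exactly 1% to DEGRADED; the function name `classify_by_error_rate` is hypothetical, and a full check would also consider resource state and the other thresholds.

```python
def classify_by_error_rate(error_rate_percent):
    """Classify a resource using only the error-rate bands from the criteria."""
    if error_rate_percent is None:
        return "UNKNOWN"  # metrics not available
    if error_rate_percent > 1.0:
        return "UNHEALTHY"  # high error rate (>1%)
    if error_rate_percent >= 0.1:
        return "DEGRADED"  # elevated error rate (0.1% - 1%)
    return "HEALTHY"  # minimal error rate (<0.1%)
```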

<METRICS_BY_RESOURCE_TYPE>

Lambda:

  • Invocations (count)
  • Errors (count)
  • Duration (ms)
  • Throttles (count)
  • ConcurrentExecutions (count)
  • Error rate = Errors / Invocations * 100

S3:

  • BucketSizeBytes (bytes)
  • NumberOfObjects (count)
  • 4xxErrors (count)
  • 5xxErrors (count)

RDS:

  • CPUUtilization (percent)
  • DatabaseConnections (count)
  • FreeableMemory (bytes)
  • ReadLatency (seconds)
  • WriteLatency (seconds)

ECS:

  • CPUUtilization (percent)
  • MemoryUtilization (percent)
  • RunningTaskCount (count)
  • DesiredTaskCount (count)

API Gateway:

  • Count (requests)
  • 4XXError (count)
  • 5XXError (count)
  • Latency (ms)
  • IntegrationLatency (ms) </METRICS_BY_RESOURCE_TYPE>
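
The Lambda error-rate formula (Errors / Invocations * 100) can be worked through in a few lines. The function name and the choice to return `None` when there are no invocations are assumptions; rounding to two decimals is only for readable output.

```python
def error_rate_percent(errors, invocations):
    """Error rate = Errors / Invocations * 100, or None with no invocations."""
    if invocations == 0:
        return None  # rate is undefined, not zero
    return round(errors / invocations * 100, 2)


# With 65 errors in 1250 invocations, the rate is 5.2 percent.
print(error_rate_percent(65, 1250))
```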
