Operations Monitoring Skill

<CRITICAL_RULES> IMPORTANT: Monitoring and health check rules

Always check resource registry to know what resources exist
Query CloudWatch for actual runtime status and metrics
Report both healthy and unhealthy resources
Provide clear status summaries (healthy/degraded/unhealthy)
Include actionable recommendations for issues found
Track metrics over time to identify trends
Never assume health - always verify via AWS APIs </CRITICAL_RULES>

EXECUTE STEPS:

Step 1: Load Configuration and Registry

Read: .fractary/plugins/faber-cloud/devops.json
Read: .fractary/plugins/faber-cloud/deployments/${environment}/registry.json
Extract: List of deployed resources to monitor
Output: "✓ Found ${resource_count} resources to monitor"
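
A minimal sketch of the registry load in Step 1, assuming the registry keeps its resources under a top-level "resources" list (the source does not show the registry schema, so that key and the `load_registry` name are hypothetical):

```python
import json
from pathlib import Path

# Path taken from Step 1; override base_dir when running outside the project root.
PLUGIN_DIR = Path(".fractary/plugins/faber-cloud")


def load_registry(environment: str, base_dir: Path = PLUGIN_DIR) -> list:
    """Return the deployed resources listed in the environment's registry."""
    registry_path = base_dir / "deployments" / environment / "registry.json"
    if not registry_path.is_file():
        # Corresponds to the "Resource registry not found" failure condition.
        raise FileNotFoundError(f"Resource registry not found: {registry_path}")
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    # Assumption: resources are stored under a top-level "resources" key.
    return registry.get("resources", [])
```

Raising on a missing registry, rather than returning an empty list, keeps "no deployments in environment" distinguishable from "registry exists but is empty".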

Step 2: Determine Operation

If operation == "health-check":
- Read: workflow/health-check.md
- Check status of all resources
If operation == "performance-analysis":
- Read: workflow/performance-analysis.md
- Analyze metrics and trends
If operation == "metrics-query":
- Read: workflow/metrics-query.md
- Query specific metrics
Output: "✓ Operation determined: ${operation}"
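
The branching in Step 2 can be sketched as a lookup table. The three operation names and workflow paths come from the step itself; the `workflow_for` helper and the choice to reject unknown operations with a `ValueError` are assumptions:

```python
# Operation name -> workflow file, as listed in Step 2.
WORKFLOWS = {
    "health-check": "workflow/health-check.md",
    "performance-analysis": "workflow/performance-analysis.md",
    "metrics-query": "workflow/metrics-query.md",
}


def workflow_for(operation: str) -> str:
    """Return the workflow file for an operation, rejecting unknown names."""
    try:
        return WORKFLOWS[operation]
    except KeyError:
        supported = ", ".join(sorted(WORKFLOWS))
        raise ValueError(
            f"Unsupported operation {operation!r}; expected one of: {supported}"
        ) from None
```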

Step 3: Execute Monitoring

For each resource in scope:
- Query resource status via handler
- Query CloudWatch metrics
- Analyze current state
- Compare against thresholds
Collect results for all resources
Output: "✓ Monitoring completed for ${resource_count} resources"
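
One possible shape for the Step 3 loop, assuming each registry entry carries an "id" field and that the handler calls are passed in as plain callables (`get_status` and `query_metrics` are hypothetical stand-ins for the hosting handler operations, not its real interface):

```python
from typing import Callable


def monitor_resources(
    resources: list,
    get_status: Callable[[str], str],
    query_metrics: Callable[[str], dict],
) -> list:
    """Query every resource and record failures instead of aborting the run."""
    results = []
    for resource in resources:
        resource_id = resource["id"]  # assumption: registry entries have an "id"
        entry = {"resource": resource_id, "status": None, "metrics": {}, "error": None}
        try:
            entry["status"] = get_status(resource_id)
            entry["metrics"] = query_metrics(resource_id)
        except Exception as exc:
            # Keep going so one failing resource does not leave the rest unchecked.
            entry["error"] = str(exc)
        results.append(entry)
    return results
```

Recording the error per resource means every resource in scope still appears in the results, which is what the partial-completion rule requires.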

Step 4: Analyze Results

Read: workflow/analyze-health.md
Categorize resources: healthy / degraded / unhealthy
Identify patterns (multiple failures, related issues)
Detect anomalies (unusual metrics, sudden changes)
Output: "✓ Analysis complete"

Step 5: Generate Report

Create monitoring report with:
- Overall health status
- Resource-by-resource status
- Metrics summary
- Issues found
- Recommendations
Save to: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
Output: "✓ Report generated: ${report_path}"
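
A sketch of saving the report to the path pattern above. The timestamp format is an assumption (the source only shows a date-based example file name), as are the `save_report` name and its parameters:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_report(
    report: dict,
    environment: str,
    operation: str,
    base_dir: Path = Path(".fractary/plugins/faber-cloud"),
    now: Optional[datetime] = None,
) -> Path:
    """Write the report as JSON and return the path it was saved to."""
    now = now or datetime.now(timezone.utc)
    # Assumption: a UTC, filesystem-safe timestamp (no colons in the file name).
    timestamp = now.strftime("%Y-%m-%dT%H-%M-%SZ")
    report_dir = base_dir / "monitoring" / environment
    report_dir.mkdir(parents=True, exist_ok=True)
    report_path = report_dir / f"{timestamp}-{operation}.json"
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report_path
```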

Step 6: Check Thresholds

Compare metrics against configured thresholds
Identify threshold violations
Prioritize by severity
Output: "✓ Threshold check complete"
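
The threshold check can be sketched as follows. The severity cutoffs (2x the threshold for HIGH, 1.5x for MEDIUM) are illustrative assumptions, since the source does not say how severity is assigned; the output keys mirror the issue objects in the return format:

```python
SEVERITY_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def find_violations(metrics: dict, thresholds: dict) -> list:
    """Return threshold violations, most severe first.

    metrics: {resource: {metric_name: value}}
    thresholds: {metric_name: upper_limit}
    """
    violations = []
    for resource, values in metrics.items():
        for metric, limit in thresholds.items():
            value = values.get(metric)
            if value is None or value <= limit:
                continue  # metric missing or within its threshold
            ratio = value / limit if limit else float("inf")
            # Assumption: severity scales with how far the value overshoots.
            if ratio >= 2.0:
                severity = "HIGH"
            elif ratio >= 1.5:
                severity = "MEDIUM"
            else:
                severity = "LOW"
            violations.append({
                "severity": severity,
                "resource": resource,
                "metric": metric,
                "current_value": value,
                "threshold": limit,
            })
    violations.sort(key=lambda v: SEVERITY_RANK[v["severity"]])
    return violations
```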

OUTPUT COMPLETION MESSAGE:

✅ COMPLETED: Operations Monitoring
Status: ${overall_health}
Resources Checked: ${total_count}
Healthy: ${healthy_count}
Degraded: ${degraded_count}
Unhealthy: ${unhealthy_count}

${issues_summary}

Report: ${report_path}
───────────────────────────────────────
${recommendations_summary}

IF ISSUES FOUND:

⚠️  COMPLETED: Operations Monitoring (Issues Found)
Status: ${overall_health}
Resources Checked: ${total_count}
Unhealthy: ${unhealthy_count}

Issues:
${issue_list}

Recommendations:
${recommendations}
───────────────────────────────────────
Next: Investigate issues with ops-investigator

IF FAILURE:

❌ FAILED: Operations Monitoring
Step: ${failed_step}
Error: ${error_message}
───────────────────────────────────────
Resolution: ${resolution_steps}

<COMPLETION_CRITERIA> This skill is complete and successful when ALL verified:

✅ 1. Resources Identified

Resource registry loaded
All resources in scope identified
Resource types determined

✅ 2. Status Checked

Resource status queried from AWS
CloudWatch metrics collected
Current state determined

✅ 3. Health Analyzed

Resources categorized by health
Issues identified and prioritized
Patterns and anomalies detected

✅ 4. Report Generated

Monitoring report created
All findings documented
Recommendations provided

✅ 5. Thresholds Evaluated

Metrics compared to thresholds
Violations identified
Severity assessed

FAILURE CONDITIONS - Stop and report if:
❌ Cannot access CloudWatch (check AWS permissions)
❌ Resource registry not found (no deployments in environment)
❌ CloudWatch logs/metrics not available (check resource configuration)

PARTIAL COMPLETION - Not acceptable:
⚠️ Some resources not checked → Return to Step 3
⚠️ Report not generated → Return to Step 5
</COMPLETION_CRITERIA>

Monitoring Report
- Location: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Format: JSON with detailed findings
- Contains: Health status, metrics, issues, recommendations

Health Summary
- Overall status: HEALTHY / DEGRADED / UNHEALTHY
- Resource counts by status
- Critical issues list
- Priority recommendations

Return to agent:

```json
{
  "overall_health": "HEALTHY|DEGRADED|UNHEALTHY",
  "environment": "${environment}",
  "timestamp": "2025-10-28T...",

  "resources": {
    "total": 10,
    "healthy": 8,
    "degraded": 1,
    "unhealthy": 1
  },

  "issues": [
    {
      "severity": "HIGH",
      "resource": "api-lambda",
      "issue": "Error rate above threshold (5.2% > 1%)",
      "metric": "Errors",
      "current_value": "5.2%",
      "threshold": "1%"
    }
  ],

  "metrics_summary": {
    "api-lambda": {
      "invocations": 1250,
      "errors": 65,
      "error_rate": "5.2%",
      "duration_avg": "245ms",
      "throttles": 0
    }
  },

  "recommendations": [
    "Investigate api-lambda errors (5.2% error rate)",
    "Consider increasing Lambda memory (avg duration 245ms)",
    "Review database connection pooling"
  ],

  "report_path": ".fractary/plugins/faber-cloud/monitoring/test/2025-10-28-health-check.json"
}
```
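
The "overall_health" and "resources" fields could be derived from per-resource statuses roughly as below. The roll-up rule (worst status wins, and any UNKNOWN resource prevents an overall HEALTHY) is an assumption; the source does not state how the overall status is computed:

```python
from collections import Counter


def summarize(statuses: dict) -> dict:
    """Build the overall_health and resources fields from {resource: status}."""
    counts = Counter(statuses.values())
    if counts["UNHEALTHY"]:
        overall = "UNHEALTHY"
    elif counts["DEGRADED"] or counts["UNKNOWN"]:
        # Assumption: unverified resources rule out reporting HEALTHY,
        # in line with "Never assume health".
        overall = "DEGRADED"
    else:
        overall = "HEALTHY"
    return {
        "overall_health": overall,
        "resources": {
            "total": len(statuses),
            "healthy": counts["HEALTHY"],
            "degraded": counts["DEGRADED"],
            "unhealthy": counts["UNHEALTHY"],
        },
    }
```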

**USE SKILL: handler-hosting-${hosting_handler}**
Operation: get-resource-status | query-metrics
Arguments: ${resource_id} ${metric_name} ${timeframe}

<DOCUMENTATION> Reports are stored in:

.fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
Historical trends in monitoring-history.json </DOCUMENTATION>

<ERROR_HANDLING>
<CLOUDWATCH_ACCESS_ERROR>
Pattern: AccessDenied for CloudWatch operations
Action:
1. Check if CloudWatch permissions granted
2. Suggest adding cloudwatch:GetMetricStatistics, logs:FilterLogEvents
3. Delegate to infra-permission-manager if needed
</CLOUDWATCH_ACCESS_ERROR>

<RESOURCE_NOT_FOUND>
Pattern: Resource doesn't exist in AWS
Action:
1. Check if resource listed in registry but deleted
2. Warn about registry drift
3. Suggest verifying deployment
</RESOURCE_NOT_FOUND>

<METRICS_NOT_AVAILABLE>
Pattern: No metrics data for resource
Action:
1. Check if resource recently created (metrics may lag)
2. Verify CloudWatch logging enabled
3. Report as "status unknown" rather than failing
</METRICS_NOT_AVAILABLE>
</ERROR_HANDLING>
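
A rough sketch of routing a failed query to one of the three error patterns above. The substring matching on the error code is an assumption for illustration (real AWS error codes vary by service), and `classify_failure` is a hypothetical helper:

```python
from typing import Optional


def classify_failure(error_code: Optional[str], datapoint_count: int) -> Optional[str]:
    """Map a query outcome to an error-handling pattern name, or None if fine."""
    if error_code and "AccessDenied" in error_code:
        return "CLOUDWATCH_ACCESS_ERROR"
    if error_code and "NotFound" in error_code:
        # Resource is in the registry but missing in AWS: possible registry drift.
        return "RESOURCE_NOT_FOUND"
    if error_code is None and datapoint_count == 0:
        # Query succeeded but returned no data; report "status unknown".
        return "METRICS_NOT_AVAILABLE"
    return None
```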

<HEALTH_STATUS_CRITERIA> Resources are classified as:

HEALTHY:

Resource exists and is running
All metrics within thresholds
No errors or minimal error rate (<0.1%)
Performance acceptable

DEGRADED:

Resource exists and is running
Some metrics approaching thresholds (>80%)
Elevated error rate (0.1% - 1%)
Performance slightly degraded

UNHEALTHY:

Resource doesn't exist or is stopped
Metrics exceed thresholds
High error rate (>1%)
Performance severely degraded
Resource in failed state

UNKNOWN:

Cannot determine status
Metrics not available
CloudWatch access issues </HEALTH_STATUS_CRITERIA>
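
The classification criteria above can be sketched as a single function. The error-rate bands (<0.1%, 0.1% to 1%, >1%) and the 80% threshold-usage mark come from the criteria; the parameter names and the idea of passing the worst metric as a fraction of its threshold are assumptions:

```python
from typing import Optional


def classify_health(
    exists: bool,
    running: bool,
    error_rate_pct: Optional[float],
    threshold_usage: Optional[float],
) -> str:
    """Classify one resource.

    threshold_usage is the worst metric expressed as a fraction of its
    threshold (1.0 means exactly at the threshold); None means no metrics.
    """
    if not exists or not running:
        return "UNHEALTHY"
    if error_rate_pct is None or threshold_usage is None:
        return "UNKNOWN"  # never assume health without data
    if error_rate_pct > 1.0 or threshold_usage > 1.0:
        return "UNHEALTHY"
    if error_rate_pct >= 0.1 or threshold_usage > 0.8:
        return "DEGRADED"
    return "HEALTHY"
```

For the api-lambda example in the return format, a 5.2% error rate classifies as UNHEALTHY regardless of its other metrics.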

<METRICS_BY_RESOURCE_TYPE>

Lambda:

Invocations (count)
Errors (count)
Duration (ms)
Throttles (count)
ConcurrentExecutions (count)
Error rate = Errors / Invocations * 100

S3:

BucketSizeBytes (bytes)
NumberOfObjects (count)
4xxErrors (count)
5xxErrors (count)

RDS:

CPUUtilization (percent)
DatabaseConnections (count)
FreeableMemory (bytes)
ReadLatency (seconds)
WriteLatency (seconds)

ECS:

CPUUtilization (percent)
MemoryUtilization (percent)
RunningTaskCount (count)
DesiredTaskCount (count)

API Gateway:

Count (requests)
4XXError (count)
5XXError (count)
Latency (ms)
IntegrationLatency (ms) </METRICS_BY_RESOURCE_TYPE>
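
As a worked instance of the Lambda error-rate formula (Errors / Invocations * 100), here is a small helper. Returning None for zero invocations is an assumption, chosen so that an idle function is not reported as having a 0% error rate without evidence:

```python
from typing import Optional


def lambda_error_rate(errors: int, invocations: int) -> Optional[float]:
    """Error rate in percent, or None when there were no invocations."""
    if invocations <= 0:
        return None  # rate is undefined without traffic
    return round(errors / invocations * 100, 2)


# Figures from the sample report: 65 errors over 1250 invocations.
rate = lambda_error_rate(65, 1250)
print(f"error_rate: {rate}%")
```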
