Agent skill
infrastructure-monitoring
Set up comprehensive infrastructure monitoring with Prometheus, Grafana, and alerting systems for metrics, health checks, and performance tracking.
Install this agent skill to your Project
npx add-skill https://github.com/aj-geddes/useful-ai-prompts/tree/main/skills/infrastructure-monitoring
SKILL.md
Infrastructure Monitoring
Table of Contents
- Overview
- When to Use
- Quick Start
- Reference Guides
- Best Practices
Overview
Implement comprehensive infrastructure monitoring to track system health, performance metrics, and resource utilization with alerting and visualization across your entire stack.
When to Use
- Real-time performance monitoring
- Capacity planning and trends
- Incident detection and alerting
- Service health tracking
- Resource utilization analysis
- Performance troubleshooting
- Compliance and audit trails
- Historical data analysis
Quick Start
Minimal working example:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
monitor: "infrastructure-monitor"
environment: "production"
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Rule files
rule_files:
- "alerts.yml"
- "rules.yml"
scrape_configs:
# Prometheus itself
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
// ... (see reference guides for full implementation)
Reference Guides
Detailed implementations in the references/ directory:
| Guide | Contents |
|---|---|
| Prometheus Configuration | Prometheus Configuration |
| Alert Rules | Alert Rules |
| Alertmanager Configuration | Alertmanager Configuration |
| Grafana Dashboard | Grafana Dashboard |
| Monitoring Deployment | Monitoring Deployment |
Best Practices
✅ DO
- Follow established patterns and conventions
- Write clean, maintainable code
- Add appropriate documentation
- Test thoroughly before deploying
❌ DON'T
- Skip testing or validation
- Ignore error handling
- Hard-code configuration values
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
websocket-implementation
Implement real-time bidirectional communication with WebSockets including connection management, message routing, and scaling. Use when building real-time features, chat systems, live notifications, or collaborative applications.
refactor-legacy-code
Modernize and improve legacy codebases while maintaining functionality. Use when you need to refactor old code, reduce technical debt, modernize deprecated patterns, or improve code maintainability without breaking existing behavior.
Sentiment Analysis
Classify text sentiment using NLP techniques, lexicon-based analysis, and machine learning for opinion mining, brand monitoring, and customer feedback analysis
flask-api-development
Develop lightweight Flask APIs with routing, blueprints, database integration, authentication, and request/response handling. Use when building RESTful APIs, microservices, or lightweight web services with Flask.
ML Model Explanation
Interpret machine learning models using SHAP, LIME, feature importance, partial dependence, and attention visualization for explainability
Statistical Hypothesis Testing
Conduct statistical tests including t-tests, chi-square, ANOVA, and p-value analysis for statistical significance, hypothesis validation, and A/B testing
Didn't find tool you were looking for?