prd-v08-monitoring-setup

Define monitoring strategy, metrics collection, and alerting thresholds during PRD v0.8 Deployment & Ops. Triggers on requests to set up monitoring, define alerts, or when user asks "what should we monitor?", "alerting strategy", "observability", "metrics", "SLOs", "dashboards", "monitoring setup". Outputs MON- entries with monitoring rules and alert configurations.

Install this agent skill to your Project

npx add-skill https://github.com/mattgierhart/PRD-driven-context-engineering/tree/main/.claude/skills/prd-v08-monitoring-setup

SKILL.md

Monitoring Setup

Position in workflow: v0.8 Runbook Creation → v0.8 Monitoring Setup → v0.9 GTM Strategy

Consumes

This skill requires prior work from v0.8 Runbook Creation and earlier stages:

RUN-* runbook entries (from v0.8 Runbook Creation) — Incident response runbooks define alerting scenarios; critical alerts must link to RUN- procedures
DEP-* deployment entries (from v0.8 Release Planning) — DEP- rollback thresholds and post-deploy validation steps inform MON- alert conditions and SLO targets
API-* endpoint contracts (from v0.6 Technical Specification) — Define baseline latency, throughput, and error rates for application-layer metrics
KPI-* metrics (from v0.3 Outcome Definition and v0.9 Launch Metrics) — Business metrics (signups, conversions, retention) inform dashboard design and business layer monitoring
ARC-* architecture decisions (from v0.6 Architecture Design) — System structure determines which components to monitor (monolith has different metrics than distributed services)
TECH-* technology stack (from v0.5 Technical Stack Selection) — Technology choices (database, cloud provider, APM tools) determine available metrics and monitoring tools

This skill assumes DEP- and RUN- entries are complete with thresholds, rollback conditions, and incident procedures defined.

Produces

This skill creates/updates:

MON-* entries (monitoring specifications, metric/alert/dashboard/SLO types) — Concrete monitoring rules with thresholds, alert conditions, dashboards, SLO definitions, linked to RUN- procedures
Alert routing configuration — Mapping of MON- alerts to notification channels and teams; links alerts to RUN- incident procedures
Observability baseline — Metrics gathered from staging/production, establishing normal operating ranges for alert thresholds

All MON- entries are operational monitoring specifications, not confidence-based. They are:

Measurable (every metric has a source, unit, and aggregation method)
Actionable (every alert has a RUN- procedure; no orphaned alerts)
Thresholded (critical/warning severity with specific numeric conditions)
Dashboarded (MON- dashboard entries provide visibility to operators and stakeholders)
SLO-backed (SLO entries tie monitoring to product commitments)

Example MON- entries:

MON-001: API Request Latency (p95)
Type: Metric
Layer: Application
Owner: Backend Team

Name: api.request.latency.p95
Description: 95th percentile response time for all API endpoints (from API-001–020)
Unit: ms
Source: Application APM (Datadog custom instrumentation)
Aggregation: p95 over 5-minute window
Retention: 90 days

Linked IDs: API-001 to API-020, DEP-004 (baseline from staging)

---

MON-002: High Latency Alert (Warning)
Type: Alert
Layer: Application
Owner: Backend Team

Metric: MON-001 (api.request.latency.p95)
Condition: >500ms (from DEP-002 baseline)
Window: 5 minutes
Severity: Warning
Runbook: RUN-001 (Performance Degradation Investigation)

Notification:
  - Channel: Slack #backend-alerts
  - Recipients: Backend on-call, team notified during business hours

Silencing: During scheduled maintenance windows (DEP-004 notifications)

Linked IDs: MON-001, RUN-001, DEP-002

---

MON-003: Critical Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team

Metric: MON-001 (api.request.latency.p95)
Condition: >2000ms (SLA breach, from KPI-001 target)
Window: 2 minutes
Severity: Critical
Runbook: RUN-001 (Performance Degradation Investigation)

Notification:
  - Channel: PagerDuty (wake on-call)
  - Recipients: Backend on-call, Tech Lead, escalate if not acknowledged in 5 min

Silencing: None (critical alerts never silenced)

Linked IDs: MON-001, RUN-001, KPI-001

---

MON-004: API Availability SLO
Type: SLO
Layer: Application
Owner: Platform Team

Objective: API endpoints return non-5xx response
Target: 99.9% uptime (from DEP-002 / KPI-001)
Window: Rolling 30 days
Error Budget: 43.2 minutes/month

Alerting:
  - 50% error budget consumed → Warning to engineering (slow-burn alert)
  - 75% error budget consumed → Critical, freeze non-essential deploys
  - 100% error budget consumed → Post-incident review required (RUN-008 procedure)

Linked IDs: API-001–020, DEP-003 (rollback triggers), RUN-008 (incident review)

---

MON-005: System Health Dashboard
Type: Dashboard
Layer: Infrastructure + Application
Owner: Platform Team

Purpose: Quick health check for on-call engineers (opened from RUN-001 and RUN-002)
Audience: On-call engineers, engineering leadership, ops team
Panels:
  - API Request Rate (last 1h): Should be steady or increasing
  - API Latency (p50, p95, p99): Watch for p95/p99 creeping up
  - Error Rate by Endpoint: Any nonzero 5xx rate is concerning
  - Active Critical Alerts: Should be none
  - Database Connection Pool (from MON-006): Watch for utilization trending toward the threshold
  - CPU/Memory by Service: Identify resource exhaustion
  - Deployment Status: Current version, time of last deploy
Refresh: 30 seconds

Linked IDs: MON-001, MON-002, MON-003, MON-006, DEP-001, RUN-001/002

---

MON-006: Database Connection Pool Utilization
Type: Metric
Layer: Infrastructure
Owner: Database Team

Name: db.connection_pool.utilized_percent
Description: Percentage of available connections in use (from DEP-001 pool size)
Unit: percentage
Source: Database monitoring (RDS Enhanced Monitoring or custom query)
Aggregation: avg over 1-minute window
Retention: 30 days

Linked IDs: DEP-001 (pool config), RUN-001 (incident when >90%)
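
The error budget and alerting tiers in MON-004 can be checked with a little arithmetic. The sketch below is illustrative only. It assumes a time-based SLO (minutes of downtime inside the rolling 30-day window), and the function names `error_budget_minutes` and `budget_status` are hypothetical, not part of this skill.

```python
WINDOW_MINUTES = 30 * 24 * 60  # rolling 30-day window, as in MON-004


def error_budget_minutes(target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Allowed downtime in minutes for an availability target such as 0.999."""
    return (1.0 - target) * window_minutes


def budget_status(downtime_minutes: float, target: float = 0.999) -> str:
    """Map consumed error budget to the MON-004 alerting tiers."""
    consumed = downtime_minutes / error_budget_minutes(target)
    if consumed >= 1.00:
        return "review"    # 100% consumed: post-incident review required
    if consumed >= 0.75:
        return "critical"  # 75% consumed: freeze non-essential deploys
    if consumed >= 0.50:
        return "warning"   # 50% consumed: warn engineering
    return "ok"


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # 43.2
    print(budget_status(33))                      # critical
```

A request-based SLO (share of non-5xx responses) would count failed requests rather than minutes, but the tiering logic stays the same.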

Core Concept: Monitoring as Early Warning

Monitoring is not about collecting data—it is about detecting problems before users do. Every metric should answer: "Is this working? If not, what's broken?"

Monitoring Layers

Layer	What to Measure	Why It Matters
Infrastructure	CPU, memory, disk, network	System health foundation
Application	Latency, errors, throughput	User-facing performance
Business	Signups, conversions, revenue	Product health
User Experience	Page load, interaction time	Real user impact

Execution

1. Define SLOs (Service Level Objectives)
   - What uptime do we promise?
   - What latency is acceptable?
   - What error rate is tolerable?
2. Identify key metrics per layer
   - Infrastructure: Resource utilization
   - Application: RED metrics (Rate, Errors, Duration)
   - Business: KPI- from v0.3 and v0.9
   - User: Core Web Vitals, journey completion
3. Set alert thresholds
   - Warning: Investigate soon
   - Critical: Act immediately
   - Base on SLOs and historical data
4. Map alerts to runbooks
   - Every critical alert → RUN- procedure
   - No alert without action path
5. Design dashboards
   - Overview: System health at a glance
   - Deep-dive: Per-service details
   - Business: KPI tracking
6. Create MON- entries with full traceability

MON- Output Template

MON-XXX: [Monitoring Rule Title]
Type: [Metric | Alert | Dashboard | SLO]
Layer: [Infrastructure | Application | Business | User Experience]
Owner: [Team responsible for this metric/alert]

For Metric Type:
  Name: [metric.name.format]
  Description: [What this measures]
  Unit: [count | ms | percentage | bytes]
  Source: [Where this comes from]
  Aggregation: [avg | sum | p50 | p95 | p99]
  Retention: [How long to keep data]

For Alert Type:
  Metric: [MON-YYY or metric name]
  Condition: [Threshold expression]
  Window: [Time window for evaluation]
  Severity: [Critical | Warning | Info]
  Runbook: [RUN-XXX to follow when fired]
  Notification:
    - Channel: [Slack, PagerDuty, Email]
    - Recipients: [Team or individuals]
  Silencing: [When to suppress, e.g., maintenance windows]

For Dashboard Type:
  Purpose: [What questions this answers]
  Audience: [Who uses this dashboard]
  Panels: [List of visualizations]
  Refresh: [How often to update]

For SLO Type:
  Objective: [What we promise]
  Target: [Percentage, e.g., 99.9%]
  Window: [Rolling 30 days]
  Error Budget: [How much downtime allowed]
  Alerting: [When error budget is at risk]

Linked IDs: [API-XXX, UJ-XXX, KPI-XXX, RUN-XXX related]
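
A template like this is easy to lint mechanically. The following is a minimal sketch, not a tool shipped with this skill. It assumes entries are written flat, as `Key: Value` lines the way the examples are, and the names `parse_entry`, `missing_fields`, and `REQUIRED_FIELDS` are hypothetical.

```python
# Fields taken from the MON- template; the checker itself is an assumption.
COMMON_FIELDS = ["Type", "Layer", "Owner", "Linked IDs"]
REQUIRED_FIELDS = {
    "Metric": ["Name", "Description", "Unit", "Source", "Aggregation", "Retention"],
    "Alert": ["Metric", "Condition", "Window", "Severity", "Runbook",
              "Notification", "Silencing"],
    "Dashboard": ["Purpose", "Audience", "Panels", "Refresh"],
    "SLO": ["Objective", "Target", "Window", "Error Budget", "Alerting"],
}

SAMPLE_ENTRY = """\
MON-001: API Request Latency (p95)
Type: Metric
Layer: Application
Owner: Backend Team

Name: api.request.latency.p95
Description: 95th percentile response time for all API endpoints
Unit: ms
Source: Application APM (Datadog/New Relic)
Aggregation: p95
Retention: 90 days

Linked IDs: API-001 to API-020
"""


def parse_entry(text: str) -> dict[str, str]:
    """Parse one flat MON- entry into a dict of top-level fields."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    entry_id, _, title = lines[0].partition(":")
    fields = {"ID": entry_id.strip(), "Title": title.strip()}
    for line in lines[1:]:
        # Indented or bulleted lines are nested items (e.g. Notification channels).
        if line[0].isspace() or line.startswith("-"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(fields: dict[str, str]) -> list[str]:
    """Return template fields that the entry does not define."""
    required = COMMON_FIELDS + REQUIRED_FIELDS.get(fields.get("Type", ""), [])
    return [name for name in required if name not in fields]


if __name__ == "__main__":
    print(missing_fields(parse_entry(SAMPLE_ENTRY)))  # []
```

The check only confirms that a field is present, not that its value is sensible. A threshold such as `> 500ms` still needs human review against the baseline.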

Example MON- entries:

MON-001: API Request Latency (p95)
Type: Metric
Layer: Application
Owner: Backend Team

Name: api.request.latency.p95
Description: 95th percentile response time for all API endpoints
Unit: ms
Source: Application APM (Datadog/New Relic)
Aggregation: p95
Retention: 90 days

Linked IDs: API-001 to API-020

MON-002: High Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team

Metric: MON-001 (api.request.latency.p95)
Condition: > 500ms
Window: 5 minutes
Severity: Warning
Runbook: RUN-006 (Performance Degradation Investigation)

Notification:
  - Channel: Slack #backend-alerts
  - Recipients: Backend on-call

Silencing: During scheduled deployments (DEP-002 windows)

Linked IDs: MON-001, RUN-006, DEP-002

MON-003: Critical Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team

Metric: MON-001 (api.request.latency.p95)
Condition: > 2000ms
Window: 2 minutes
Severity: Critical
Runbook: RUN-006 (Performance Degradation Investigation)

Notification:
  - Channel: PagerDuty
  - Recipients: Backend on-call, Tech Lead

Silencing: None (always alert on critical)

Linked IDs: MON-001, RUN-006

MON-004: API Availability SLO
Type: SLO
Layer: Application
Owner: Platform Team

Objective: API endpoints return non-5xx response
Target: 99.9%
Window: Rolling 30 days
Error Budget: 43.2 minutes/month

Alerting:
  - 50% budget consumed → Warning to engineering
  - 75% budget consumed → Critical, freeze non-essential deploys
  - 100% budget consumed → Incident review required

Linked IDs: API-001 to API-020, DEP-003

MON-005: System Health Dashboard
Type: Dashboard
Layer: Infrastructure + Application
Owner: Platform Team

Purpose: Quick health check for on-call engineers
Audience: On-call, engineering leadership
Panels:
  - API Request Rate (last 1h)
  - API Latency (p50, p95, p99)
  - Error Rate by Endpoint
  - Active Alerts
  - Database Connection Pool
  - CPU/Memory by Service
Refresh: 30 seconds

Linked IDs: MON-001, MON-002, MON-003

The RED Method (Application Monitoring)

For each service, measure:

Metric	What It Measures	Alert Threshold
Rate	Requests per second	Anomaly detection
Errors	Failed requests / total	>1% warning, >5% critical
Duration	Request latency (p95, p99)	>500ms warning, >2s critical
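
As a rough illustration, the Errors and Duration thresholds in the table can be applied to raw request data as follows. This is a sketch under stated assumptions: the nearest-rank percentile is one common estimator (APM tools may use others), and the helper names `p95`, `error_severity`, and `duration_severity` are hypothetical.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def error_severity(failed: int, total: int) -> str:
    """Errors: >1% warning, >5% critical."""
    rate = failed / total if total else 0.0
    if rate > 0.05:
        return "critical"
    if rate > 0.01:
        return "warning"
    return "ok"


def duration_severity(p95_ms: float) -> str:
    """Duration: >500ms warning, >2s critical."""
    if p95_ms > 2000:
        return "critical"
    if p95_ms > 500:
        return "warning"
    return "ok"


if __name__ == "__main__":
    latencies = [120.0] * 90 + [800.0] * 10  # 10% of requests are slow
    print(duration_severity(p95(latencies)))  # warning
    print(error_severity(failed=6, total=100))  # critical
```

Rate has no fixed threshold in the table (it relies on anomaly detection), so it is left out of the sketch.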

The USE Method (Infrastructure Monitoring)

For each resource (CPU, memory, disk, network):

Metric	What It Measures	Alert Threshold
Utilization	% of capacity used	>80% warning, >95% critical
Saturation	Queue depth, waiting	>0 for critical resources
Errors	Error count/rate	Any errors = investigate
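
The USE thresholds can be expressed the same way. The snippet below is a hedged sketch rather than a prescribed implementation: `utilization_severity` and `use_check` are hypothetical names, and how queue depth and error counts are collected depends on the monitoring stack.

```python
def utilization_severity(percent: float) -> str:
    """Utilization: >80% warning, >95% critical."""
    if percent > 95:
        return "critical"
    if percent > 80:
        return "warning"
    return "ok"


def use_check(utilization_percent: float, queue_depth: int, error_count: int,
              critical_resource: bool = True) -> dict[str, str]:
    """Evaluate one resource against the USE table."""
    return {
        "utilization": utilization_severity(utilization_percent),
        # Saturation: any queueing on a critical resource is flagged.
        "saturation": "investigate" if critical_resource and queue_depth > 0 else "ok",
        # Errors: any errors at all warrant investigation.
        "errors": "investigate" if error_count > 0 else "ok",
    }


if __name__ == "__main__":
    print(use_check(utilization_percent=85, queue_depth=0, error_count=0))
    # {'utilization': 'warning', 'saturation': 'ok', 'errors': 'ok'}
```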

SLO Framework

Tier	Availability	Latency (p95)	Use For
Tier 1	99.99% (52.6 min/yr)	<100ms	Payment, auth
Tier 2	99.9% (8.76 hr/yr)	<500ms	Core features
Tier 3	99% (3.65 days/yr)	<2s	Background jobs
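
The downtime figures in the Availability column follow directly from the target percentage. As a worked example (assuming a 365-day year; the function name `downtime_hours_per_year` is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for an availability target."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, target in [("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.0)]:
        hours = downtime_hours_per_year(target)
        print(f"{tier}: {target}% allows {hours:.2f} h/yr ({hours * 60:.1f} min)")
```

This gives roughly 52.6 minutes for 99.99%, 8.76 hours for 99.9%, and 3.65 days for 99%.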

Alert Severity Matrix

Severity	User Impact	Response Time	Notification
Critical	Service unusable	<5 min	PagerDuty (wake up)
Warning	Degraded experience	<30 min	Slack (business hours)
Info	No immediate impact	Next day	Dashboard/log
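
One way to keep routing consistent with this matrix is to encode it as a lookup table. The sketch below is an assumption about how that might look, not the skill's actual routing configuration. `Route`, `SEVERITY_ROUTES`, and `route_alert` are hypothetical names, and no real PagerDuty or Slack API is called.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    channel: str
    respond_within_minutes: Optional[int]  # None: no minute-level deadline (next day)
    wake_on_call: bool


# Values mirror the Alert Severity Matrix.
SEVERITY_ROUTES = {
    "Critical": Route("PagerDuty", 5, True),
    "Warning": Route("Slack", 30, False),
    "Info": Route("Dashboard/log", None, False),
}


def route_alert(severity: str) -> Route:
    """Look up the notification route for a severity label."""
    try:
        return SEVERITY_ROUTES[severity.capitalize()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


if __name__ == "__main__":
    print(route_alert("critical").channel)  # PagerDuty
```

Rejecting unknown severities outright, rather than falling back to a default channel, keeps a mistyped label from silently downgrading a page.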

Dashboard Design Principles

Principle	Implementation
Answer questions	Each panel answers "Is X working?"
Hierarchy	Overview → Service → Component
Context	Show thresholds, comparisons
Actionable	Link to runbooks from alerts
Fast	Quick load, auto-refresh

Anti-Patterns

Pattern	Signal	Fix
Alert fatigue	Too many alerts, team ignores	Tune thresholds, remove noise
No runbook link	Alert fires, no one knows what to do	Every alert → RUN-
Vanity metrics	"1 million requests!" without context	Focus on user-impacting metrics
Missing baselines	No historical comparison	Establish baselines before launch
Over-monitoring	500 metrics, can't find signal	Focus on RED/USE fundamentals
Under-monitoring	"We'll add monitoring later"	Monitoring ships with code

Quality Gates

Before proceeding to v0.9 GTM Strategy:

SLOs defined for critical services (MON- SLO type)
RED metrics configured for application layer
USE metrics configured for infrastructure layer
Critical alerts linked to RUN- procedures
Overview dashboard created for on-call
Alert notification channels configured
Baseline metrics established from staging
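
The runbook-link gate is the easiest one to automate. A minimal sketch, assuming each MON- entry is available as a dict with `ID`, `Type`, `Severity`, and `Runbook` keys (that shape and the name `orphaned_alerts` are hypothetical):

```python
import re

RUN_ID = re.compile(r"\bRUN-\d+\b")


def orphaned_alerts(entries: list[dict[str, str]], critical_only: bool = True) -> list[str]:
    """Return IDs of alert entries that do not reference a RUN- procedure."""
    orphans = []
    for entry in entries:
        if entry.get("Type") != "Alert":
            continue
        if critical_only and entry.get("Severity") != "Critical":
            continue
        if not RUN_ID.search(entry.get("Runbook", "")):
            orphans.append(entry["ID"])
    return orphans


if __name__ == "__main__":
    entries = [
        {"ID": "MON-002", "Type": "Alert", "Severity": "Warning", "Runbook": ""},
        {"ID": "MON-003", "Type": "Alert", "Severity": "Critical",
         "Runbook": "RUN-006 (Performance Degradation Investigation)"},
        {"ID": "MON-007", "Type": "Alert", "Severity": "Critical", "Runbook": "TBD"},
    ]
    print(orphaned_alerts(entries))                       # ['MON-007']
    print(orphaned_alerts(entries, critical_only=False))  # ['MON-002', 'MON-007']
```

With `critical_only=False` the check enforces the stricter "no orphaned alerts" rule for every severity.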

Downstream Connections

Consumer	What It Uses	Example
On-Call Team	MON- alerts trigger response	MON-003 → page engineer
v0.9 Launch Metrics	MON- provides baseline data	MON-001 baseline → KPI-010 target
Post-Mortems	MON- data for incident analysis	"MON-005 showed spike at 14:32"
Capacity Planning	MON- trends inform scaling	USE metrics → infrastructure planning
DEP- Rollback	MON- thresholds trigger rollback	MON-002 breach → DEP-003 rollback

Detailed References

Monitoring stack examples: See references/monitoring-stack.md
MON- entry template: See assets/mon-template.md
SLO calculation guide: See references/slo-guide.md
Dashboard best practices: See references/dashboard-guide.md

Maintainer

mattgierhart Core maintainer

Source details

Full Name: mattgierhart/PRD-driven-context-engineering
Branch: main
Path in repo: .claude/skills/prd-v08-monitoring-setup
License: MIT License
Topics: claude workflow developer-tools productivity nextjs ai-assisted product-development rapid-development supabase templates
