Cost Observability and Monitoring

Techniques for gaining visibility into cloud spending, attributing costs to business units, and detecting financial anomalies.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/cost-observability

Overview

Cost observability is the practice of extending traditional system observability (logs, metrics, traces) to include financial data. It lets engineering teams answer not just "Is the system healthy?" but also "Is the system cost-effective?"

Core Principle: "Total spend is a vanity metric; cost per unit of work is a performance metric."

1. Key Cost Metrics to Track

The goal is to move from Macro visibility (the bill) to Micro visibility (the request).

Metric	Level	Purpose
Total Monthly Spend	Executive	General budget health.
Cost per Service	Engineering	Identify inefficient microservices.
Cost per Customer (Unit Cost)	Product	Calculate per-account profitability.
Cost per Request	Engineering	Measure efficiency of application code.
COGS (Cost of Goods Sold)	Financial	The base cost to deliver the service.
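
To illustrate how the per-unit metrics in the table derive from the bill, here is a minimal sketch that divides a service's monthly cost by its request count and ranks services by the result. The `ServiceCost` shape and the sample figures are hypothetical and do not come from any billing API.

```typescript
// Hypothetical shape: one row per service for a billing period.
interface ServiceCost {
  service: string;
  monthlyCostUsd: number;
  requests: number;
}

// Cost per request; returns 0 when the service handled no requests.
function costPerRequest(row: ServiceCost): number {
  return row.requests === 0 ? 0 : row.monthlyCostUsd / row.requests;
}

// Rank services from most to least expensive per request.
function rankByUnitCost(rows: ServiceCost[]): ServiceCost[] {
  return [...rows].sort((a, b) => costPerRequest(b) - costPerRequest(a));
}

// Made-up sample data for illustration only.
const sample: ServiceCost[] = [
  { service: 'auth-api', monthlyCostUsd: 1200, requests: 60_000_000 },
  { service: 'image-processor', monthlyCostUsd: 4500, requests: 3_000_000 },
];
```

In this sample, `image-processor` ranks first despite lower traffic, which is the kind of signal total spend alone would hide.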

2. Cost Attribution and Tagging Strategy

Attribution is impossible without consistent metadata.

The Standard Tagging Schema

Every resource should have the following "FinOps Tags":

Environment: (e.g., prod, staging, dev)
Service: (e.g., auth-api, image-processor)
Owner: (e.g., team-alpha)
Project: (e.g., project-phoenix)
TenantID: (If using siloed resources per customer)
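
A tagging schema is easier to keep consistent when it can be checked mechanically. The sketch below reports which schema tags are absent or blank on a resource. The `TaggedResource` shape is a hypothetical stand-in for whatever your inventory export provides.

```typescript
// Tag keys from the schema; TenantID is left out because it only applies
// to siloed per-customer resources.
const REQUIRED_TAGS = ['Environment', 'Service', 'Owner', 'Project'] as const;

// Hypothetical resource shape: an identifier plus its tag map.
interface TaggedResource {
  id: string;
  tags: Record<string, string>;
}

// Return the required tag keys that are absent or blank on a resource.
function missingTags(resource: TaggedResource): string[] {
  return REQUIRED_TAGS.filter((key) => !(resource.tags[key] ?? '').trim());
}
```

A resource with an empty result is compliant; anything else can feed an alert or a report for the owning team.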

Enforcement Policy (Terraform/OpenTofu)

hcl

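# Declare the environment so the tag value is validated at plan time
variable "environment" {
  type = string
  validation {
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "environment must be one of: prod, staging, dev."
  }
}

# Define the mandatory tags once in a local and reuse them on every resource
locals {
  mandatory_tags = {
    Environment = var.environment
    Service     = "payment-gateway"
    Owner       = "finance-team"
    Project     = "project-phoenix"
    CostCenter  = "9921"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-12345" # placeholder AMI ID
  instance_type = "t3.medium"
  tags          = local.mandatory_tags
}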

3. Cost Anomaly Detection

A financial anomaly is an unexpected deviation from historical spend patterns.

Types of Anomalies

Sudden Spikes: A developer spins up a massive GPU instance and forgets to delete it.
Gradual Drift: A memory leak causes auto-scaling to add a new server every day.
Cyclical Variation: Spend increases during weekends when it should be lower.
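
One simple way to catch the sudden-spike case is a z-score test against trailing daily spend. The sketch below is an illustration only. The threshold of 3 standard deviations is an assumed default, and managed services such as AWS Cost Anomaly Detection use their own models.

```typescript
// Mean of a non-empty numeric array.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation.
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

// Flag today's spend as a spike when it sits more than `threshold`
// standard deviations above the trailing history.
function isSpike(history: number[], today: number, threshold = 3): boolean {
  if (history.length < 2) return false; // not enough data to judge
  const sd = stdDev(history);
  if (sd === 0) return today > mean(history); // flat history: any increase stands out
  return (today - mean(history)) / sd > threshold;
}
```

A check like this targets spikes. Gradual drift and cyclical patterns usually need a longer baseline or a same-weekday comparison instead.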

Anomaly Alert Example (Slack/PagerDuty)

Alert: "AWS Spend Spike Detected"
Metric: S3 Egress
Deviation: +450% over the last 24 hours.
Likely Cause: Possible data exfiltration or misconfigured backup script.
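
The deviation figure in an alert like this is a percentage change against a baseline. As a rough sketch, with hypothetical function names and no real Slack or PagerDuty API involved, it could be computed and rendered as follows.

```typescript
// Percentage change of current spend relative to a baseline.
function deviationPercent(baselineUsd: number, currentUsd: number): number {
  if (baselineUsd <= 0) throw new Error('baseline must be positive');
  return ((currentUsd - baselineUsd) / baselineUsd) * 100;
}

// Render a one-line alert message in the style of the example alert.
function formatAlert(metric: string, baselineUsd: number, currentUsd: number): string {
  const pct = Math.round(deviationPercent(baselineUsd, currentUsd));
  const sign = pct >= 0 ? '+' : '';
  return `AWS Spend Spike Detected | Metric: ${metric} | Deviation: ${sign}${pct}% over the last 24 hours`;
}
```

For example, a baseline of 200 USD and a current value of 1100 USD yields a deviation of +450%.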

4. Application-Level Cost Tracking

Sometimes cloud tags aren't granular enough (e.g., when multiple customers share one database).
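
For a shared resource such as a multi-tenant database, one common approach is to split its cost in proportion to a usage measure. The sketch below assumes you already collect some per-tenant usage number (query seconds, rows stored, or similar). The function name and input shape are hypothetical.

```typescript
// Split a shared resource's cost across tenants in proportion to usage.
function allocateSharedCost(
  totalCostUsd: number,
  usageByTenant: Record<string, number>,
): Record<string, number> {
  const totalUsage = Object.values(usageByTenant).reduce((sum, u) => sum + u, 0);
  const result: Record<string, number> = {};
  for (const [tenant, usage] of Object.entries(usageByTenant)) {
    // With no recorded usage at all, nothing can be attributed.
    result[tenant] = totalUsage === 0 ? 0 : (totalCostUsd * usage) / totalUsage;
  }
  return result;
}
```

The choice of usage measure matters more than the arithmetic: a tenant with heavy queries but little storage looks very different under each.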

OpenTelemetry for Cost

You can inject "cost attributes" into your traces to calculate the price of a specific API endpoint.

typescript

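// Example: Tracking LLM cost in a trace
import { trace } from '@opentelemetry/api';

// Illustrative per-token prices in USD; substitute your provider's actual rates
const INPUT_PRICE_USD = 0.00001;
const OUTPUT_PRICE_USD = 0.00003;

const span = trace.getTracer('llm-tracer').startSpan('generate_text');
try {
  // ... perform LLM call and read the token counts from its response
  const inputTokens = 1200; // placeholder value
  const outputTokens = 350; // placeholder value
  const cost = inputTokens * INPUT_PRICE_USD + outputTokens * OUTPUT_PRICE_USD;

  span.setAttribute('app.cost.usd', cost);
  span.setAttribute('app.tokens.input', inputTokens);
  span.setAttribute('app.tokens.output', outputTokens);
} finally {
  span.end(); // always end the span, even if the call throws
}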
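
Once spans carry a cost attribute, the price of an endpoint is the sum of that attribute over its spans. The sketch below does this over plain records. `SpanRecord` is a hypothetical shape for spans already exported to your own store, not an OpenTelemetry type.

```typescript
// Hypothetical record for a finished span, as exported to your own store.
interface SpanRecord {
  name: string;
  attributes: Record<string, number | string>;
}

// Sum the `app.cost.usd` attribute per span name.
function costBySpanName(spans: SpanRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const span of spans) {
    const cost = span.attributes['app.cost.usd'];
    if (typeof cost !== 'number') continue; // skip spans without a cost attribute
    totals.set(span.name, (totals.get(span.name) ?? 0) + cost);
  }
  return totals;
}
```

In practice this aggregation would more likely run as a query in the tracing backend; the code only shows the shape of the calculation.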

5. Dashboard Templates

Engineering Dashboard (Grafana)

Top 5 Costliest Microservices (Bar chart)
Idle Resource Count (Single stat)
Compute Efficiency (CPU utilization vs. Cost)
Data Egress by Region (Pie chart)

Product/Executive Dashboard

Revenue vs. Infrastructure Cost (Area chart)
Margin per Feature (Heatmap)
Cost per Daily Active User (DAU) (Line chart)

6. Tools Ecosystem

Native Cloud Tools

AWS Cost Explorer: Best for monthly trends and filtered views.
AWS Cost Anomaly Detection: Uses ML to flag unusual spend automatically.
GCP Recommender: Suggests specific sizing changes to save money.

Specialized Tools

CloudHealth / Cloudability: Enterprise-grade cost allocation and multi-cloud reporting.
Kubecost: A widely used cost tool for Kubernetes. It models costs based on pod resource requests.
Infracost: A CLI tool that runs in CI/CD to tell you how much a Pull Request will cost before it's merged.

7. Chargeback vs. Showback

How do you hold teams accountable?

Model	Description	Pros	Cons
Showback	Reporting costs to teams without actually billing their budgets.	Low friction, creates awareness.	No "teeth"; teams can ignore.
Chargeback	Directly deducting cloud costs from a department's real budget.	Forces accountability, drives optimization.	High administrative overhead.
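
Both models start from the same report: spend grouped by owning team. A minimal sketch of that grouping, assuming billing line items carry the `Owner` tag from the tagging schema (the `CostLineItem` shape is hypothetical):

```typescript
// Hypothetical billing line item with its resource tags.
interface CostLineItem {
  costUsd: number;
  tags: Record<string, string>;
}

// Total spend per Owner tag; items without an owner land in "untagged".
function showbackByOwner(items: CostLineItem[]): Record<string, number> {
  const report: Record<string, number> = {};
  for (const item of items) {
    const owner = item.tags['Owner'] || 'untagged';
    report[owner] = (report[owner] ?? 0) + item.costUsd;
  }
  return report;
}
```

The size of the `untagged` bucket is itself a useful number, since it measures how much spend nobody can be held accountable for yet.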

8. Cost Forecasting

Forecasting helps avoid end-of-quarter budget surprises.

Linear Projection: NextMonth = ThisMonth * GrowthRate, where GrowthRate is the recent month-over-month ratio (e.g., 1.05 for 5% growth).
Seasonality-Aware: Accounting for peak periods like Black Friday or holiday sales.
Scenario Planning: "If we double our user base, what happens to our NAT Gateway costs?"
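
The projection approach can be sketched in a few lines. This version treats the growth rate as a month-over-month ratio estimated from past bills, which is an assumption about the formula rather than something the source specifies. All function names are hypothetical.

```typescript
// Project next month's spend from this month's spend and a growth factor
// (1.05 means 5% month-over-month growth).
function projectNextMonth(thisMonthUsd: number, growthFactor: number): number {
  return thisMonthUsd * growthFactor;
}

// Growth factor estimated as the average month-over-month ratio.
// Assumes every month's spend is positive.
function averageGrowthFactor(monthlySpend: number[]): number {
  let prev: number | undefined;
  let sum = 0;
  let count = 0;
  for (const spend of monthlySpend) {
    if (prev !== undefined) {
      sum += spend / prev;
      count += 1;
    }
    prev = spend;
  }
  return count === 0 ? 1 : sum / count; // no trend available: assume flat
}

// Absolute forecast error as a percentage of the actual bill.
function forecastErrorPercent(forecastUsd: number, actualUsd: number): number {
  return (Math.abs(forecastUsd - actualUsd) / actualUsd) * 100;
}
```

Comparing each forecast against the bill that eventually arrives, using something like `forecastErrorPercent`, shows whether the projection is good enough or needs a seasonal adjustment.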

9. Common Optimization Targets

S3 Storage Class Analysis: Finding buckets that could move to Infrequent Access.
Database Query Analysis: Finding a single query that causes high CPU/IOPS across thousands of DB connections.
Zombie Snapshots: Deleting EBS snapshots older than 90 days.
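
The snapshot cleanup reduces to an age filter. A minimal sketch, assuming snapshot metadata has already been fetched into a list (the `Snapshot` shape is hypothetical; no cloud SDK calls are made here):

```typescript
// Hypothetical snapshot record; in practice this comes from your cloud inventory.
interface Snapshot {
  id: string;
  createdAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the snapshots created more than `maxAgeDays` before `now`.
function olderThan(snapshots: Snapshot[], maxAgeDays: number, now: Date): Snapshot[] {
  const cutoff = now.getTime() - maxAgeDays * DAY_MS;
  return snapshots.filter((s) => s.createdAt.getTime() < cutoff);
}
```

Passing `now` explicitly keeps the function deterministic. Review the resulting list before deleting anything, since some old snapshots are retained on purpose for compliance or recovery.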

10. Implementation Checklist

Tagging Enforcement: Do resources without tags trigger an alert or auto-deletion?
Accountability: Does every team have a dashboard showing its spend?
Thresholds: Are there daily spending alerts set at 20% above "normal"?
Unit Economics: Do we know the infrastructure cost of a single user transaction?
Forecasting: Are we predicting next month's bill with < 10% error?
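
The daily threshold item can be checked with very little code. The sketch below takes "normal" to mean the average of recent daily spend, which is one possible definition and an assumption here. The function names are hypothetical.

```typescript
// Average of the trailing daily spend values, used here as "normal".
function normalDailySpend(history: number[]): number {
  if (history.length === 0) throw new Error('history must not be empty');
  return history.reduce((sum, x) => sum + x, 0) / history.length;
}

// True when today's spend is more than `tolerance` above normal (0.2 = 20%).
function exceedsDailyThreshold(history: number[], todayUsd: number, tolerance = 0.2): boolean {
  return todayUsd > normalDailySpend(history) * (1 + tolerance);
}
```

A median or a same-weekday average may be a better baseline when spend is bursty or follows a weekly cycle.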

Related Skills

42-cost-engineering/cloud-cost-models
42-cost-engineering/budget-guardrails
40-system-resilience/chaos-engineering (using chaos to test cost stability)

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/cost-observability
License: MIT License
