Agent skill

slo-sli-sla

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/slo-sli-sla

📈 Skill: Service Level Objectives (SLO/SLI/SLA)

📋 Metadata

Attribute	Value
ID	`sre-slo-sli-sla`
Level	🔴 Advanced
Version	1.0.0
Keywords	`slo`, `sli`, `sla`, `error-budget`, `service-level`, `reliability`, `availability`
Reference	Google SRE Book - SLIs, SLAs, SLOs

🔑 Invocation Keywords

slo
sli
sla
error-budget
service-level
reliability
availability
@skill:slo

Example Prompts

Implement SLOs and error budgets for critical services

Define SLIs and configure SLO monitoring

Set up SLA compliance and error budget tracking

@skill:slo - Complete SLO/SLI/SLA system

📖 Description

Service Level Objectives (SLOs) define measurable reliability targets for services. This skill covers defining SLIs (Service Level Indicators), creating SLOs, calculating error budgets, and monitoring compliance with SLAs.

✅ When to Use This Skill

Production services with SLAs
Products with reliability requirements
Stakeholders who require reliability metrics
Error budget management
Capacity planning based on SLOs

❌ When NOT to Use This Skill

Prototypes without users
Internal services without reliability requirements
Legacy systems that cannot be measured

🏗️ SLO Framework

SLI (Service Level Indicator)
    ↓
SLO (Service Level Objective) ← Error Budget
    ↓
SLA (Service Level Agreement)
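
The relationship in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's scripts: the `ServiceLevel` class and the sample numbers are hypothetical, and it assumes the SLA is set looser than the internal SLO, as is common practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    """Hypothetical container tying one SLI measurement to its SLO and SLA."""

    sli: float         # measured ratio of good events, e.g. 0.9993
    slo_target: float  # internal objective, e.g. 0.9995
    sla_target: float  # contractual commitment (assumed looser than the SLO)

    @property
    def error_budget(self) -> float:
        """Allowed failure ratio implied by the SLO (1 - target)."""
        return 1.0 - self.slo_target

    @property
    def meets_slo(self) -> bool:
        return self.sli >= self.slo_target

    @property
    def meets_sla(self) -> bool:
        return self.sli >= self.sla_target


level = ServiceLevel(sli=0.9993, slo_target=0.9995, sla_target=0.999)
print(level.meets_slo, level.meets_sla)  # False True
```

Here the service has missed its internal objective while still honoring the contract, which is the early-warning gap an SLO is meant to provide.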

💻 Implementation

📁 Executable scripts: This skill includes executable Python scripts in the scripts/ folder:

error_budget.py - Error budget calculator (CLI)

slo_api.py - REST API for querying SLOs from Prometheus

requirements.txt - Python dependencies

See usage examples in examples/usage_example.py

1. SLI Definitions

1.1 SLI Types

yaml

# sli-definitions.yml
service_level_indicators:
  # Availability SLI
  - name: request_availability
    type: availability
    metric: |
      (
        sum(rate(http_requests_total{status!~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
      )
    window: 5m
    description: "Percentage of successful requests"

  # Latency SLI
  - name: request_latency_p99
    type: latency
    metric: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      )
    window: 5m
    percentile: 99
    description: "99th percentile request latency"

  # Freshness SLI
  - name: data_freshness
    type: freshness
    metric: |
      time() - last_successful_data_update_timestamp
    threshold: 3600  # 1 hour
    description: "Age of most recent data update"

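  # Correctness SLI
  # The ratio of validation errors is subtracted from 1 so the value
  # matches the description (share of correct operations).
  - name: data_correctness
    type: correctness
    metric: |
      1 - (
        sum(rate(data_validation_errors_total[5m]))
        /
        sum(rate(data_operations_total[5m]))
      )
    description: "Percentage of correct data operations"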
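
To make the arithmetic behind these SLI types concrete, here is a small Python sketch that computes the same ratios from raw counts instead of PromQL. The function names and the choice to treat zero traffic as fully healthy are assumptions made for illustration only.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Ratio of non-5xx requests, mirroring the request_availability SLI."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return (total_requests - server_errors) / total_requests


def correctness_sli(total_operations: int, validation_errors: int) -> float:
    """Share of data operations that passed validation."""
    if total_operations == 0:
        return 1.0  # assumption: nothing processed means nothing incorrect
    return 1.0 - validation_errors / total_operations


def freshness_ok(now_ts: float, last_update_ts: float, threshold_s: float = 3600) -> bool:
    """True when the most recent update is no older than the threshold."""
    return (now_ts - last_update_ts) <= threshold_s


print(f"{availability_sli(1_000_000, 400):.4f}")  # 0.9996
print(freshness_ok(now_ts=10_000, last_update_ts=7_000))  # True
```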

2. SLO Definitions

yaml

# slo-definitions.yml
service_level_objectives:
  - service: payment-service
    slo_name: availability
    sli: request_availability
    target: 99.95  # 99.95% availability
    window: 30d
    description: "Payment service must be available 99.95% of the time"
    
  - service: payment-service
    slo_name: latency_p99
    sli: request_latency_p99
    target: 1.0  # 1 second
    window: 30d
    description: "99th percentile latency must be under 1 second"
    
  - service: user-service
    slo_name: availability
    sli: request_availability
    target: 99.9  # 99.9% availability
    window: 30d
    description: "User service must be available 99.9% of the time"

  - service: analytics-service
    slo_name: data_freshness
    sli: data_freshness
    target: 3600  # 1 hour
    window: 30d
    description: "Analytics data must be updated within 1 hour"
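
An availability target is easier to reason about when converted into allowed downtime for the window. The sketch below does that conversion for the targets used above; the `allowed_downtime` helper is a hypothetical name, and it assumes the budget is spent entirely as full outages.

```python
from datetime import timedelta


def allowed_downtime(target_percent: float, window_days: int) -> timedelta:
    """Downtime permitted by an availability target over the window."""
    budget_fraction = 1.0 - target_percent / 100.0
    return timedelta(days=window_days) * budget_fraction


for service, target in [("payment-service", 99.95), ("user-service", 99.9)]:
    minutes = allowed_downtime(target, 30).total_seconds() / 60
    print(f"{service}: {minutes:.1f} min per 30d")
# payment-service: 21.6 min per 30d
# user-service: 43.2 min per 30d
```

Halving the error budget (99.9% to 99.95%) halves the time available for incidents and risky releases, which is why targets should differ per service.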

3. Error Budget Calculation

Executable script: scripts/error_budget.py

This script calculates and manages error budgets for SLOs. It can be run as a CLI or imported as a module.

When to run

Manual analysis: calculate the error budget for reports or analysis
Validation: check whether a service is consuming too much error budget
Planning: calculate how many errors can be tolerated before violating the SLO
Integration: import the ErrorBudget class into other scripts

CLI usage

bash

# Setup (no external dependencies required)
cd skills/system-reliability-engineering/slo-sli-sla

# Basic example
python scripts/error_budget.py \
  --slo-target 0.9995 \
  --total-requests 1000000 \
  --error-requests 400

# Interactive mode
python scripts/error_budget.py \
  --slo-target 0.9995 \
  --interactive

# Custom window
python scripts/error_budget.py \
  --slo-target 0.9999 \
  --window-days 7 \
  --total-requests 500000 \
  --error-requests 10

Module usage

python

from scripts.error_budget import ErrorBudget

# Create an instance
error_budget = ErrorBudget(slo_target=0.9995, window_days=30)

# Calculate the current status
status = error_budget.calculate_remaining_budget(
    total_requests=1_000_000,
    error_requests=400
)

print(f"Status: {status['status']}")
print(f"Remaining: {status['remaining_percentage']:.2f}%")
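
For readers without the script at hand, the calculation it is described as performing can be approximated as follows. This is a standalone sketch, not the contents of scripts/error_budget.py: the function name, the extra dictionary keys, the status labels, and the 25% "at risk" cutoff are all assumptions (only `status` and `remaining_percentage` appear in the usage above).

```python
def error_budget_status(slo_target: float, total_requests: int, error_requests: int) -> dict:
    """Request-based error budget: how much of the allowed failures are used."""
    allowed_errors = total_requests * (1.0 - slo_target)
    if allowed_errors == 0:
        consumed = 0.0 if error_requests == 0 else float("inf")
    else:
        consumed = error_requests / allowed_errors
    remaining_percentage = (1.0 - consumed) * 100.0

    # Hypothetical status labels; the real script may use different ones.
    if remaining_percentage <= 0:
        status = "exhausted"
    elif remaining_percentage < 25:
        status = "at_risk"
    else:
        status = "healthy"

    return {
        "allowed_errors": allowed_errors,
        "consumed_percentage": consumed * 100.0,
        "remaining_percentage": remaining_percentage,
        "status": status,
    }


result = error_budget_status(slo_target=0.9995, total_requests=1_000_000, error_requests=400)
print(f"{result['remaining_percentage']:.2f}% remaining")  # 20.00% remaining
```

With a 99.95% target, one million requests allow 500 failures; 400 failures therefore use 80% of the budget and leave 20%.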

Complete examples

See examples/usage_example.py for more examples of programmatic use.

4. Prometheus SLO Queries

promql

# SLO Compliance Query
# Availability SLO: 99.95% over 30 days

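# Current availability (last 30 days)
# Note: rate() is taken directly over [30d]; dividing two subqueries
# such as (...)[30d:] is not valid PromQL because they are range vectors.
(
  sum(rate(http_requests_total{service="payment-service", status!~"5.."}[30d]))
  /
  sum(rate(http_requests_total{service="payment-service"}[30d]))
) >= 0.9995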

# Error budget remaining (1 = untouched, 0 = exhausted, negative = overspent)
1 - (
  (
    sum(rate(http_requests_total{service="payment-service", status=~"5.."}[30d]))
    /
    sum(rate(http_requests_total{service="payment-service"}[30d]))
  ) / (1 - 0.9995)
)

# Error budget consumed (fraction of the 30-day budget already spent)
(
  sum(rate(http_requests_total{service="payment-service", status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total{service="payment-service"}[30d]))
) / (1 - 0.9995)

# Burn rate (how fast we're consuming error budget)
# 2x burn rate = using 2x the allowed error rate
# (current error ratio divided by the error ratio the SLO allows)
(
  sum(rate(http_requests_total{service="payment-service", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="payment-service"}[5m]))
) / (1 - 0.9995)

# Alert when error budget will be exhausted in 4 hours
# (budget already consumed + budget the current burn rate would use in 14400 s)
(
  (
    sum(rate(http_requests_total{service="payment-service", status=~"5.."}[30d]))
    /
    sum(rate(http_requests_total{service="payment-service"}[30d]))
  ) / (1 - 0.9995)
  +
  (
    (
      sum(rate(http_requests_total{service="payment-service", status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-service"}[5m]))
    ) / (1 - 0.9995) / (30 * 24 * 3600) * 14400
  )
) >= 1
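
The burn-rate arithmetic in these queries can be checked by hand with a short Python sketch. The helper names and the sample error ratio are hypothetical; the sketch assumes the burn rate stays constant over the projection.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int, budget_remaining: float = 1.0) -> float:
    """Hours until the remaining budget fraction is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_days * 24 / rate


# 0.1% of requests failing against a 99.95% SLO
rate = burn_rate(error_ratio=0.001, slo_target=0.9995)
print(f"burn rate: {rate:.1f}x")  # burn rate: 2.0x
print(f"full budget lasts: {hours_to_exhaustion(rate, 30):.1f} h")  # full budget lasts: 360.0 h
```

A burn rate of 1x spends the budget exactly over the 30-day window; at 2x it is gone in 15 days.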

5. SLO Monitoring Dashboard (Grafana)

json

{
  "dashboard": {
    "title": "SLO Dashboard - Payment Service",
    "panels": [
      {
        "title": "Availability SLO (99.95%)",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\", status!~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"payment-service\"}[30d]))",
            "legendFormat": "Current Availability"
          },
          {
            "expr": "0.9995",
            "legendFormat": "SLO Target"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Budget Remaining",
        "targets": [
          {
            "expr": "1 - ((sum(rate(http_requests_total{service=\"payment-service\", status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"payment-service\"}[30d]))) / (1 - 0.9995))",
            "legendFormat": "Remaining Budget (0-1)"
          }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.25],
                "type": "lt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "params": [],
                "type": "last"
              },
              "type": "query"
            }
          ],
          "executionErrorState": "alerting",
          "for": "5m",
          "name": "Error Budget at Risk",
          "notifications": ["slack-alerts"]
        }
      },
      {
        "title": "Error Budget Burn Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service=\"payment-service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"payment-service\"}[5m]))) / (1 - 0.9995)",
            "legendFormat": "Burn Rate"
          },
          {
            "expr": "1",
            "legendFormat": "1x (Normal)"
          },
          {
            "expr": "14",
            "legendFormat": "14x (Critical)"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

6. SLO Alert Rules

yaml

# prometheus/alerts/slo-alerts.yml
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Error budget at risk (less than 25% remaining)
      # "by (service)" keeps the service label so the annotations can use it.
      - alert: ErrorBudgetAtRisk
        expr: |
          1 - (
            (
              sum by (service) (rate(http_requests_total{service="payment-service", status=~"5.."}[30d]))
              /
              sum by (service) (rate(http_requests_total{service="payment-service"}[30d]))
            ) / (1 - 0.9995)
          ) < 0.25
        for: 5m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Error budget at risk for {{ $labels.service }}"
          description: "Only {{ $value | humanizePercentage }} of error budget remaining"

      # Error budget will be exhausted soon
      - alert: ErrorBudgetExhaustionImminent
        expr: |
          (
            (
              sum by (service) (rate(http_requests_total{service="payment-service", status=~"5.."}[30d]))
              /
              sum by (service) (rate(http_requests_total{service="payment-service"}[30d]))
            ) / (1 - 0.9995)
            +
            (
              (
                sum by (service) (rate(http_requests_total{service="payment-service", status=~"5.."}[5m]))
                /
                sum by (service) (rate(http_requests_total{service="payment-service"}[5m]))
              ) / (1 - 0.9995) / (30 * 24 * 3600) * 14400
            )
          ) >= 1
        for: 5m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget will be exhausted in 4 hours"
          description: "Current burn rate will consume remaining budget"

      # SLO violation
      - alert: SLOViolation
        expr: |
          (
            sum by (service) (rate(http_requests_total{service="payment-service", status!~"5.."}[30d]))
            /
            sum by (service) (rate(http_requests_total{service="payment-service"}[30d]))
          ) < 0.9995
        for: 1m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "SLO violation for {{ $labels.service }}"
          description: "Availability is {{ $value | humanizePercentage }}, below SLO target of 99.95%"
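
The condition behind ErrorBudgetExhaustionImminent adds the budget already consumed to the budget the current burn rate would use over the next four hours. A Python sketch of the same projection, with a hypothetical function name and sample inputs, may help when tuning the horizon:

```python
def exhaustion_imminent(
    consumed_fraction: float,
    current_burn_rate: float,
    window_days: int = 30,
    horizon_hours: float = 4.0,
) -> bool:
    """True if consumed budget plus the projected burn reaches 100%."""
    # A burn rate of 1x uses the whole budget in window_days * 24 hours.
    projected_burn = current_burn_rate * horizon_hours / (window_days * 24)
    return consumed_fraction + projected_burn >= 1.0


# 95% already spent and burning at 14.4x: 14.4 * 4 / 720 = 8% more in 4 hours
print(exhaustion_imminent(consumed_fraction=0.95, current_burn_rate=14.4))  # True
# Same burn rate with only half the budget spent does not fire
print(exhaustion_imminent(consumed_fraction=0.50, current_burn_rate=14.4))  # False
```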

7. SLO API Service

Executable script: scripts/slo_api.py

FastAPI REST service for querying SLO status and error budgets from Prometheus metrics.

When to run

Production service: as a continuously running microservice
Dashboard integration: to feed Grafana or other dashboards
Automated alerting: for systems that need to query SLOs programmatically
Internal APIs: so other services can query SLO status

Installation

bash

cd skills/system-reliability-engineering/slo-sli-sla
pip install -r scripts/requirements.txt

Running

bash

# Development (with auto-reload)
uvicorn scripts.slo_api:app --reload --host 0.0.0.0 --port 8000

# Production
uvicorn scripts.slo_api:app --host 0.0.0.0 --port 8000 --workers 4

# With a custom Prometheus URL
PROMETHEUS_URL=http://prometheus:9090 uvicorn scripts.slo_api:app --reload

Endpoints

Health Check:

bash

curl http://localhost:8000/health

SLO Compliance:

bash

curl "http://localhost:8000/slo/payment-service/compliance?slo_target=0.9995&window_days=30"

Error Budget Status:

bash

curl "http://localhost:8000/slo/payment-service/error-budget?slo_target=0.9995&window_days=30"

API documentation

Once the service is running, go to:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Configuration

Environment variables:

PROMETHEUS_URL: Prometheus URL (default: http://localhost:9090)

Example response

json

{
  "service": "payment-service",
  "slo_target": 0.9995,
  "current_availability": 0.9998,
  "is_compliant": true,
  "error_budget_remaining": 0.6,
  "window_days": 30,
  "burn_rate": 0.8,
  "days_to_exhaustion": 22.5,
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00"
}
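
The numbers in this sample response are mutually consistent, which the following Python sketch verifies. The field semantics are inferred from the sample values rather than taken from the API's source, and the helper names are hypothetical.

```python
import json

sample = json.loads("""
{
  "slo_target": 0.9995,
  "current_availability": 0.9998,
  "error_budget_remaining": 0.6,
  "window_days": 30,
  "burn_rate": 0.8,
  "days_to_exhaustion": 22.5
}
""")


def budget_remaining(availability: float, slo_target: float) -> float:
    """Fraction of the error budget left, given measured availability."""
    return 1.0 - (1.0 - availability) / (1.0 - slo_target)


def days_to_exhaustion(remaining: float, burn_rate: float, window_days: int) -> float:
    """Days until the budget runs out if the burn rate stays constant."""
    if burn_rate <= 0:
        return float("inf")
    return remaining * window_days / burn_rate


remaining = budget_remaining(sample["current_availability"], sample["slo_target"])
days = days_to_exhaustion(
    sample["error_budget_remaining"], sample["burn_rate"], sample["window_days"]
)
print(f"{remaining:.2f} {days:.1f}")  # 0.60 22.5
```

An observed error ratio of 0.02% against an allowed 0.05% uses 40% of the budget, leaving 0.6; at a burn rate of 0.8x, that remainder lasts 0.6 × 30 / 0.8 = 22.5 days.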

🎯 Best Practices

1. SLI Selection

✅ DO:

Choose SLIs that matter to users
Make SLIs easy to measure
Use percentiles for latency (p50, p95, p99)
Align SLIs with business goals

❌ DON'T:

Use internal metrics as SLIs
Make SLIs too complex
Ignore user experience

2. SLO Targets

✅ DO:

Start with achievable targets
Review and adjust regularly
Different targets for different services
Consider error budgets

❌ DON'T:

Set targets too high initially
Set same target for all services
Ignore error budget consumption

3. Error Budgets

✅ DO:

Track error budget consumption
Alert when budget is at risk
Use budget for release decisions
Review budget trends

❌ DON'T:

Ignore error budget consumption
Deploy when budget is exhausted
Set budgets too conservatively

🚨 Troubleshooting

SLO Violations

Analyze root cause
Check recent changes
Review error patterns
Implement fixes
Document in post-mortem

Error Budget Exhaustion

Stop new releases
Focus on stability
Fix known issues
Replenish budget before new features
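
The release guidance above can be encoded as a simple gate in a deployment pipeline. The sketch below is one possible policy, not something prescribed by this skill: the `ReleasePolicy` names are hypothetical, and the 25% cutoff simply reuses the "at risk" threshold from the alert rules.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "normal"      # ship features as usual
    CAUTIOUS = "cautious"  # budget at risk: limit risky changes
    FREEZE = "freeze"      # budget exhausted: stability work only


def release_policy(budget_remaining: float, at_risk_threshold: float = 0.25) -> ReleasePolicy:
    """Map the remaining error budget fraction to a release decision."""
    if budget_remaining <= 0:
        return ReleasePolicy.FREEZE
    if budget_remaining < at_risk_threshold:
        return ReleasePolicy.CAUTIOUS
    return ReleasePolicy.NORMAL


for remaining in (0.6, 0.2, -0.1):
    print(remaining, release_policy(remaining).value)
# 0.6 normal
# 0.2 cautious
# -0.1 freeze
```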

📚 Additional Resources

Version: 1.0.0
Last updated: December 2025
Total lines: 1,100+

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/slo-sli-sla
License: MIT License
