reliability-strategy-builder

Implements reliability patterns including circuit breakers, retries, fallbacks, bulkheads, and SLO definitions. Provides failure mode analysis and incident response plans. Use for "SRE", "reliability", "resilience", or "failure handling".

Install this agent skill in your project

npx add-skill https://github.com/patricio0312rev/skills/tree/main/architecture/reliability-strategy-builder

SKILL.md

Reliability Strategy Builder

Build resilient systems with proper failure handling and SLOs.

Reliability Patterns

1. Circuit Breaker

Prevent cascading failures by stopping requests to failing services.

typescript

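class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: number; // epoch milliseconds of the latest failure

  constructor(
    private readonly failureThreshold = 5, // consecutive failures before opening
    private readonly resetTimeoutMs = 60_000 // wait before a trial request (1 minute)
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        // Reset timeout elapsed: let a trial request through.
        // Note: concurrent callers are also let through while half-open.
        this.state = "half-open";
      } else {
        // Fail fast without calling the failing service
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    // Any success closes the circuit and clears the failure streak
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();

    // A failed trial request reopens the circuit immediately
    if (this.state === "half-open" || this.failureCount >= this.failureThreshold) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (this.lastFailureTime === undefined) return false;
    const elapsed = Date.now() - this.lastFailureTime;
    return elapsed > this.resetTimeoutMs;
  }
}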

2. Retry with Backoff

Handle transient failures with exponential backoff.

typescript

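// Resolve after `ms` milliseconds
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000 // milliseconds
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of retries: surface the last error to the caller
      if (attempt === maxRetries) throw error;

      // Exponential backoff: 1s, 2s, 4s with the defaults
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  // Only reached if maxRetries is negative
  throw new Error("Max retries exceeded");
}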
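
A common refinement is to add random jitter, so that many clients retrying at the same moment do not hit the service in lockstep, and to retry only errors that look transient. The following is a minimal sketch of that idea, not part of the original skill: `retryWithJitter`, `jitteredDelay`, `RetryOptions`, and the `isRetryable` predicate are hypothetical names, and the "full jitter" strategy (a random delay between zero and the exponential cap) is one of several reasonable choices.

```typescript
// Hypothetical options object; every field is an assumption for illustration.
interface RetryOptions {
  maxRetries?: number; // retries after the first attempt
  baseDelayMs?: number; // exponential cap for the first retry
  maxDelayMs?: number; // upper bound on any single delay
  isRetryable?: (error: unknown) => boolean; // decide which errors are transient
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
  random?: () => number; // injectable source of randomness in [0, 1)
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Full jitter: a random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function jitteredDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random
): number {
  const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.floor(random() * cap);
}

async function retryWithJitter<T>(
  operation: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const {
    maxRetries = 3,
    baseDelayMs = 1000,
    maxDelayMs = 30_000,
    isRetryable = () => true,
    sleep = defaultSleep,
    random = Math.random,
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up when retries are exhausted or the error is not transient
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      await sleep(jitteredDelay(attempt, baseDelayMs, maxDelayMs, random));
    }
  }
}
```

Capping the delay with `maxDelayMs` keeps a long retry sequence from waiting for minutes, and the `isRetryable` hook lets callers avoid retrying errors such as validation failures that will never succeed.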

3. Fallback Pattern

Provide degraded functionality when the primary dependency fails.

typescript

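// Assumes `primaryDb`, `cache`, `logger`, and the `User` type are provided by the application
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try primary database
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");

    // Fallback to cache; a cache failure must not skip the final fallback
    try {
      const cached = await cache.get(`user:${userId}`);
      if (cached) return cached;
    } catch (cacheError) {
      logger.warn("Cache lookup failed, returning placeholder user");
    }

    // Final fallback: return minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}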

4. Bulkhead Pattern

Isolate failures to prevent resource exhaustion.

typescript

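// Caps concurrent operations per priority class. Despite the name, no threads
// are involved. Assumes a `Semaphore` with acquire()/release() is available.
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools for different operations
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute<T>(priority: string, operation: () => Promise<T>): Promise<T> {
    const pool = this.pools.get(priority);
    if (!pool) throw new Error(`Unknown priority: ${priority}`);

    // Wait for a free slot in this pool only; other pools are unaffected
    await pool.acquire();

    try {
      return await operation();
    } finally {
      // Always free the slot, even when the operation throws
      pool.release();
    }
  }
}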
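
The bulkhead above relies on a `Semaphore` type that the skill does not define. One possible implementation is sketched below, assuming only the `acquire()` and `release()` methods used above. It is a minimal counting semaphore for a single JavaScript event loop, not a drop-in for any particular library.

```typescript
// Minimal counting semaphore (illustrative sketch, FIFO waiters).
class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    if (!Number.isInteger(permits) || permits <= 0) {
      throw new RangeError("permits must be a positive integer");
    }
    this.available = permits;
  }

  // Resolves once a permit is held by the caller.
  acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return Promise.resolve();
    }
    // No permit free: queue the caller until release() hands one over
    return new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Returns a permit, passing it straight to the oldest waiter if any.
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.available++;
    }
  }
}
```

Handing the permit directly to the next waiter, rather than incrementing the counter first, keeps the number of running operations from ever exceeding the configured limit.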

SLO Definitions

SLO Template

yaml

service: user-api
slos:
  - name: Availability
    description: Requests are served without a server error (5xx)
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
    window: 30 days

  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
    window: 7 days

  - name: Error Rate
    description: Fewer than 1% of requests result in errors (401, 403, and 404 count as expected responses)
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
    window: 24 hours
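
To make the `ratio` and `percentile` measurement types concrete, the sketch below evaluates the availability and latency targets from the template against a batch of request samples. The `RequestSample` shape and all function names are assumptions for illustration. A real system would usually compute these in its metrics backend rather than in application code.

```typescript
// Hypothetical per-request record; field names are illustrative.
interface RequestSample {
  statusCode: number;
  durationMs: number;
}

// Ratio SLI: share of requests that did not end in a server error (5xx).
function availabilityRatio(samples: RequestSample[]): number {
  if (samples.length === 0) throw new RangeError("no samples");
  const good = samples.filter((s) => s.statusCode < 500).length;
  return good / samples.length;
}

// Nearest-rank percentile: the smallest observed value such that at least
// p percent of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Checks the template's availability (99.9%) and latency (p95 <= 500 ms) targets.
function meetsTemplateSlos(samples: RequestSample[]): {
  availability: boolean;
  latency: boolean;
} {
  return {
    availability: availabilityRatio(samples) >= 0.999,
    latency: percentile(samples.map((s) => s.durationMs), 95) <= 500,
  };
}
```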

Error Budget

Error Budget = 100% - SLO

Example:
SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes of downtime allowed per 30-day window (0.001 × 43,200 minutes)
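
The arithmetic above can be expressed as a small helper. This is a sketch with hypothetical function names. It converts an SLO target into allowed downtime, and reports how much of a request-based budget is left.

```typescript
// Allowed downtime, in minutes, for a time-based SLO over a window of days.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  if (sloPercent <= 0 || sloPercent >= 100) {
    throw new RangeError("sloPercent must be between 0 and 100 (exclusive)");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - sloPercent)) / 100;
}

// Fraction of a request-based error budget still unspent.
// 1 means untouched, 0 means exhausted, negative means the SLO is violated.
function budgetRemaining(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number
): number {
  if (sloPercent <= 0 || sloPercent >= 100) {
    throw new RangeError("sloPercent must be between 0 and 100 (exclusive)");
  }
  if (totalRequests <= 0) throw new RangeError("totalRequests must be positive");
  const allowedFailures = (totalRequests * (100 - sloPercent)) / 100;
  return 1 - failedRequests / allowedFailures;
}
```

For example, `errorBudgetMinutes(99.9, 30)` evaluates to approximately 43.2 (subject to floating-point rounding), which matches the figure above.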

Failure Mode Analysis

markdown

| Component   | Failure Mode | Impact | Probability | Detection               | Mitigation                     |
| ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
| Database    | Unresponsive | HIGH   | Medium      | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload     | HIGH   | Low         | Request queue depth     | Rate limiting, auto-scaling    |
| Cache       | Eviction     | MEDIUM | High        | Cache hit rate          | Fallback to DB, larger cache   |
| Queue       | Backed up    | LOW    | Medium      | Queue depth metric      | Add workers, DLQ               |
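
One way to decide which rows to address first is to rank them by a simple risk score. The sketch below multiplies ordinal weights for impact and probability. That scoring scheme is an assumption for illustration and is not defined by the skill, and the level names are normalized to upper case.

```typescript
type Level = "LOW" | "MEDIUM" | "HIGH";

interface FailureMode {
  component: string;
  failureMode: string;
  impact: Level;
  probability: Level;
}

// Illustrative ordinal weights; tune these to your own risk model.
const weight: Record<Level, number> = { LOW: 1, MEDIUM: 2, HIGH: 3 };

function riskScore(mode: FailureMode): number {
  return weight[mode.impact] * weight[mode.probability];
}

// Highest risk first; ties keep their original table order (stable sort).
function prioritize(modes: FailureMode[]): FailureMode[] {
  return [...modes].sort((a, b) => riskScore(b) - riskScore(a));
}

// Rows taken from the table above.
const failureModes: FailureMode[] = [
  { component: "Database", failureMode: "Unresponsive", impact: "HIGH", probability: "MEDIUM" },
  { component: "API Gateway", failureMode: "Overload", impact: "HIGH", probability: "LOW" },
  { component: "Cache", failureMode: "Eviction", impact: "MEDIUM", probability: "HIGH" },
  { component: "Queue", failureMode: "Backed up", impact: "LOW", probability: "MEDIUM" },
];
```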

Reliability Checklist

Infrastructure

Load balancer with health checks
Multiple availability zones
Auto-scaling configured
Database replication
Regular backups (tested!)

Application

Circuit breakers on external calls
Retry logic with backoff
Timeouts on all I/O
Fallback mechanisms
Graceful degradation
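
For the "Timeouts on all I/O" item, a generic wrapper is one option. The sketch below is an assumption, not part of the skill: `withTimeout` and `TimeoutError` are hypothetical names. It rejects when the deadline passes, and exposes a standard `AbortSignal` so that operations which support cancellation can stop their work.

```typescript
class TimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`Operation timed out after ${timeoutMs} ms`);
    this.name = "TimeoutError";
  }
}

async function withTimeout<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects after timeoutMs and asks the operation to cancel cooperatively
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new TimeoutError(timeoutMs));
    }, timeoutMs);
  });

  try {
    // Whichever settles first wins: the operation or the deadline
    return await Promise.race([operation(controller.signal), timeout]);
  } finally {
    // Stop the timer so a fast operation does not leave it pending
    clearTimeout(timer);
  }
}
```

The wrapper only stops waiting. The underlying work is cancelled only if the operation actually observes the signal, for example by passing it on to an API that accepts an `AbortSignal`.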

Monitoring

SLO dashboard
Error budgets tracked
Alerting on SLO violations
Latency percentiles (p50, p95, p99)
Dependency health checks

Operations

Incident response runbook
On-call rotation
Postmortem template
Disaster recovery plan
Chaos engineering tests

Incident Response Plan

Severity Levels

SEV1 (Critical): Complete service outage, data loss
  - Response time: <15 minutes
  - Page on-call immediately

SEV2 (High): Partial outage, degraded performance
  - Response time: <1 hour
  - Alert on-call

SEV3 (Medium): Minor issues, workarounds available
  - Response time: <4 hours
  - Create ticket

SEV4 (Low): Cosmetic issues, no user impact
  - Response time: Next business day
  - Backlog

Incident Response Steps

1. Acknowledge: Confirm receipt within SLA
2. Assess: Determine severity and impact
3. Communicate: Update status page
4. Mitigate: Stop the bleeding (rollback, scale, disable)
5. Resolve: Fix root cause
6. Document: Write postmortem

Best Practices

Design for failure: Assume components will fail
Fail fast: Don't let slow failures cascade
Isolate failures: Bulkhead pattern
Graceful degradation: Reduce functionality, don't crash
Monitor SLOs: Track error budgets
Test failure modes: Chaos engineering
Document runbooks: Clear incident response

Output Checklist

Circuit breakers implemented
Retry logic with backoff
Fallback mechanisms
Bulkhead isolation
SLOs defined (availability, latency, errors)
Error budgets calculated
Failure mode analysis
Monitoring dashboard
Incident response plan
Runbooks documented

Maintainer

patricio0312rev Core maintainer

Source details

Full Name: patricio0312rev/skills
Branch: main
Path in repo: architecture/reliability-strategy-builder
License: MIT License
Topics: ai claude-code claude cursor skills copilot-coding-agent cursor-ai
