Agent skill

ops-devops-platform

DevOps and platform engineering patterns: Kubernetes, Terraform, GitOps, CI/CD, observability, incident response, and cloud-native ops.


Install this agent skill to your Project

npx add-skill https://github.com/vasilyu1983/AI-Agents-public/tree/main/frameworks/shared-skills/skills/ops-devops-platform

SKILL.md

DevOps Engineering — Quick Reference

This skill equips teams with actionable templates, checklists, and patterns for building self-service platforms, automating infrastructure with GitOps, deploying securely with DevSecOps, scaling with Kubernetes, ensuring reliability through SRE practices, and operating production systems with strong observability.

Modern baseline (2026): IaC (Terraform/OpenTofu/Pulumi), GitOps (Argo CD/Flux), Kubernetes (follow upstream supported releases), OpenTelemetry + Prometheus/Grafana, supply-chain security (SBOM + signing + provenance), policy-as-code (OPA/Gatekeeper or Kyverno), and eBPF-powered networking/security/observability (e.g., Cilium + Tetragon).


Quick Reference

| Task | Tool/Framework | Command | When to Use |
|------|----------------|---------|-------------|
| Infrastructure as Code | Terraform / OpenTofu | terraform plan && terraform apply | Provision cloud resources declaratively |
| GitOps Deployment | Argo CD / Flux | argocd app sync myapp | Continuous reconciliation, declarative deployments |
| Container Build | Docker Engine | docker build -t app:v1 . | Package applications with dependencies |
| Kubernetes Deployment | kubectl / Helm (Kubernetes) | kubectl apply -f deploy.yaml / helm upgrade app ./chart | Deploy to K8s cluster, manage releases |
| CI/CD Pipeline | GitHub Actions | Define workflow in .github/workflows/ci.yml | Automated testing, building, deploying |
| Security Scanning | Trivy / Falco / Tetragon | trivy image myapp:latest | Vulnerability scanning, runtime security, eBPF enforcement |
| Monitoring & Alerts | Prometheus + Grafana | Configure ServiceMonitor and AlertManager | Observability, SLO tracking, incident alerts |
| Load Testing | k6 / Locust | k6 run load-test.js | Performance validation, capacity planning |
| Incident Response | PagerDuty / Opsgenie | Configure escalation policies | On-call management, automated escalation |
| Platform Engineering | Backstage / Port | Deploy internal developer portal | Self-service infrastructure, golden paths |

Decision Tree: Choosing DevOps Approach

What do you need to accomplish?
    ├─ Infrastructure provisioning?
    │   ├─ Cloud-agnostic → Terraform or OpenTofu (OSS fork)
    │   ├─ Programming-first → Pulumi (TypeScript/Python/Go)
    │   ├─ AWS-specific → CloudFormation or Terraform/OpenTofu
    │   ├─ GCP-specific → Deployment Manager or Terraform/OpenTofu
    │   └─ Azure-specific → ARM/Bicep or Terraform/OpenTofu
    │
    ├─ Application deployment?
    │   ├─ Kubernetes cluster?
    │   │   ├─ Simple deploy → kubectl apply -f manifests/
    │   │   ├─ Complex app → Helm charts
    │   │   └─ GitOps workflow → ArgoCD or FluxCD
    │   └─ Serverless?
    │       ├─ AWS → Lambda + SAM/Serverless Framework
    │       ├─ GCP → Cloud Functions
    │       └─ Azure → Azure Functions
    │
    ├─ CI/CD pipeline setup?
    │   ├─ GitHub-based → GitHub Actions (template-github-actions.md)
    │   ├─ GitLab-based → GitLab CI
    │   ├─ Enterprise → Jenkins or Tekton
    │   └─ Security-first → Add SAST/DAST/SCA scans (template-ci-cd.md)
    │
    ├─ Observability & monitoring?
    │   ├─ Metrics → Prometheus + Grafana
    │   ├─ Distributed tracing → Jaeger or OpenTelemetry
    │   ├─ Logs → Loki or ELK stack
    │   ├─ eBPF-based → Cilium + Hubble (sidecarless)
    │   └─ Unified platform → Datadog or New Relic
    │
    ├─ Incident management?
    │   ├─ On-call rotation → PagerDuty or Opsgenie
    │   ├─ Postmortem → template-postmortem.md
    │   └─ Communication → template-incident-comm.md
    │
    ├─ Platform engineering?
    │   ├─ Self-service → Backstage or Port (internal developer portal)
    │   ├─ Policy enforcement → OPA/Gatekeeper
    │   └─ Golden paths → Template repositories + automation
    │
    └─ Security hardening?
        ├─ Container scanning → Trivy or Grype
        ├─ Runtime security → Falco or Sysdig
        ├─ Secrets management → HashiCorp Vault or cloud-native KMS
        └─ Compliance → CIS Benchmarks, template-security-hardening.md

When to Use This Skill

Claude should invoke this skill when users request:

  • Platform engineering patterns (self-service developer platforms, internal tools)
  • GitOps workflows (ArgoCD, FluxCD, declarative infrastructure management)
  • Infrastructure as Code patterns (Terraform, K8s manifests, policy as code)
  • CI/CD pipelines with DevSecOps (GitHub Actions, security scanning, SAST/DAST/SCA)
  • SRE incident management, escalation, and postmortem templates
  • eBPF-based observability (Cilium, Hubble, kernel-level insights, OpenTelemetry)
  • Kubernetes operational patterns (day-2 operations, resource management, workload placement)
  • Cloud-native monitoring (Prometheus, Grafana, unified observability platforms)
  • Team workflow, communication, handover guides, and runbooks

Resources (Best Practices Guides)

Operational best practices by domain:

  • DevOps/SRE Operations: references/devops-best-practices.md - Core patterns for safe infrastructure changes, deployments, and incident response
  • Platform Engineering: references/platform-engineering-patterns.md - Self-service platforms, golden paths, internal developer portals, policy as code
  • GitOps Workflows: references/gitops-workflows.md - Continuous reconciliation, multi-environment promotion, ArgoCD/FluxCD patterns, progressive delivery
  • SRE Incident Management: references/sre-incident-management.md - Severity classification, escalation procedures, blameless postmortems, alert correlation, and runbooks
  • Operational Standards: references/operational-patterns.md - Platform engineering blueprints, CI/CD safety, SLOs, and reliability drills
  • AIOps: references/aiops-patterns.md - Self-healing systems, automated operations, AI-assisted analysis

Templates (Copy-Paste Ready)

Production templates organized by tech stack:

AWS Cloud

  • assets/aws/template-aws-ops.md - AWS service operations and best practices
  • assets/aws/template-aws-terraform.md - Terraform modules for AWS infrastructure
  • assets/aws/template-cost-optimization.md - AWS cost optimization strategies

GCP Cloud

  • assets/gcp/template-gcp-ops.md - GCP service operations
  • assets/gcp/template-gcp-terraform.md - Terraform modules for GCP

Azure Cloud

  • assets/azure/template-azure-ops.md - Azure service operations

Kubernetes

  • assets/kubernetes/template-kubernetes-ops.md - Day-to-day K8s operations
  • assets/kubernetes/template-ha-dr.md - High availability and disaster recovery
  • assets/kubernetes/template-platform-api.md - Platform API patterns
  • assets/kubernetes/template-k8s-deploy.yaml - Deployment manifests

Docker

  • assets/docker/template-docker-ops.md - Container build, security, and operations

Kafka

  • assets/kafka/template-kafka-ops.md - Kafka cluster operations and streaming

Terraform & IaC

  • assets/terraform-iac/template-iac-terraform.md - Infrastructure as Code patterns
  • assets/terraform-iac/template-module.md - Reusable Terraform modules
  • assets/terraform-iac/template-env-promotion.md - Environment promotion strategies

CI/CD Pipelines

  • assets/cicd-pipelines/template-ci-cd.md - General CI/CD patterns
  • assets/cicd-pipelines/template-github-actions.md - GitHub Actions workflows
  • assets/cicd-pipelines/template-gitops.md - GitOps deployment patterns
  • assets/cicd-pipelines/template-release-safety.md - Safe release practices

Monitoring & Observability

  • assets/monitoring-observability/template-slo.md - Service level objectives
  • assets/monitoring-observability/template-alert-rules.md - Alert configuration
  • assets/monitoring-observability/template-observability-slo.md - Observability patterns
  • assets/monitoring-observability/template-loadtest-perf.md - Load testing and performance

Incident Response

  • assets/incident-response/template-postmortem.md - Incident postmortems
  • assets/incident-response/template-runbook-starter.md - Runbook starter template
  • assets/incident-response/template-incident-comm.md - Incident communication
  • assets/incident-response/template-incident-response.md - Incident response procedures

Security

  • assets/security/template-security-hardening.md - Security hardening checklists

Shared Utilities

Centralized patterns from software-clean-code-standard — extract, don't duplicate:

  • config-validation.md — Zod 3.24+, secrets management (Vault, 1Password, Doppler)
  • resilience-utilities.md — p-retry v6, circuit breaker, OTel spans
  • logging-utilities.md — pino v9 + OpenTelemetry integration
  • observability-utilities.md — OpenTelemetry SDK, tracing, metrics

Related Skills

Operations & Infrastructure:

  • ../qa-resilience/SKILL.md — Resilience, chaos engineering, and failure handling patterns
  • ../data-sql-optimization/SKILL.md — Database tuning, high availability, and migrations
  • ../qa-observability/SKILL.md — Monitoring, tracing, profiling, and performance optimization
  • ../qa-debugging/SKILL.md — Production debugging, log analysis, and root cause investigation

Security & Compliance:

  • ../software-security-appsec/SKILL.md — Application-layer security patterns and OWASP best practices

Software Development:

  • ../software-backend/SKILL.md — Service-level design and integration patterns
  • ../software-architecture-design/SKILL.md — System design, scalability, and architectural patterns
  • ../dev-api-design/SKILL.md — RESTful API design and versioning
  • ../dev-git-workflow/SKILL.md — Git branching strategies and CI/CD integration

Optional: AI/Automation (Related Skills):

  • ../ai-mlops/SKILL.md — ML model deployment, monitoring, and lifecycle management

Cost Governance & Capacity Planning

assets/cost-governance/template-cost-governance.md — Production cost control for cloud infrastructure.

Key Sections

  • Cost Governance Framework — Tagging strategy, budget alerts, anomaly detection
  • Cloud Cost Optimization — Right-sizing, reserved capacity, storage tiering
  • Kubernetes Cost Control — Resource requests/limits, quotas, autoscaler config
  • Capacity Planning — Utilization baseline, growth projections, scaling triggers
  • FinOps Practices — Monthly review agenda, optimization workflow
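As a rough illustration of the capacity planning item above (utilization baseline, growth projection, scaling trigger), the following sketch estimates how many months remain before utilization reaches a trigger level. It assumes steady compound monthly growth; the function name, the 80% default trigger, and the sample numbers are illustrative assumptions, not values taken from the template.

```python
import math


def months_until_threshold(current_util: float,
                           monthly_growth: float,
                           threshold: float = 0.8) -> float:
    """Months of compound growth before utilization reaches the scaling trigger.

    current_util and threshold are fractions of total capacity (0.4 == 40%).
    monthly_growth is the assumed month-over-month growth rate (0.1 == 10%).
    """
    if current_util <= 0:
        raise ValueError("current_util must be positive")
    if current_util >= threshold:
        return 0.0  # already at or past the trigger
    if monthly_growth <= 0:
        return math.inf  # flat or shrinking load never reaches the trigger
    # Solve current_util * (1 + g)^n = threshold for n.
    return math.log(threshold / current_util) / math.log(1 + monthly_growth)


# Hypothetical baseline: 40% utilized today, growing 10% per month.
print(round(months_until_threshold(0.4, 0.10), 1))
```

Writing the growth rate and trigger down as explicit parameters also makes the planning assumptions easy to document and revisit.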

Do / Avoid

Do

  • Tag all resources at creation time
  • Set budget alerts before hitting limits
  • Review right-sizing recommendations monthly
  • Use spot/preemptible for fault-tolerant workloads
  • Set Kubernetes resource requests on all pods
  • Enable cluster autoscaler with scale-down
  • Document capacity planning assumptions
  • Run blameless postmortems after every SEV1/2
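One way to make "tag all resources at creation time" enforceable is to check a Terraform plan in CI before it is applied. The sketch below assumes the plan has been exported with `terraform show -json` and parsed into a dict; the required tag keys are a hypothetical policy, and it only inspects resources whose planned state exposes a `tags` attribute (as AWS provider resources typically do).

```python
# Hypothetical tag policy; replace with your organization's required keys.
REQUIRED_TAGS = frozenset({"owner", "cost-center", "environment"})


def find_untagged(plan: dict, required=REQUIRED_TAGS) -> dict:
    """Map resource address -> sorted list of missing tag keys.

    Only resources being created or updated are checked.
    """
    violations = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        after = change.get("after")
        if not after or "tags" not in after:
            continue  # deleted resource, or a type without a tags attribute
        if not {"create", "update"} & set(change.get("actions", [])):
            continue  # no-op or read-only change
        missing = set(required) - set(after["tags"] or {})
        if missing:
            violations[rc["address"]] = sorted(missing)
    return violations


# Minimal hand-written plan fragment for illustration.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["create"],
                    "after": {"tags": {"owner": "platform"}}}},
        {"address": "aws_instance.old",
         "change": {"actions": ["delete"], "after": None}},
    ]
}
print(find_untagged(sample_plan))
```

In a pipeline, a non-empty result would fail the job so that untagged resources never reach `terraform apply`.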

Avoid

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| No cost tags | Can't attribute spend | Enforce tags in CI/CD |
| Dev runs 24/7 | ~70% waste | Scheduled shutdown |
| Over-provisioned | Paying for idle capacity | Monthly right-sizing review |
| No reservations | On-demand premium | 60-70% reserved coverage target |
| Alert fatigue | Real issues missed | SLO-based alerting, tuned thresholds |
| Snowflake infra | Unreproducible, undocumented | Everything in Terraform/IaC |
| Clickops drift | Config outside IaC | Enforce GitOps reconciliation |
| No postmortems | Same incidents repeat | Blameless postmortem for SEV1/2 |
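The "~70% waste" figure for always-on dev environments is not derived in this document. As a sanity check, the arithmetic below shows one schedule that is consistent with it: an environment needed roughly 10 hours a day on 5 weekdays. The schedule is an assumption for illustration, and the result covers compute hours only (storage and other always-billed resources are not included).

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours in an always-on week


def idle_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of an always-on week during which the environment is not needed."""
    used_hours = hours_per_day * days_per_week
    return 1 - used_hours / HOURS_PER_WEEK


# Assumed working schedule: 10 hours a day, Monday to Friday.
print(f"{idle_fraction(10, 5):.0%}")  # prints "70%"
```

A longer working window lowers the saving (12 hours on weekdays gives about 64%), so the achievable reduction depends on the team's actual usage pattern.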

Optional: AI/Automation (AIOps)

AI can assist with analysis and triage, but infrastructure/cost/incident decisions require human approval and an audit trail.

See references/aiops-patterns.md for self-healing systems, automated operations, AI-assisted analysis, and bounded claims.
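The human-approval and audit-trail requirement above can be sketched as a small gate around any AI-proposed action. Everything here is hypothetical (the `ProposedAction` type, the field names, the in-memory log); a real setup would persist the audit entries and integrate with whatever approval workflow the team already uses.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class ProposedAction:
    kind: str       # e.g. "scale-down" or "rollback"
    target: str     # resource the action would touch
    rationale: str  # AI-generated analysis shown to the approver


def execute_if_approved(action: ProposedAction,
                        approver: Optional[str],
                        audit_log: list,
                        run: Callable[[ProposedAction], None]) -> bool:
    """Record every decision; run the action only when a human approver is named."""
    entry = {
        **asdict(action),
        "approver": approver,
        "approved": approver is not None,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(json.dumps(entry))  # log approvals and rejections alike
    if approver is None:
        return False
    run(action)
    return True


audit_log: list = []
executed: list = []
proposal = ProposedAction("scale-down", "dev-cluster", "idle for 7 days")
execute_if_approved(proposal, None, audit_log, executed.append)     # not run
execute_if_approved(proposal, "alice", audit_log, executed.append)  # run
print(len(executed), len(audit_log))
```

Logging before executing means a rejected or failed action still leaves a trace for later review.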


Operational Deep Dives

See references/operational-patterns.md for:

  • Platform engineering blueprints and GitOps reconciliation checklists
  • DevSecOps CI/CD gates, SLO/SLI playbooks, and rollout verification steps
  • Observability patterns (eBPF), incident noise reduction, and reliability drills
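To give a flavor of SLO-based alerting and noise reduction, here is a minimal burn-rate sketch. It is not taken from references/operational-patterns.md. It assumes a multiwindow rule in the style described in the Google SRE Workbook, where a burn rate of 14.4 sustained over one hour corresponds to spending 2% of a 30-day error budget; the function names and sample error ratios are illustrative.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1 - slo
    return error_ratio / error_budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and short windows burn faster than the threshold.

    Requiring the short window as well helps the alert clear quickly once the
    problem stops, which reduces noise.
    """
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


# Assumed inputs: 2% errors over the last hour, 3% over the last 5 minutes.
print(should_page(0.02, 0.03))    # sustained and ongoing burn
print(should_page(0.02, 0.0005))  # the burst has already subsided
```

In practice the error ratios would come from Prometheus queries over the two windows rather than hard-coded values.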

External Resources

See data/sources.json for curated sources organized by tech stack:

  • Cloud Platforms: AWS, GCP, Azure documentation and best practices
  • Container Orchestration: Kubernetes, Helm, Kustomize, Docker
  • Infrastructure as Code: Terraform, OpenTofu, Pulumi, CloudFormation, ARM templates
  • CI/CD & GitOps: GitHub Actions, GitLab CI, Jenkins, ArgoCD, FluxCD
  • Streaming: Apache Kafka, Confluent, Strimzi
  • Monitoring: Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, Cilium/Hubble, Tetragon
  • SRE: Google SRE books, incident response patterns
  • Security: OWASP DevSecOps, CIS Benchmarks, Trivy, Falco
  • Tools: kubectl, k9s, stern, Cosign, Syft, Terragrunt

Use this skill as a hub for safe, modern, production-grade DevOps patterns. All templates and patterns are operational, with no theory or book summaries.


Trend Awareness Protocol

When users ask recommendation questions about DevOps, platform engineering, or cloud infrastructure, validate time-sensitive details (versions, deprecations, licensing, major releases) against primary sources.

Trigger Conditions

  • "What's the best tool for [Kubernetes/IaC/CI-CD/monitoring]?"
  • "What should I use for [container orchestration/GitOps/observability]?"
  • "What's the latest in DevOps/platform engineering?"
  • "Current best practices for [Terraform/ArgoCD/Prometheus]?"
  • "Is [tool/approach] still relevant in 2026?"
  • "[Kubernetes] vs [alternative]?" or "[ArgoCD] vs [FluxCD]?"
  • "Best cloud provider for [use case]?"
  • "What orchestration/monitoring tool should I use?"

Minimum Verification (Preferred Order)

  1. Check the official docs + release notes linked in data/sources.json for the specific tools you recommend.
  2. If internet access is available, confirm recent releases, breaking changes, and deprecations from those release pages.
  3. If internet access is not available, state that versions may have changed and focus on stable selection criteria (operational fit, ecosystem, maturity, team skills, compliance).

What to Report

After searching, provide:

  • Current landscape: What tools/approaches are popular NOW (not 6 months ago)
  • Emerging trends: New tools, patterns, or practices gaining traction
  • Deprecated/declining: Tools/approaches losing relevance or support
  • Recommendation: Based on fresh data, not just static knowledge

Example Topics (verify with fresh search)

  • Kubernetes versions and ecosystem tools (1.33+, Cilium, Gateway API)
  • Infrastructure as Code (Terraform, OpenTofu, Pulumi, CDK)
  • GitOps platforms (ArgoCD, FluxCD, Codefresh)
  • Observability stacks (OpenTelemetry, Grafana stack, Datadog)
  • Platform engineering tools (Backstage, Port, Kratix)
  • CI/CD platforms (GitHub Actions, GitLab CI, Dagger)
  • Cloud-native security (Falco, Trivy, policy engines)

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
  • Prefer primary sources; report source links and dates for volatile information.
  • If web access is unavailable, state the limitation and mark guidance as unverified.
