Agent skill

ops-platform-onboarding

Structured workflow for onboarding new services to the internal platform including infrastructure provisioning, observability setup, and documentation.

Stars 169
Forks 18

Install this agent skill to your Project

npx add-skill https://github.com/LerianStudio/ring/tree/main/.archive/ops-team/skills/ops-platform-onboarding

SKILL.md

Platform Onboarding Workflow

This skill defines the structured process for onboarding services to the internal developer platform. Use it to ensure consistent, compliant service deployments.


Onboarding Phases

Phase Focus Output
1. Requirements Gather service requirements Requirements doc
2. Golden Path Selection Choose deployment pattern Selected template
3. Infrastructure Provisioning Create service resources Infrastructure ready
4. Observability Setup Configure monitoring Dashboards/alerts
5. Security Configuration Apply security controls Security validated
6. Documentation Complete service docs Runbook ready
7. Handoff Transfer to service team Ownership confirmed

Phase 1: Requirements Gathering

Service Requirements Checklist

markdown
## Service Onboarding Request

**Service Name:** [name]
**Team:** [owning team]
**Requested By:** [name]
**Target Date:** YYYY-MM-DD

### Service Information

| Attribute | Value |
|-----------|-------|
| Service type | [API / Worker / Batch / Frontend] |
| Language/runtime | [Go / Node.js / Python / etc.] |
| Criticality | [Tier 1/2/3/4] |
| External traffic | [Yes / No] |
| Data sensitivity | [PII / Financial / Public] |

### Resource Requirements

| Resource | Requirement | Notes |
|----------|-------------|-------|
| CPU | [cores] | [peak/average] |
| Memory | [GB] | [peak/average] |
| Storage | [GB] | [type: SSD/HDD] |
| Database | [type] | [shared/dedicated] |
| Cache | [type] | [shared/dedicated] |

### Dependencies

| Dependency | Type | SLA Required |
|------------|------|--------------|
| [service] | Internal | [Yes/No] |
| [external] | External | [Yes/No] |

### Compliance Requirements

- [ ] SOC2
- [ ] PCI-DSS
- [ ] GDPR
- [ ] HIPAA
- [ ] Other: ____________

Phase 2: Golden Path Selection

Available Golden Paths

Golden Path Use Case Includes
api-service REST/GraphQL APIs ALB, EKS, RDS, ElastiCache
worker-service Background processing SQS, EKS, auto-scaling
batch-job Scheduled jobs EventBridge, Lambda/Fargate
frontend-app Static sites, SPAs CloudFront, S3, API Gateway
data-pipeline ETL, streaming Kinesis, Glue, S3

Golden Path Selection Matrix

Requirement api-service worker-service batch-job
HTTP traffic Yes No No
Queue processing Optional Yes Optional
Scheduled runs No No Yes
Real-time Yes Near-real-time No
Auto-scaling Yes Yes N/A

Selection Template

markdown
## Golden Path Selection

**Service:** [name]
**Selected Path:** [api-service / worker-service / etc.]

### Rationale

1. Service type [X] matches [golden path] pattern
2. Traffic requirements of [X] supported by [features]
3. Compliance requirements met by built-in [controls]

### Customizations Required

| Standard Component | Customization | Reason |
|--------------------|---------------|--------|
| [component] | [change] | [why] |

### Approval

- [ ] Platform team reviewed
- [ ] Security team reviewed (if customizations)
- [ ] Architecture team reviewed (if non-standard)

Phase 3: Infrastructure Provisioning

Provisioning Checklist

  • Namespace/project created
  • Compute resources provisioned
  • Database provisioned (if required)
  • Cache provisioned (if required)
  • Load balancer configured
  • DNS entries created
  • SSL certificates provisioned
  • Secrets stored in vault
  • IAM roles/service accounts created

Terraform/IaC Template

hcl
# Example service provisioning
module "service" {
  source = "platform/service-template"

  service_name    = var.service_name
  team            = var.team
  environment     = var.environment
  golden_path     = "api-service"

  # Compute
  cpu_request     = "500m"
  memory_request  = "512Mi"
  replicas_min    = 2
  replicas_max    = 10

  # Database
  database_enabled = true
  database_class   = "db.t3.medium"

  # Tags
  tags = {
    Team        = var.team
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}

Provisioning Verification

bash
# Verify namespace
kubectl get namespace [service-name]

# Verify compute
kubectl get deployment -n [service-name]

# Verify database
aws rds describe-db-instances --db-instance-identifier [service-db]

# Verify DNS
dig [service-name].internal.example.com

Phase 4: Observability Setup

Observability Checklist

  • Structured logging configured
  • Tracing instrumentation added
  • Metrics endpoints exposed
  • Service dashboard created
  • SLI/SLO defined
  • Alerts configured
  • On-call integration set up

Dashboard Template

Standard service dashboard includes:

Panel Metrics
Request rate requests/sec, by status code
Error rate 5xx rate, 4xx rate
Latency p50, p95, p99
Saturation CPU, memory utilization
Dependencies Upstream/downstream health

Alert Configuration

Alert Condition Severity Response
High error rate 5xx > 1% for 5m Critical Page on-call
High latency p99 > 1s for 5m Warning Alert team
Low availability uptime < 99.9% Critical Page on-call
Resource saturation CPU > 85% for 10m Warning Alert team

SLI/SLO Definition

markdown
## Service Level Objectives

**Service:** [name]
**SLO Version:** 1.0

| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful requests / total requests |
| Latency | p99 < 500ms | Request duration percentile |
| Error rate | < 0.1% | 5xx responses / total responses |

### Error Budget

- Monthly budget: 43.2 minutes downtime
- Current consumption: [X]%
- Actions if budget exceeded: [escalation process]

Phase 5: Security Configuration

Security Checklist

  • Network policies applied
  • Service mesh mTLS configured
  • Secrets management configured
  • IAM permissions follow least privilege
  • Security scanning in CI/CD
  • Dependency scanning enabled
  • WAF rules applied (if external)

Network Policy Template

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: service-policy
  namespace: [service-name]
spec:
  podSelector:
    matchLabels:
      app: [service-name]
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - port: 5432

Security Review

markdown
## Security Configuration Review

**Service:** [name]
**Reviewer:** @security-team

| Control | Status | Notes |
|---------|--------|-------|
| mTLS enabled | PASS | Istio strict mode |
| Network policies | PASS | Ingress/egress restricted |
| Secrets management | PASS | Using Vault |
| Least privilege IAM | PASS | Scoped to required resources |
| Vulnerability scanning | PASS | Trivy in CI/CD |

Phase 6: Documentation

Required Documentation

Document Purpose Template
Service Overview What the service does README.md
Runbook Operational procedures runbook.md
Architecture Design decisions architecture.md
API Docs Interface documentation OpenAPI spec
On-call Guide Incident handling oncall.md

Runbook Template

markdown
## [Service Name] Runbook

### Service Overview

[Brief description of what the service does]

### Quick Reference

| Item | Value |
|------|-------|
| Repository | [link] |
| Dashboard | [link] |
| Logs | [query link] |
| On-call | [PagerDuty service] |

### Common Operations

#### Restart Service

```bash
kubectl rollout restart deployment/[service] -n [namespace]

Scale Service

bash
kubectl scale deployment/[service] -n [namespace] --replicas=X

Check Logs

bash
kubectl logs -l app=[service] -n [namespace] --tail=100

Troubleshooting

Symptom Possible Cause Resolution
High latency DB connection pool Scale DB or optimize queries
5xx errors Dependency down Check upstream services
OOM kills Memory leak Investigate heap, restart

Escalation

Level Contact When
L1 [team Slack channel] First response
L2 [on-call engineer] Cannot resolve in 15m
L3 [service owner] Critical/extended outage

---

## Phase 7: Handoff

### Handoff Checklist

- [ ] Service owner identified and trained
- [ ] On-call rotation set up
- [ ] Access provisioned to team
- [ ] Documentation reviewed by team
- [ ] Shadowing session completed
- [ ] Ownership officially transferred

### Handoff Template

```markdown
## Service Handoff Confirmation

**Service:** [name]
**Date:** YYYY-MM-DD
**Platform Team:** @[name]
**Service Owner:** @[name]

### Completed Items

- [x] Infrastructure provisioned and documented
- [x] Observability configured
- [x] Security controls applied
- [x] Runbook created and reviewed
- [x] On-call rotation configured
- [x] Training session completed

### Outstanding Items

| Item | Owner | Due Date |
|------|-------|----------|
| [item] | [owner] | YYYY-MM-DD |

### Acknowledgment

By signing below, the service owner confirms:
1. Receipt of all documentation
2. Understanding of operational procedures
3. Acceptance of on-call responsibilities

**Service Owner:** _________________ Date: _______
**Platform Team:** _________________ Date: _______

Anti-Rationalization Table

Rationalization Why It's WRONG Required Action
"Skip documentation, code is self-explanatory" On-call != developers Complete runbook
"We'll add observability later" Blind deployments = incidents Observability on day 1
"Golden path doesn't fit exactly" Customizations add complexity Justify every deviation
"Security can come later" Later = never for security Security from start
"Team can figure it out" Assumptions cause outages Complete handoff process

Dispatch Specialist

For platform onboarding tasks, dispatch:

Task tool:
  subagent_type: "ring:platform-engineer"
  prompt: |
    SERVICE ONBOARDING REQUEST
    Service: [name]
    Team: [team]
    Type: [API/Worker/Batch]
    Requirements: [summary]
    Golden Path: [if known]

Expand your agent's capabilities with these related and highly-rated skills.

LerianStudio/ring

ring:regulatory-templates

5-stage regulatory template orchestrator - manages setup, Gate 1 (analysis + auto-save), Gate 2 (validation), Gate 3 (generation), optional Test Gate, optional Contribution Gate. Supports any regulatory template (BACEN, RFB, CVM, SUSEP, COAF, or other).

169 18
Explore
LerianStudio/ring

ring:using-finops-team

3 FinOps agents: 2 for Brazilian financial regulatory compliance (BACEN, RFB, Open Banking), 1 for infrastructure cost estimation when onboarding customers. Supports any regulatory template via open intake system.

169 18
Explore
LerianStudio/ring

ring:regulatory-templates-gate1

Gate 1 sub-skill - performs regulatory compliance analysis, field mapping, batch approval by confidence level, and auto-saves dictionary after approval. Supports both pre-defined templates (dictionary exists) and new templates (any spec).

169 18
Explore
LerianStudio/ring

ring:regulatory-templates-gate2

Gate 2 sub-skill - validates uncertain mappings from Gate 1 and confirms all field specifications through testing.

169 18
Explore
LerianStudio/ring

ring:regulatory-templates-gate3

Gate 3 sub-skill - generates complete .tpl template file with all validated mappings from Gates 1-2.

169 18
Explore
LerianStudio/ring

ring:infrastructure-cost-estimation

Orchestrates infrastructure cost estimation with tier-based or custom TPS sizing. Offers pre-configured tiers (Starter/Growth/Business/Enterprise) or custom TPS input. Skill discovers components, asks shared/dedicated for EACH, selects environment(s), reads actual Helm chart configs, then dispatches agent for accurate calculations.

169 18
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results