Agent skill

senior-cloud-architect

Stars 71

Forks 21

Install this agent skill to your Project

npx add-skill https://github.com/borghei/Claude-Skills/tree/main/engineering/senior-cloud-architect

Metadata

Additional technical details for this skill

tags: cloud aws gcp azure architecture infrastructure terraform
author: borghei
domain: cloud-architecture
updated: 1774915200
version: 1.0.0
category: engineering

SKILL.md

Senior Cloud Architect

Expert cloud architecture and infrastructure design across AWS, GCP, and Azure.

Keywords

cloud, aws, gcp, azure, terraform, infrastructure, vpc, eks, ecs, lambda, cost-optimization, disaster-recovery, multi-region, iam, security, migration

Quick Start

bash

# Analyze infrastructure costs
python scripts/cost_analyzer.py --account production --period monthly

# Run DR validation
python scripts/dr_test.py --region us-west-2 --type failover

# Audit security posture
python scripts/security_audit.py --framework cis --output report.html

# Generate resource inventory
python scripts/inventory.py --accounts all --format csv

Tools

Script	Purpose
`scripts/cost_analyzer.py`	Analyze cloud spend by service, environment, and tag
`scripts/dr_test.py`	Validate disaster recovery failover procedures
`scripts/security_audit.py`	Audit against CIS benchmarks and compliance frameworks
`scripts/inventory.py`	Inventory all resources across accounts and regions

Cloud Platform Comparison

Service	AWS	GCP	Azure
Compute	EC2, ECS, EKS	GCE, GKE	VMs, AKS
Serverless	Lambda	Cloud Functions	Azure Functions
Storage	S3	Cloud Storage	Blob Storage
Database	RDS, DynamoDB	Cloud SQL, Spanner	SQL DB, CosmosDB
ML	SageMaker	Vertex AI	Azure ML
CDN	CloudFront	Cloud CDN	Azure CDN

Workflow 1: Design a Production AWS Architecture

Define requirements -- Identify compute, storage, database, and networking needs. Determine RTO/RPO targets.

Provision VPC with Terraform:

hcl

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${var.project}-${var.environment}"
  cidr = var.vpc_cidr

  # Three AZs; assumes the region offers zones suffixed a, b, and c.
  azs             = ["${var.region}a", "${var.region}b", "${var.region}c"]
  private_subnets = var.private_subnets
  public_subnets  = var.public_subnets

  # Non-production shares a single NAT gateway to save cost; production does not.
  enable_nat_gateway   = true
  single_nat_gateway   = var.environment != "production"
  enable_dns_hostnames = true

  tags = local.common_tags
}
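
The `private_subnets` and `public_subnets` variables in the module call expect lists of CIDR blocks. As a minimal sketch, one way to derive them from the VPC CIDR uses Python's standard `ipaddress` module. The helper name `plan_subnets`, the /20 subnet size, and the `10.0.0.0/16` example range are assumptions for illustration and do not come from the skill itself.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, az_count: int = 3, new_prefix: int = 20):
    """Split a VPC CIDR into equal subnets: private blocks first, then public.

    Hypothetical helper: one private and one public subnet per AZ mirrors the
    module call, but the sizing is an assumption.
    """
    network = ipaddress.ip_network(vpc_cidr)
    blocks = list(network.subnets(new_prefix=new_prefix))
    needed = az_count * 2
    if len(blocks) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")
    private = [str(block) for block in blocks[:az_count]]
    public = [str(block) for block in blocks[az_count:needed]]
    return private, public


private_subnets, public_subnets = plan_subnets("10.0.0.0/16")
print(private_subnets)  # ['10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20']
print(public_subnets)   # ['10.0.48.0/20', '10.0.64.0/20', '10.0.80.0/20']
```

The two lists can be passed to Terraform as variable values. Equal-sized blocks keep the sketch short; in practice public subnets are often smaller than private ones.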

Deploy compute -- ECS/EKS in private subnets behind an ALB in public subnets. Use at least 2 AZs for redundancy.
Configure database -- RDS Multi-AZ for production, single-AZ for staging. Set backup retention to 30 days (production) or 7 days (non-production).
Add caching layer -- ElastiCache (Redis) between application and database.
Layer security -- WAF on CloudFront, NACLs on subnets, security groups on instances. Apply least-privilege IAM.
Validate -- Run `python scripts/security_audit.py --framework cis` and resolve all high-severity findings.

Reference Architecture

Route 53 (DNS) -> CloudFront + WAF -> ALB
  -> ECS/EKS Cluster (AZ-a) + ECS/EKS Cluster (AZ-b)
    -> ElastiCache (Redis)
      -> RDS Multi-AZ (Primary + Standby)

Workflow 2: Optimize Cloud Costs

Audit current spend -- Run `python scripts/cost_analyzer.py --account production --period monthly`.

Right-size instances -- Identify instances with avg CPU <10% and max CPU <30% as downsize candidates:

python

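# Right-sizing rule; CPU values are utilization percentages for the review period.
def recommend_sizing(avg_cpu: float, max_cpu: float) -> str:
    """Return 'downsize', 'upsize', or 'optimal' for one instance."""
    if avg_cpu < 10 and max_cpu < 30:
        return 'downsize'  # idle on average and never busy at peak
    if avg_cpu > 80:
        return 'upsize'    # sustained high utilization
    return 'optimal'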

Convert steady-state workloads to Reserved Instances or Savings Plans:

Type	Discount	Commitment	Use Case
On-Demand	0%	None	Variable workloads
Reserved	30-72%	1-3 years	Steady-state
Savings Plans	30-72%	1-3 years	Flexible compute
Spot	60-90%	None	Fault-tolerant batch
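
To make the discount ranges concrete, the sketch below applies them to a monthly on-demand bill. The percentages are copied from the table; the function name `monthly_cost_range` and the 1,000-per-month workload are hypothetical.

```python
# Discount ranges from the pricing table, as fractions of the on-demand price.
DISCOUNTS = {
    "On-Demand": (0.00, 0.00),
    "Reserved": (0.30, 0.72),
    "Savings Plans": (0.30, 0.72),
    "Spot": (0.60, 0.90),
}


def monthly_cost_range(on_demand_cost: float, pricing_type: str):
    """Return (best_case, worst_case) monthly cost for a pricing type."""
    low_discount, high_discount = DISCOUNTS[pricing_type]
    best = on_demand_cost * (1 - high_discount)
    worst = on_demand_cost * (1 - low_discount)
    return round(best, 2), round(worst, 2)


# Hypothetical workload that costs 1,000 per month on demand.
for name in DISCOUNTS:
    print(name, monthly_cost_range(1000.0, name))
```

For the Reserved row this gives a range of 280 to 700 per month. Actual discounts depend on term, payment option, and instance family, so the output is a planning bound rather than a quote.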

Enforce cost allocation tags -- Require Environment, Project, Owner, CostCenter on all resources. Alert on untagged resources after 24 hours.
Validate -- Re-run the cost analyzer and confirm that the savings target has been achieved.
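
The tagging rule can be sketched as a small check over an inventory export. The required tag names and the 24-hour grace period come from the step above; the resource dictionaries, their field names (`id`, `created_at`, `tags`), and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"Environment", "Project", "Owner", "CostCenter"}
GRACE_PERIOD = timedelta(hours=24)


def missing_tags(tags: dict) -> set:
    """Return the required tag keys that are absent or empty."""
    return {key for key in REQUIRED_TAGS if not tags.get(key)}


def should_alert(resource: dict, now: datetime) -> bool:
    """Alert only when tags are missing and the grace period has passed."""
    age = now - resource["created_at"]
    return bool(missing_tags(resource["tags"])) and age > GRACE_PERIOD


# Hypothetical inventory records; a real export would come from the cloud APIs.
now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
resources = [
    {"id": "i-old", "created_at": now - timedelta(hours=30),
     "tags": {"Environment": "production"}},
    {"id": "i-new", "created_at": now - timedelta(hours=2), "tags": {}},
    {"id": "i-ok", "created_at": now - timedelta(days=9),
     "tags": {"Environment": "staging", "Project": "web",
              "Owner": "ops", "CostCenter": "1234"}},
]
alerts = [r["id"] for r in resources if should_alert(r, now)]
print(alerts)  # ['i-old']
```

Only `i-old` triggers an alert: `i-new` is still inside the grace period and `i-ok` carries all four tags.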

Workflow 3: Plan Disaster Recovery

Select DR strategy based on RTO/RPO requirements:

Strategy	RTO	RPO	Cost
Backup & Restore	Hours	Hours	$
Pilot Light	Minutes	Minutes	$$
Warm Standby	Minutes	Seconds	$$$
Multi-Site Active	Seconds	Near-zero	$$$$

Configure cross-region replication -- Database replication to secondary region. S3 cross-region replication for object storage.
Set up Route 53 failover routing -- Health checks on primary. Automatic DNS failover to secondary.
Define backup policy:
- Database: continuous replication, 35-day retention, cross-region, encrypted
- Application data: daily, 90-day retention, lifecycle to IA at 30d, Glacier at 90d
- Configuration: on-change via git + S3, unlimited retention
Test -- Run `python scripts/dr_test.py --region us-west-2 --type failover` and confirm that RTO/RPO targets are met.
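
The strategy selection in the first step can be read as "pick the cheapest row whose RTO and RPO both meet the targets". The sketch below encodes the rows of the strategy table in this workflow with their coarse labels. The ranking of labels and the function name `select_strategy` are assumptions; real targets are usually expressed in concrete minutes or seconds.

```python
# Coarse ordering of recovery targets, fastest first (labels from the table).
RANK = {"Near-zero": 0, "Seconds": 1, "Minutes": 2, "Hours": 3}

# Rows of the strategy table, ordered from cheapest to most expensive.
STRATEGIES = [
    ("Backup & Restore", "Hours", "Hours"),
    ("Pilot Light", "Minutes", "Minutes"),
    ("Warm Standby", "Minutes", "Seconds"),
    ("Multi-Site Active", "Seconds", "Near-zero"),
]


def select_strategy(rto_target: str, rpo_target: str) -> str:
    """Return the cheapest strategy whose RTO and RPO meet both targets."""
    for name, rto, rpo in STRATEGIES:
        if RANK[rto] <= RANK[rto_target] and RANK[rpo] <= RANK[rpo_target]:
            return name
    raise ValueError("no listed strategy meets the requested targets")


print(select_strategy("Hours", "Hours"))      # Backup & Restore
print(select_strategy("Minutes", "Seconds"))  # Warm Standby
print(select_strategy("Seconds", "Seconds"))  # Multi-Site Active
```

Because the list is ordered by cost, the first match is also the least expensive option that satisfies both targets.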

Workflow 4: Audit Security Posture

Run audit -- Run `python scripts/security_audit.py --framework cis --output report.html`.
Review network segmentation -- Public subnets contain only NAT GW, ALB, bastion. Private subnets contain application tier. Data subnets contain RDS, Redis, Elasticsearch.

Enforce least-privilege IAM -- Every policy scoped to specific resources and conditions:

json

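{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/uploads/*",
      "Condition": {
        "StringEquals": { "aws:PrincipalTag/Team": "engineering" },
        "IpAddress": { "aws:VpcSourceIp": ["10.0.0.0/8"] }
      }
    }
  ]
}

The private range is matched with `aws:VpcSourceIp`, which applies to requests that arrive through a VPC endpoint. `aws:SourceIp` evaluates the caller's public address, so it cannot match a private range such as 10.0.0.0/8.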

Verify encryption -- Data encrypted at rest (KMS) and in transit (TLS 1.2+).
Validate -- Re-run audit and confirm all critical and high findings resolved.
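
A least-privilege review can be partly automated by scanning policy documents for broad grants. The following is a minimal sketch, not the logic of `scripts/security_audit.py`: the function name `find_broad_statements` and its three heuristics (wildcard action, `*` resource, missing condition) are assumptions, and it ignores elements such as `NotAction` and `NotResource`.

```python
def find_broad_statements(policy: dict) -> list:
    """Return reasons why Allow statements look broader than least privilege."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {index}: wildcard action")
        if "*" in resources:
            findings.append(f"statement {index}: resource is '*'")
        if not stmt.get("Condition"):
            findings.append(f"statement {index}: no condition")
    return findings


scoped = {"Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/uploads/*",
    "Condition": {"StringEquals": {"aws:PrincipalTag/Team": "engineering"}},
}]}
broad = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}

print(find_broad_statements(scoped))  # []
print(find_broad_statements(broad))
```

The scoped policy passes because its resource is a specific prefix rather than a bare `*`; the broad policy produces three findings.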

AWS Well-Architected Pillars (Decision Checklist)

Operational Excellence: IaC everywhere? Monitoring and alerting? Runbooks for incidents?
Security: Least-privilege IAM? Encryption at rest and in transit? VPC segmentation?
Reliability: Multi-AZ? Auto-scaling? DR tested?
Performance: Right-sized instances? Caching layer? CDN for static assets?
Cost Optimization: Reserved capacity for steady-state? Spot for batch? Unused resources cleaned?
Sustainability: Efficient regions? Right-sized compute? Data lifecycle policies?

Reference Materials

Document	Path
AWS Patterns	references/aws_patterns.md
GCP Patterns	references/gcp_patterns.md
Multi-Cloud Strategies	references/multi_cloud.md
Cost Optimization Guide	references/cost_optimization.md

Troubleshooting

Problem	Cause	Solution
Cross-region latency exceeds 200ms	No regional caching or CDN configured	Deploy CloudFront/Cloud CDN with edge locations closest to user base; enable regional API Gateway caches
Terraform state lock conflicts across teams	Shared state backend without proper locking	Use DynamoDB (AWS) or GCS (GCP) state locking with per-team state file partitioning via workspaces
Multi-cloud DNS failover not triggering	Health check thresholds too lenient or misconfigured endpoints	Set health check interval to 10s, failure threshold to 3, and verify endpoint returns 200 on the exact path monitored
IAM permission errors after cross-account migration	Trust policies not updated for new account IDs	Update AssumeRole trust policies with correct account principals and external IDs; validate with `aws sts assume-role`
Cloud costs spike unexpectedly after scaling event	Auto-scaling max limits set too high or no budget alerts	Set hard max instance counts per ASG, configure billing alerts at 80%/100%/120% thresholds, and review Spot fallback behavior
VPC peering routes not propagating between VPCs	Route tables missing entries for peered CIDR ranges	Add explicit route entries in both VPCs pointing peered CIDRs to the peering connection; verify no overlapping CIDRs
DR failover test fails with data inconsistency	Replication lag between primary and secondary regions	Switch to synchronous replication for critical databases or implement application-level consistency checks pre-failover
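
The health check settings in the DNS failover row translate into a detection delay. Under a simplified model (an assumption: it counts only check interval times failure threshold, plus the DNS record TTL, and ignores propagation and client-side caching), the worst-case failover time can be estimated as follows. The TTL values are hypothetical.

```python
def worst_case_failover_seconds(interval_s: int, failure_threshold: int,
                                dns_ttl_s: int) -> int:
    """Upper bound: time to mark the endpoint unhealthy plus cached DNS expiry."""
    detection = interval_s * failure_threshold
    return detection + dns_ttl_s


# 10-second interval and threshold of 3, as in the troubleshooting table.
print(worst_case_failover_seconds(10, 3, dns_ttl_s=30))  # 60
print(worst_case_failover_seconds(10, 3, dns_ttl_s=60))  # 90
```

With these settings detection alone takes 30 seconds, so under this model a 60-second failover goal leaves at most 30 seconds for the record TTL.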

Success Criteria

99.99% availability SLA met across all production workloads with documented uptime reports
Cost optimization savings above 25% compared to on-demand baseline through Reserved Instances, Savings Plans, and right-sizing
RTO < 15 minutes and RPO < 1 minute validated through quarterly DR failover tests
Zero critical CIS benchmark findings in production accounts after security audit remediation
Infrastructure drift < 2% measured by Terraform plan diffs on scheduled compliance scans
Cross-region failover completes within 60 seconds with automated Route 53 health check validation
100% resource tagging compliance enforced via automated policy checks with no untagged resources older than 24 hours
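
The availability target implies a concrete downtime budget. The short calculation below converts a percentage into allowed minutes of downtime; the function name `downtime_budget_minutes` is hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of allowed downtime for an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return round(total_minutes * (1 - availability_pct / 100), 2)


print(downtime_budget_minutes(99.99, 365))  # 52.56
print(downtime_budget_minutes(99.99, 30))   # 4.32
```

At 99.99%, a year allows about 52.56 minutes of downtime. Three outages that each use the full 15-minute RTO would consume 45 of those minutes.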

Scope & Limitations

This skill covers:

Multi-cloud architecture design and comparison across AWS, GCP, and Azure
Infrastructure-as-Code with Terraform including VPC, compute, database, and networking
Disaster recovery planning, cross-region replication, and failover strategies
Cloud cost optimization, right-sizing, and reserved capacity planning

This skill does NOT cover:

Application-level code architecture or microservice design patterns (see senior-architect)
Kubernetes cluster internals, pod scheduling, or service mesh configuration (see senior-devops)
Security compliance frameworks beyond CIS benchmarks such as SOC 2, HIPAA, or GDPR (see ra-qm-team/ compliance skills)
CI/CD pipeline design, build automation, or deployment workflows (see senior-devops)

Integration Points

Skill	Integration	Data Flow
`senior-devops`	Infrastructure provisioning feeds into CI/CD deployment pipelines	Terraform outputs (endpoints, ARNs) → deployment configs
`senior-secops`	Security audit findings inform cloud hardening decisions	CIS benchmark results → security remediation tasks
`senior-architect`	Application architecture requirements drive cloud resource selection	Capacity requirements → compute/storage/network sizing
`aws-solution-architect`	AWS-specific deep dives complement multi-cloud strategy	Cloud platform comparison → AWS implementation details
`ra-qm-team/soc2-compliance`	Compliance requirements shape infrastructure security controls	Compliance matrices → IAM policies, encryption configs, audit logging
`senior-fullstack`	Fullstack application stacks deploy onto cloud infrastructure	Application stack definitions → ECS/EKS task definitions, RDS configs

Maintainer

borghei Core maintainer

Source details

Full Name: borghei/Claude-Skills
Branch: main
Path in repo: engineering/senior-cloud-architect
License: Other
Topics: claude-code automation ai-agents cursor developer-tools agentic-coding github-copilot prompt-engineering llm python ai-coding-assistant ai-skills windsurf openai-codex compliance-automation eu-ai-act gdpr-compliance iso-27001 role-based-agents soc2
