# Terraform IaC Expert

Infrastructure as Code for AI workloads using Terraform across AWS, Azure, GCP, and OCI.
Install this agent skill to your project:

```shell
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/terraform-iac-expert-frankxai-ai-architect
```
## SKILL.md

Expert in Terraform and Infrastructure as Code for deploying AI infrastructure across multi-cloud environments.
## Project Structure

```
infrastructure/
├── modules/
│   ├── aws-bedrock/
│   ├── azure-openai/
│   ├── oci-genai/
│   └── vector-store/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
├── shared/
│   ├── networking/
│   └── security/
└── scripts/
```
## Module Overview

| Module | Provider | Purpose |
|---|---|---|
| `aws-bedrock` | AWS | Bedrock, Knowledge Bases, VPC Endpoints |
| `azure-openai` | Azure | Azure OpenAI, AI Search, Private Endpoints |
| `oci-genai` | OCI | DACs, Endpoints, Agents, Knowledge Bases |
| `vector-store` | Multi | OpenSearch Serverless, Qdrant, Milvus |

Full module code: resources/modules.tf
## AWS AI Infrastructure

### Bedrock Module

```hcl
module "aws_ai" {
  source = "./modules/aws-bedrock"

  prefix                  = "prod"
  region                  = "us-east-1"
  vpc_id                  = data.aws_vpc.main.id
  private_subnet_ids      = data.aws_subnets.private.ids
  enable_private_endpoint = true
  create_knowledge_base   = true
}
```
### Key Resources

- IAM roles for Bedrock access
- VPC endpoints for private connectivity
- Knowledge bases with OpenSearch
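Inside the module, the private-connectivity piece boils down to an interface VPC endpoint for the Bedrock runtime API. A minimal sketch (resource names and references are illustrative, not the module's actual internals):

```hcl
# Interface endpoint so private subnets reach Bedrock without traversing the internet.
resource "aws_vpc_endpoint" "bedrock_runtime" {
  vpc_id              = data.aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.bedrock-runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = data.aws_subnets.private.ids
  private_dns_enabled = true
}
```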
## Azure AI Infrastructure

### Azure OpenAI Module

```hcl
module "azure_ai" {
  source = "./modules/azure-openai"

  openai_name             = "prod-openai"
  location                = "eastus"
  resource_group_name     = azurerm_resource_group.ai.name
  gpt4o_capacity          = 100 # TPM
  embedding_capacity      = 50
  enable_private_endpoint = true
}
```
### Key Resources

- Cognitive Account (OpenAI kind)
- Model deployments (GPT-4o, embeddings)
- Private endpoints
- Azure AI Search
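A model deployment inside the module might look roughly like this; resource names, the model version, and the SKU are illustrative assumptions, not the module's actual code:

```hcl
resource "azurerm_cognitive_deployment" "gpt4o" {
  name                 = "gpt-4o"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4o"
    version = "2024-08-06" # assumed; pin to a version available in your region
  }

  sku {
    name     = "Standard"
    capacity = var.gpt4o_capacity # capacity is in units of 1K tokens per minute
  }
}
```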
## OCI AI Infrastructure

### GenAI Module

```hcl
module "oci_ai" {
  source = "./modules/oci-genai"

  prefix                = "prod"
  compartment_id        = var.oci_compartment_id
  cluster_type          = "HOSTING"
  unit_count            = 10
  unit_shape            = "LARGE_COHERE"
  create_agent          = true
  create_knowledge_base = true
}
```
### Key Resources

- Dedicated AI Clusters (DACs)
- Model endpoints
- GenAI Agents
- Knowledge bases
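The DAC itself maps to a single OCI provider resource; a minimal sketch under the same inputs the module call above uses (the resource label is an assumption):

```hcl
# Dedicated hosting capacity for serving Cohere large models.
resource "oci_generative_ai_dedicated_ai_cluster" "hosting" {
  compartment_id = var.oci_compartment_id
  type           = "HOSTING"
  unit_count     = 10
  unit_shape     = "LARGE_COHERE"
}
```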
## Vector Store Infrastructure

### OpenSearch Serverless (AWS)

```hcl
module "vectors" {
  source = "./modules/vector-store/aws-opensearch"

  prefix             = "prod"
  vpc_endpoint_ids   = [aws_vpc_endpoint.opensearch.id]
  allowed_principals = [aws_iam_role.bedrock.arn]
}
```
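Under the hood, an OpenSearch Serverless vector collection needs an encryption security policy before it can be created. A minimal sketch (names are placeholders):

```hcl
# An encryption policy covering the collection must exist first.
resource "aws_opensearchserverless_security_policy" "encryption" {
  name = "prod-vectors-encryption"
  type = "encryption"
  policy = jsonencode({
    Rules = [{
      ResourceType = "collection"
      Resource     = ["collection/prod-vectors"]
    }]
    AWSOwnedKey = true
  })
}

resource "aws_opensearchserverless_collection" "vectors" {
  name       = "prod-vectors"
  type       = "VECTORSEARCH"
  depends_on = [aws_opensearchserverless_security_policy.encryption]
}
```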
## Multi-Cloud Environment

```hcl
# environments/prod/main.tf
terraform {
  backend "s3" {
    bucket  = "terraform-state-ai-infra"
    key     = "prod/terraform.tfstate"
    encrypt = true
  }
}

provider "aws" { region = var.aws_region }
provider "azurerm" { features {} }
provider "oci" { ... }

# Deploy to all clouds
module "aws_ai" { source = "../../modules/aws-bedrock" ... }
module "azure_ai" { source = "../../modules/azure-openai" ... }
module "oci_ai" { source = "../../modules/oci-genai" ... }
```
## Best Practices

### State Management

- Remote state (S3, Azure Blob, OCI Object Storage)
- State locking (DynamoDB for the S3 backend; blob leases for the Azure backend)
- Encrypt state at rest
- Separate state per environment
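For the S3 backend, those practices combine into a configuration like the following; the bucket and table names are placeholders, and the bucket and lock table must exist before `terraform init`:

```hcl
terraform {
  backend "s3" {
    bucket         = "terraform-state-ai-infra" # pre-created, versioned bucket
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                       # server-side encryption at rest
    dynamodb_table = "terraform-locks"          # table with a LockID (string) hash key
  }
}
```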
### Variable Validation

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}
```
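The same pattern guards numeric inputs, for example a capacity variable like the one the Azure module call above passes (the variable name and bound are illustrative):

```hcl
variable "gpt4o_capacity" {
  type        = number
  description = "GPT-4o deployment capacity in units of 1K tokens per minute"

  validation {
    condition     = var.gpt4o_capacity > 0
    error_message = "Capacity must be a positive number."
  }
}
```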
### Sensitive Data

- Use `sensitive = true` for outputs
- Reference secrets from secret managers
- Never commit `.tfvars` files containing secrets
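A sketch of both practices together, assuming AWS Secrets Manager; the secret name and the module output referenced are placeholders:

```hcl
# Pull the key from a secret manager instead of a .tfvars file.
data "aws_secretsmanager_secret_version" "api_key" {
  secret_id = "prod/ai-platform/api-key" # placeholder secret name
}

# Mark sensitive values so they are redacted from plan/apply output.
output "openai_endpoint" {
  value     = module.azure_ai.endpoint # assumes the module exposes this output
  sensitive = true
}
```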
### Tagging

`default_tags` is configured inside the AWS provider block:

```hcl
provider "aws" {
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "ai-platform"
      ManagedBy   = "terraform"
    }
  }
}
```
## CI/CD Integration

### GitHub Actions

```yaml
- uses: hashicorp/setup-terraform@v3
- run: terraform init
- run: terraform plan -out=tfplan
- run: terraform apply -auto-approve tfplan
  if: github.ref == 'refs/heads/main'
```
### Key Patterns

- Plan on PR, apply on merge
- Use workspaces or directories for environments
- Lock state during apply
- Store plan artifacts
## Managed Kubernetes GPU

### EKS GPU Nodes

```hcl
eks_managed_node_groups = {
  gpu = {
    instance_types = ["g5.2xlarge"]
    ami_type       = "AL2_x86_64_GPU"
    taints         = [{ key = "nvidia.com/gpu" ... }]
  }
}
```
### AKS GPU Nodes

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  vm_size     = "Standard_NC24ads_A100_v4"
  node_taints = ["nvidia.com/gpu=true:NoSchedule"]
}
```
### OKE GPU Nodes

```hcl
resource "oci_containerengine_node_pool" "gpu" {
  node_shape = "BM.GPU.A100-v2.8"
}
```

Full examples: resources/modules.tf
## Resources

Infrastructure as Code for enterprise AI deployments.