Terraform at Scale: Lessons from Managing 500+ Resources
Problem 1: Monolithic State
Everything was in one state file. VPCs, databases, Kubernetes clusters, DNS, IAM - all in one giant blob.
Before:
- 1 state file, 500+ resources
- terraform plan: 8 minutes
- terraform apply: timeout risk
- Blast radius: everything
Solution: State Decomposition
infrastructure/
├── network/      # VPCs, subnets, security groups
├── data/         # RDS, ElastiCache, S3
├── compute/      # EKS, ASGs, launch templates
├── dns/          # Route53 zones and records
├── iam/          # Roles, policies, users
└── monitoring/   # CloudWatch, SNS topics
Each directory has its own state file. To reference values across those boundaries, read the other stack's outputs through the terraform_remote_state data source:
# compute/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  # name, role_arn, and other required arguments omitted for brevity
  vpc_config {
    subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  }
}
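The lookup above only resolves if the network stack publishes private_subnet_ids as a root-level output, because terraform_remote_state exposes nothing except root module outputs. A minimal sketch of that output follows. It assumes the private subnets are declared as aws_subnet.private using count or for_each; that resource name is hypothetical and not taken from the original setup.

```hcl
# network/outputs.tf
# Assumption: private subnets are declared elsewhere in this stack as
# aws_subnet.private (with count or for_each). The name is illustrative.
output "private_subnet_ids" {
  description = "IDs of the private subnets, read by the compute stack"
  value       = [for s in aws_subnet.private : s.id]
}
```

Treating outputs like this as the stack's public interface keeps the coupling explicit: the compute stack depends on one named value, not on how the network stack builds it.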
Result: 6 state files, 60-100 resources each. Plan time: 45 seconds.
Problem 2: Environment Drift
Dev, staging, and prod drifted constantly because each was copy-pasted.
Solution: Modules + Terragrunt
modules/
├── eks-cluster/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds-instance/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
environments/
├── dev/
│   └── terragrunt.hcl
├── staging/
│   └── terragrunt.hcl
└── prod/
    └── terragrunt.hcl
# environments/prod/terragrunt.hcl
terraform {
  source = "../../modules/eks-cluster"
}

inputs = {
  cluster_name  = "prod-main"
  node_count    = 10
  instance_type = "m5.2xlarge"
  multi_az      = true
}
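Terragrunt hands the inputs map to Terraform as TF_VAR_ environment variables, so the module has to declare a matching variable for each key. The sketch below shows what modules/eks-cluster/variables.tf might look like. The types are inferred from the prod values above; the descriptions, the default, and the validation rule are assumptions added for illustration.

```hcl
# modules/eks-cluster/variables.tf
# Sketch only: types are inferred from the prod inputs; descriptions,
# the default, and the validation rule are illustrative assumptions.
variable "cluster_name" {
  type        = string
  description = "Name of the EKS cluster"
}

variable "node_count" {
  type        = number
  description = "Number of worker nodes"

  validation {
    condition     = var.node_count >= 1
    error_message = "node_count must be at least 1."
  }
}

variable "instance_type" {
  type        = string
  description = "EC2 instance type for worker nodes"
}

variable "multi_az" {
  type        = bool
  description = "Whether to spread the cluster across availability zones"
  default     = false
}
```

With the module's interface pinned down like this, dev and staging point at the same source and differ only in their inputs, which is what removes the copy-paste drift.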
Problem 3: Dangerous Applies
Anyone could run terraform apply against production from their laptop.
Solution: CI/CD Only
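# .github/workflows/terraform.yml
# Cloud credentials and the working directory are assumed to be configured separately.
name: Terraform
on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']
jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write # Needed to post the plan comment
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # Keep stdout clean for the JSON redirect below
      - run: terraform init
      - run: terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # Hand the saved plan to the apply job, which runs on a separate runner
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: plan.tfplan
      # Post plan as PR comment
      - uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const plan = require('./plan.json');
            const rc = plan.resource_changes || [];
            const adds = rc.filter(c => c.change.actions.includes('create')).length;
            const changes = rc.filter(c => c.change.actions.includes('update')).length;
            const deletes = rc.filter(c => c.change.actions.includes('delete')).length;
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Terraform Plan\n+${adds} ~${changes} -${deletes}\n\n${deletes > 0 ? '⚠️ RESOURCES WILL BE DESTROYED' : ''}`
            });
  apply:
    needs: plan
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production # Requires approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply plan.tfplan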
Problem 4: State Locks
Multiple engineers running plan simultaneously caused state lock conflicts.
Solution: Remote State with DynamoDB Locking
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
In addition, only CI/CD runs apply. Humans run terraform plan -lock=false locally for quick checks; the flag skips acquiring the state lock, so a local plan never blocks a CI run.
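The backend block refers to a DynamoDB table that has to exist before the first terraform init. A minimal sketch of that table is below, assuming it is managed in a small bootstrap stack of its own; the billing mode is an illustrative choice, not something the original setup specifies. The S3 backend expects a partition key named LockID of type string.

```hcl
# Assumption: the lock table lives in a separate bootstrap stack, since the
# backend that uses it cannot be initialized until the table exists.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # Must match dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST" # Illustrative: lock traffic is tiny and bursty
  hash_key     = "LockID"          # The S3 backend requires this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

One table can serve every stack, because each lock item is keyed by the state file's bucket and path.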
Results
| Metric | Before | After |
|---|---|---|
| Plan time | 8 min | 45 sec |
| Apply failures | 3/week | 0.5/week |
| State conflicts | Daily | Never |
| Env drift incidents | Monthly | None in 6 months |
| Time to provision new env | 2 days | 30 minutes |
If you want AI-powered infrastructure management that catches drift before it causes outages, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD, Founder & CEO, Nova AI Ops.