Terraform at Scale: Lessons from Managing 500+ Resources

Problem 1: Monolithic State

Everything was in one state file. VPCs, databases, Kubernetes clusters, DNS, IAM - all in one giant blob.

Before:

  • 1 state file, 500+ resources
  • terraform plan: 8 minutes
  • terraform apply: timeout risk
  • Blast radius: everything

Solution: State Decomposition

infrastructure/
├── network/    # VPCs, subnets, security groups
├── data/       # RDS, ElastiCache, S3
├── compute/    # EKS, ASGs, launch templates
├── dns/        # Route53 zones and records
├── iam/        # Roles, policies, users
└── monitoring/ # CloudWatch, SNS topics

Each directory has its own state file. Use the terraform_remote_state data source to read another stack's outputs across these boundaries:

# compute/main.tf
# Read the outputs of the network stack from its remote state
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  # Required arguments such as name and role_arn are omitted for brevity

  vpc_config {
    subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  }
}
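
The remote state lookup above only works if the network stack exports the value as a root-level output. A minimal sketch of that side, assuming the private subnets are declared with count under the hypothetical resource name aws_subnet.private (the article does not show the network code):

```hcl
# network/outputs.tf
# Only root-level outputs are visible through terraform_remote_state.
# aws_subnet.private is an assumed resource name, not taken from the article.
output "private_subnet_ids" {
  description = "IDs of the private subnets, consumed by the compute stack"
  value       = aws_subnet.private[*].id
}
```

Keeping the set of outputs small makes the contract between stacks explicit: anything not exported here cannot be depended on from compute/.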

Result: 6 state files, 60-100 resources each. Plan time: 45 seconds.
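
The steps used to move resources out of the monolith are not described here. One possible approach, assuming Terraform 1.7 or later, is to pair an import block in the new stack with a removed block in the old one, so the resource changes owner without being destroyed. The resource address, file names, and VPC ID below are hypothetical placeholders:

```hcl
# network/imports.tf (new stack): adopt the existing VPC instead of creating one.
# Requires a matching resource "aws_vpc" "main" block in the network stack.
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # placeholder ID
}

# monolith/removed.tf (old stack): drop the VPC from state but leave it running.
# The original resource block must be deleted from the monolith's configuration.
removed {
  from = aws_vpc.main

  lifecycle {
    destroy = false
  }
}
```

Running terraform plan in both stacks before applying should show an import on one side and a "forget" on the other, with nothing created or destroyed.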

Problem 2: Environment Drift

Dev, staging, and prod drifted constantly because each was copy-pasted.

Solution: Modules + Terragrunt

modules/
├── eks-cluster/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds-instance/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

environments/
├── dev/
│   └── terragrunt.hcl
├── staging/
│   └── terragrunt.hcl
└── prod/
    └── terragrunt.hcl

# environments/prod/terragrunt.hcl
terraform {
  source = "../../modules/eks-cluster"
}

inputs = {
  cluster_name  = "prod-main"
  node_count    = 10
  instance_type = "m5.2xlarge"
  multi_az      = true
}
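
For those inputs to reach the module, it has to declare matching variables. A sketch of what modules/eks-cluster/variables.tf might look like; the types and the validation rule are assumptions inferred from the prod values above, not the author's actual file:

```hcl
# modules/eks-cluster/variables.tf
# No defaults on purpose: every environment must state its own sizing.

variable "cluster_name" {
  type        = string
  description = "Name of the EKS cluster, e.g. prod-main"
}

variable "node_count" {
  type        = number
  description = "Number of worker nodes"

  validation {
    condition     = var.node_count >= 1
    error_message = "node_count must be at least 1."
  }
}

variable "instance_type" {
  type        = string
  description = "EC2 instance type for worker nodes, e.g. m5.2xlarge"
}

variable "multi_az" {
  type        = bool
  description = "Whether to spread nodes across availability zones"
}
```

With this in place, dev and staging differ from prod only in the inputs block of their own terragrunt.hcl, which is what keeps the environments from drifting apart.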

Problem 3: Dangerous Applies

Anyone could run terraform apply against production from their laptop.

Solution: CI/CD Only

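# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

permissions:
  contents: read
  pull-requests: write  # Needed to post the plan comment

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # Keep wrapper output out of plan.json
      # Cloud credentials and per-directory working-directory settings omitted for brevity
      - run: terraform init
      - run: terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # Hand the saved plan to the apply job
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: plan.tfplan
      # Post plan as PR comment (pull requests only; push events have no PR number)
      - uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const plan = require('./plan.json');
            const resourceChanges = plan.resource_changes || [];
            const count = action => resourceChanges.filter(c => c.change.actions.includes(action)).length;
            const adds = count('create');
            const changes = count('update');
            const deletes = count('delete');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Terraform Plan\n+${adds} ~${changes} -${deletes}\n\n${deletes > 0 ? '⚠ RESOURCES WILL BE DESTROYED' : ''}`
            });

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # Requires approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      # Fetch the exact plan that was reviewed in the plan job
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply plan.tfplan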

Problem 4: State Locks

Multiple engineers running plan simultaneously caused state lock conflicts.

Solution: Remote State with DynamoDB Locking

terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
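
The backend expects the lock table to exist already. A minimal sketch of that table, assuming it is created in a separate bootstrap stack (Terraform cannot create the table from the same state it is meant to lock). The table name comes from the backend block above; the resource label and billing mode are illustrative choices:

```hcl
# bootstrap/locks.tf (hypothetical location)
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # must match dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST" # lock traffic is tiny, so on-demand billing fits
  hash_key     = "LockID"          # the S3 backend requires exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

One table can serve every state file in the bucket, because each lock item is keyed by the state's bucket and path.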

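In addition, only CI/CD runs apply. Engineers who want a quick local check run terraform plan -lock=false, which skips acquiring the state lock and so never blocks, or is blocked by, a pipeline run.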

Results

Metric                      Before     After
Plan time                   8 min      45 sec
Apply failures              3/week     0.5/week
State conflicts             Daily      Never
Env drift incidents         Monthly    None in 6 months
Time to provision new env   2 days     30 minutes

If you want AI-powered infrastructure management that catches drift before it causes outages, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD, Founder & CEO, Nova AI Ops.
