Terraform Infrastructure as Code

Overview

Terraform enables infrastructure as code (IaC), defining cloud resources declaratively in HCL (HashiCorp Configuration Language). Instead of manually creating resources through cloud consoles, you declare desired state in .tf files, version control them, and let Terraform create, update, or destroy resources to match that state. This approach provides consistency, repeatability, and auditability for infrastructure changes.

Infrastructure as code transforms infrastructure management from manual, error-prone processes into automated, tested workflows. Changes to infrastructure follow the same code review process as application code. Infrastructure configurations live in version control alongside application code, enabling correlation between infrastructure and application changes. Terraform's state tracking enables detection of manual changes (drift) and safe, incremental updates.

Terraform's declarative approach abstracts cloud provider APIs - you declare "create an S3 bucket with encryption enabled" rather than calling specific AWS API endpoints. Terraform determines the necessary API calls, handles dependencies (create VPC before subnets), and manages resource lifecycle (update in place vs destroy and recreate). The same patterns apply across cloud providers (AWS, Azure, GCP) with provider-specific resource types.
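The "S3 bucket with encryption enabled" example above might look like the following HCL sketch (resource and bucket names here are illustrative, not part of the project configuration shown later):

```hcl
# Declare desired state; Terraform determines the necessary AWS API calls.
resource "aws_s3_bucket" "example" {
  bucket = "my-example-bucket" # illustrative name
}

resource "aws_s3_bucket_server_side_encryption_configuration" "example" {
  bucket = aws_s3_bucket.example.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```

Note that nothing here says "call PutBucketEncryption" - Terraform derives the API calls, and the reference from the encryption resource to aws_s3_bucket.example.id establishes the creation order.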

For containerized infrastructure deployment, see Kubernetes Best Practices. For CI/CD pipeline integration, see GitLab CI/CD Pipelines.


Core Principles

  1. Declarative Configuration: Define desired state, Terraform converges reality
  2. Modular Design: Reusable modules for common patterns
  3. Remote State: Store state centrally with locking
  4. Immutable Infrastructure: Replace rather than modify resources
  5. Version Control: All .tf files in Git, peer-reviewed changes
  6. Separation of Concerns: Environments isolated via workspaces or directories
  7. Least Privilege: IAM roles with minimal permissions for Terraform
  8. Testing: Automated tests for infrastructure code (Terratest, validation)

Project Structure

Terraform project organization balances reusability (modules for common patterns) with environment isolation (separate state for dev/staging/prod). A well-structured project enables teams to manage infrastructure at scale without conflicts or duplication.

Understanding Project Organization

Monolithic projects (all resources in one directory) are simple initially but become unwieldy as infrastructure grows. A single state file means any change locks the entire infrastructure during apply. Multiple engineers cannot work on different resources simultaneously. The blast radius of errors is large - a mistake in one resource definition can destroy unrelated resources.

Modular projects decompose infrastructure into logical units (networking, compute, databases, monitoring). Each module is a reusable component with defined inputs (variables) and outputs. The root module composes these modules, passing values between them. This structure enables parallel development (network team works on VPC module while compute team works on EKS module), reduces blast radius (errors affect only the modified module), and promotes reuse (VPC module works for dev, staging, and prod).

Environment separation uses either separate directories (environments/dev/, environments/staging/, environments/prod/) or Terraform workspaces. Directory separation provides stronger isolation (separate state files, impossible to accidentally apply prod changes to dev) at the cost of some duplication. Workspaces share configuration but maintain separate state, reducing duplication but increasing risk of cross-environment changes. This environment strategy should align with your Git branching strategy to maintain consistency between code and infrastructure deployment workflows.

terraform/
├── modules/                    # Reusable modules
│   ├── networking/
│   │   ├── main.tf             # VPC, subnets, routing tables
│   │   ├── variables.tf        # Input variables
│   │   ├── outputs.tf          # Output values
│   │   └── README.md           # Module documentation
│   ├── compute/
│   │   ├── main.tf             # EC2, ASG, launch templates
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── database/
│   │   ├── main.tf             # RDS, parameter groups, subnets
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   └── kubernetes/
│       ├── main.tf             # EKS cluster, node groups
│       ├── variables.tf
│       ├── outputs.tf
│       └── README.md
├── environments/
│   ├── dev/
│   │   ├── main.tf             # Root module for dev
│   │   ├── variables.tf        # Environment-specific variables
│   │   ├── terraform.tfvars    # Dev variable values
│   │   ├── backend.tf          # Remote state configuration
│   │   └── outputs.tf          # Environment outputs
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── outputs.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       ├── backend.tf
│       └── outputs.tf
├── global/                     # Shared resources (IAM, Route53)
│   ├── iam/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── dns/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── tests/                      # Terratest integration tests
│   ├── networking_test.go
│   ├── compute_test.go
│   └── go.mod
├── .terraform.lock.hcl         # Provider version lock file
├── .gitignore                  # Ignore .terraform/, *.tfstate
└── README.md                   # Project overview

Structure rationale: The modules/ directory contains reusable components. The environments/ directory uses these modules with environment-specific configurations (dev uses t3.small instances, prod uses t3.large). The global/ directory holds resources shared across environments (IAM roles, DNS zones). The tests/ directory contains automated tests validating infrastructure behavior.

Root Module Example

# environments/prod/main.tf
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = "production"
      ManagedBy   = "Terraform"
      Project     = "banking-platform"
      CostCenter  = "engineering"
    }
  }
}

# Networking module
module "networking" {
  source = "../../modules/networking"

  environment        = "prod"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  public_subnets     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  private_subnets    = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets   = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false # HA: one NAT gateway per AZ

  tags = var.tags
}

# Database module
module "database" {
  source = "../../modules/database"

  environment = "prod"

  identifier        = "banking-prod-db"
  engine            = "postgres"
  engine_version    = "16.1"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
  storage_encrypted = true
  kms_key_id        = module.kms.database_key_arn

  database_name = "banking"
  username      = "dbadmin"
  password      = var.db_password # From secret management

  vpc_id                 = module.networking.vpc_id
  subnet_ids             = module.networking.database_subnet_ids
  vpc_security_group_ids = [module.networking.database_sg_id]

  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  deletion_protection = true
  skip_final_snapshot = false

  tags = var.tags
}

# Kubernetes cluster
module "kubernetes" {
  source = "../../modules/kubernetes"

  environment = "prod"

  cluster_name    = "banking-prod"
  cluster_version = "1.28"

  vpc_id     = module.networking.vpc_id
  subnet_ids = module.networking.private_subnet_ids

  node_groups = {
    general = {
      desired_size   = 3
      min_size       = 3
      max_size       = 10
      instance_types = ["t3.large"]
      capacity_type  = "ON_DEMAND"
      disk_size      = 50
      labels = {
        role = "general"
      }
    }
    compute = {
      desired_size   = 2
      min_size       = 2
      max_size       = 20
      instance_types = ["c6i.2xlarge"]
      capacity_type  = "SPOT"
      disk_size      = 100
      labels = {
        role = "compute"
      }
      taints = [{
        key    = "workload"
        value  = "compute-intensive"
        effect = "NoSchedule"
      }]
    }
  }

  enable_irsa                         = true
  enable_cluster_autoscaler           = true
  enable_metrics_server               = true
  enable_aws_load_balancer_controller = true

  tags = var.tags
}

Module composition: The root module instantiates the networking, database, and Kubernetes modules, passing outputs from one module as inputs to another (the VPC ID from networking to database). The var.db_password value comes from CI/CD secrets or environment variables and is never committed to Git. The tags variable, applied to all resources, enables cost tracking and resource management.


Module Design Patterns

Terraform modules are reusable infrastructure components with defined interfaces (input variables, output values). Well-designed modules are composable (combine multiple modules to build complex infrastructure), testable (isolated functionality enables targeted testing), and maintainable (clear inputs/outputs, comprehensive documentation).

Understanding Module Design

Module granularity balances reusability and complexity. Fine-grained modules (one module per resource type: S3 bucket module, IAM role module) are highly reusable but require composing many modules to build infrastructure. Coarse-grained modules (entire application stack module) are convenient but less reusable and harder to test. The optimal granularity groups related resources into logical units (VPC module includes subnets, route tables, NAT gateways; EKS module includes cluster, node groups, IAM roles).

Variable validation enforces constraints declaratively, catching errors early (during terraform plan) rather than late (during terraform apply when cloud APIs reject invalid values). Validation rules document assumptions (CIDR blocks must be valid, instance types must match regex patterns) and prevent common mistakes.

Validation happens during the plan phase, before Terraform calls cloud provider APIs. This early validation saves time - a typo in an instance type fails immediately during plan rather than 5 minutes into an apply when AWS rejects the invalid instance type. Validation also provides clearer error messages than cloud provider API errors. Compare Terraform's "Instance type must be a t3 family instance" to AWS's cryptic "InvalidParameterValue: Invalid value for parameter instanceType."

Validation rules serve as executable documentation. Reading validation { condition = var.replica_count >= 1 && var.replica_count <= 10 } immediately communicates that replica count has bounds. This is clearer than a comment that might become outdated. Future consumers of the module see validation errors when they provide invalid inputs, guiding them to correct usage without reading documentation.

Output values expose module internals needed by other modules (VPC ID, database endpoint) while hiding implementation details. Outputs enable module composition - the database module doesn't need to know how the VPC module creates subnets, only that it provides subnet IDs.

Outputs create clear contracts between modules. The networking module guarantees to provide vpc_id, public_subnet_ids, private_subnet_ids, and database_subnet_ids. Consumers depend on these outputs, not on internal resource names. This abstraction enables refactoring - the networking module can change how it creates subnets (from count to for_each, from separate subnet resources to a subnet module) without breaking consumers, as long as outputs remain stable.

Output descriptions are critical. output "vpc_id" provides a value, but output "vpc_id" { description = "ID of the VPC for resource association" } explains what the value is and how to use it. Descriptions appear in terraform output command results and in documentation generation tools, making modules self-documenting.

Networking Module

# modules/networking/variables.tf
variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "VPC CIDR must be a valid IPv4 CIDR block."
  }
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)

  validation {
    condition     = length(var.availability_zones) >= 2
    error_message = "At least 2 availability zones required for high availability."
  }
}

variable "public_subnets" {
  description = "List of public subnet CIDR blocks"
  type        = list(string)
}

variable "private_subnets" {
  description = "List of private subnet CIDR blocks"
  type        = list(string)
}

variable "database_subnets" {
  description = "List of database subnet CIDR blocks"
  type        = list(string)
}

variable "enable_nat_gateway" {
  description = "Enable NAT gateway for private subnets"
  type        = bool
  default     = true
}

variable "single_nat_gateway" {
  description = "Use single NAT gateway instead of one per AZ (cost vs HA)"
  type        = bool
  default     = false
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}

Validation benefits: The environment validation prevents typos (var.environment = "production" fails, must be exactly "prod"). The vpc_cidr validation catches invalid CIDR blocks early. The availability_zones validation enforces high availability (multi-AZ deployment).

# modules/networking/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(
    var.tags,
    {
      Name        = "${var.environment}-vpc"
      Environment = var.environment
    }
  )
}

resource "aws_subnet" "public" {
  count = length(var.public_subnets)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnets[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = merge(
    var.tags,
    {
      Name                     = "${var.environment}-public-${var.availability_zones[count.index]}"
      Environment              = var.environment
      Type                     = "public"
      "kubernetes.io/role/elb" = "1" # For AWS Load Balancer Controller
    }
  )
}

resource "aws_subnet" "private" {
  count = length(var.private_subnets)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(
    var.tags,
    {
      Name                              = "${var.environment}-private-${var.availability_zones[count.index]}"
      Environment                       = var.environment
      Type                              = "private"
      "kubernetes.io/role/internal-elb" = "1" # For AWS Load Balancer Controller
    }
  )
}

resource "aws_subnet" "database" {
  count = length(var.database_subnets)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.database_subnets[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(
    var.tags,
    {
      Name        = "${var.environment}-database-${var.availability_zones[count.index]}"
      Environment = var.environment
      Type        = "database"
    }
  )
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(
    var.tags,
    {
      Name        = "${var.environment}-igw"
      Environment = var.environment
    }
  )
}

resource "aws_eip" "nat" {
  count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.availability_zones)) : 0

  domain = "vpc"

  tags = merge(
    var.tags,
    {
      Name        = "${var.environment}-nat-eip-${count.index + 1}"
      Environment = var.environment
    }
  )

  depends_on = [aws_internet_gateway.main]
}

resource "aws_nat_gateway" "main" {
  count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.availability_zones)) : 0

  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(
    var.tags,
    {
      Name        = "${var.environment}-nat-${count.index + 1}"
      Environment = var.environment
    }
  )

  depends_on = [aws_internet_gateway.main]
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = merge(
    var.tags,
    {
      Name        = "${var.environment}-public-rt"
      Environment = var.environment
      Type        = "public"
    }
  )
}

resource "aws_route_table" "private" {
  count = var.single_nat_gateway ? 1 : length(var.availability_zones)

  vpc_id = aws_vpc.main.id

  # Only add the default route when NAT gateways exist; a static route block
  # would fail when enable_nat_gateway = false (no NAT gateways to index into)
  dynamic "route" {
    for_each = var.enable_nat_gateway ? [1] : []
    content {
      cidr_block     = "0.0.0.0/0"
      nat_gateway_id = aws_nat_gateway.main[count.index].id
    }
  }

  tags = merge(
    var.tags,
    {
      Name        = "${var.environment}-private-rt-${count.index + 1}"
      Environment = var.environment
      Type        = "private"
    }
  )
}

resource "aws_route_table_association" "public" {
  count = length(var.public_subnets)

  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count = length(var.private_subnets)

  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = var.single_nat_gateway ? aws_route_table.private[0].id : aws_route_table.private[count.index].id
}

Design decisions: The count meta-argument creates multiple subnets from the provided lists. The single_nat_gateway variable controls cost vs high availability: true creates one NAT gateway (cheaper but single point of failure), false creates one per AZ (more expensive but HA). The depends_on ensures proper creation order (NAT gateways depend on internet gateway existing).

# modules/networking/outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "vpc_cidr" {
  description = "CIDR block of the VPC"
  value       = aws_vpc.main.cidr_block
}

output "public_subnet_ids" {
  description = "List of public subnet IDs"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "List of private subnet IDs"
  value       = aws_subnet.private[*].id
}

output "database_subnet_ids" {
  description = "List of database subnet IDs"
  value       = aws_subnet.database[*].id
}

output "nat_gateway_ips" {
  description = "Elastic IPs of NAT gateways"
  value       = aws_eip.nat[*].public_ip
}

Output design: Outputs expose only necessary information for module consumers. The VPC ID, subnet IDs, and NAT gateway IPs are needed by other modules. Internal details (route table IDs, subnet CIDR blocks) are not exposed unless required.


State Management

Terraform state tracks the mapping between Terraform resources and real-world infrastructure. State enables Terraform to determine what changes are necessary to achieve desired state (update in place, destroy and recreate, no changes needed). Proper state management is critical - corrupted or lost state can make infrastructure unmanageable.

Understanding Terraform State

Local state (default) stores state in terraform.tfstate file in the working directory. Local state works for individual developers experimenting but fails for teams: no locking (two engineers running terraform apply simultaneously corrupt state), no sharing (each engineer has separate state), no backup (lost laptop = lost state).

Remote state stores state centrally (S3, Azure Storage, Terraform Cloud, Consul) with locking (DynamoDB, Azure Blob leases) to prevent concurrent modifications. Remote state enables team collaboration (shared state), provides backup and versioning (S3 versioning, point-in-time recovery), and enables state locking (prevents simultaneous applies).

State locking prevents race conditions. When engineer A runs terraform apply, Terraform acquires a lock (DynamoDB entry, blob lease). If engineer B runs terraform apply while A's apply is running, Terraform detects the lock and fails immediately, preventing state corruption from concurrent modifications.

The lock mechanism works at the state file level - each environment's state file has its own lock. This means engineer A can apply changes to dev while engineer B applies to prod simultaneously (different state files, different locks). However, two engineers cannot modify the same environment concurrently.

Lock acquisition happens before state read. When terraform apply starts, it first attempts to create a DynamoDB item with a unique lock ID. DynamoDB's conditional write ensures only one process can create this item. If the item already exists (another apply in progress), DynamoDB returns an error, and Terraform immediately fails with lock information. This fail-fast behavior prevents wasted work - rather than planning changes that can't be applied, Terraform exits before reading state.

Remote State with S3 and DynamoDB

# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "banking-platform/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
    dynamodb_table = "terraform-state-lock"

    # Keep credential, metadata, and region validation enabled (the defaults)
    skip_credentials_validation = false
    skip_metadata_api_check     = false
    skip_region_validation      = false
  }
}

Backend configuration: State is stored in S3 bucket example-terraform-state with server-side encryption (KMS key). The key parameter is the S3 object key - use unique keys per environment (banking-platform/dev/terraform.tfstate, banking-platform/staging/terraform.tfstate, banking-platform/prod/terraform.tfstate) to isolate environment state. DynamoDB table terraform-state-lock provides locking.

S3 bucket setup (one-time bootstrap configuration):

# bootstrap/state-backend.tf
# Run once to create S3 bucket and DynamoDB table for remote state

resource "aws_s3_bucket" "terraform_state" {
  bucket = "example-terraform-state"

  lifecycle {
    prevent_destroy = true # Prevent accidental deletion
  }

  tags = {
    Name      = "Terraform State"
    ManagedBy = "Terraform"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  lifecycle {
    prevent_destroy = true # Prevent accidental deletion
  }

  tags = {
    Name      = "Terraform State Lock"
    ManagedBy = "Terraform"
  }
}

resource "aws_kms_key" "terraform_state" {
  description             = "KMS key for Terraform state encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true

  tags = {
    Name      = "Terraform State Encryption"
    ManagedBy = "Terraform"
  }
}

resource "aws_kms_alias" "terraform_state" {
  name          = "alias/terraform-state"
  target_key_id = aws_kms_key.terraform_state.key_id
}

Bootstrap process: Run this configuration once with local state (terraform init && terraform apply). After the S3 bucket and DynamoDB table exist, configure remote backend in other projects. The prevent_destroy lifecycle rule prevents accidental deletion of state storage.
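The bootstrap-then-migrate workflow can be sketched as follows (paths follow the project structure above; terraform init -migrate-state moves any existing local state into the newly configured backend):

```shell
# Create the state backend with local state
cd bootstrap/
terraform init
terraform apply

# Then, in each environment that declares the S3 backend in backend.tf:
cd ../environments/prod/
terraform init -migrate-state   # copies local state (if any) into S3
```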

State versioning: S3 versioning enables recovery from state corruption. If terraform apply corrupts state, retrieve the previous version from S3 (aws s3api list-object-versions --bucket example-terraform-state --prefix banking-platform/prod/) and restore it.
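One possible recovery sequence, assuming the AWS CLI and appropriate S3 permissions (the version ID placeholder comes from the list-object-versions output):

```shell
# Identify the last-known-good version of the state object
aws s3api list-object-versions \
  --bucket example-terraform-state \
  --prefix banking-platform/prod/terraform.tfstate

# Download that specific version
aws s3api get-object \
  --bucket example-terraform-state \
  --key banking-platform/prod/terraform.tfstate \
  --version-id <VERSION_ID> \
  terraform.tfstate.restored

# Re-upload it as the current state object
aws s3 cp terraform.tfstate.restored \
  s3://example-terraform-state/banking-platform/prod/terraform.tfstate
```

Verify the restored state with terraform plan before making further changes - the plan should show only the drift introduced by the corrupting apply.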

State Locking

# Engineer A runs terraform apply
$ terraform apply
Acquiring state lock. This may take a few moments...

# Engineer B tries to run terraform apply simultaneously
$ terraform apply
Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        abc123-def456-ghi789
  Path:      example-terraform-state/banking-platform/prod/terraform.tfstate
  Operation: OperationTypeApply
  Who:       [email protected]
  Version:   1.6.0
  Created:   2025-01-08 10:30:00

Lock behavior: Engineer A's terraform apply creates a DynamoDB entry with lock metadata (who, when, operation type). Engineer B's concurrent terraform apply attempts to create the same DynamoDB entry, fails due to conditional check (entry already exists), and exits immediately with error. When A's apply completes, the lock is released (DynamoDB entry deleted), and B can retry.

Force unlock (dangerous):

# If apply crashes and doesn't release lock
$ terraform force-unlock abc123-def456-ghi789

# Only use if you're certain no other apply is running!

When to force unlock: If an apply process crashes (laptop dies, CI job killed) without releasing the lock, the lock remains in DynamoDB indefinitely. Use terraform force-unlock with the lock ID (from error message) to manually release. Never force unlock if another apply might be running - this can corrupt state.

State Encryption

State files contain sensitive data (database passwords from aws_db_instance.password, API keys from random_password resources). Encryption at rest (S3 SSE-KMS) protects state files in storage. Encryption in transit (HTTPS for S3 API calls) protects state during upload/download. This aligns with our data protection guidelines for handling sensitive information at rest and in transit.

# Sensitive values in state
resource "aws_db_instance" "main" {
  identifier = "banking-prod-db"

  username = "dbadmin"
  password = var.db_password # Stored in state file in plaintext!

  # ... other configuration
}

# Mark outputs as sensitive to hide from console output
output "database_password" {
  description = "Database password"
  value       = aws_db_instance.main.password
  sensitive   = true # Hides from terraform output
}

Sensitive data handling: Variables marked sensitive = true are hidden from console output but still stored in state. State encryption (S3 SSE-KMS) protects state at rest. Access control (IAM policies restricting S3 bucket access) limits who can read state. Never commit state files to Git - add *.tfstate and *.tfstate.backup to .gitignore.
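A minimal .gitignore covering the files mentioned above (the crash.log entry is an additional common convention, since Terraform writes a crash log on panic):

```shell
# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
crash.log

# Optionally ignore variable files that may hold secrets
# (see the note on terraform.tfvars for production below)
*.tfvars
```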


Workspace Strategies

Terraform workspaces enable managing multiple environments (dev, staging, prod) from a single configuration. Each workspace has separate state while sharing configuration code.

Workspace vs Directory Approaches:

Understanding Workspaces

Workspace benefits: Reduced code duplication (one configuration serves multiple environments), easier to keep environments in sync (change to VPC module applies to all environments), simpler repository structure (no environment directories).

Workspace risks: Easy to apply changes to wrong environment (accidentally apply prod changes to dev), harder to enforce environment-specific controls (CODEOWNERS, branch protection), shared configuration may not fit all environments (prod needs larger instances than dev).

When to use workspaces: Small projects (<10 resources), similar environments (dev/staging/prod with same structure but different sizes), single-team ownership. Workspaces work well when environments differ only in scale - dev uses 2 replicas and t3.small instances, prod uses 10 replicas and t3.large instances, but the infrastructure topology is identical. The workspace name can drive conditional logic (terraform.workspace == "prod") to adjust resource sizes without duplicating configuration.

When to use directories: Large projects (>50 resources), divergent environments (prod has DR, monitoring, compliance resources dev doesn't), multi-team ownership (network team owns VPC, compute team owns EKS), need for environment-specific CODEOWNERS or PR requirements. Directory separation provides stronger guarantees - it's impossible to accidentally apply prod changes to dev because they have completely separate state files and configuration directories. This approach enables different approval workflows (prod changes require security team approval via CODEOWNERS, dev changes don't) and supports environments with fundamentally different architectures (prod has multi-region failover, dev is single-region).

Hybrid approach: Use directories for major environment differences (dev/staging/prod as separate directories) and workspaces within environments for temporary or feature-specific infrastructure (workspace per feature branch for integration testing). This combines the safety of directory isolation for production with the convenience of workspaces for ephemeral environments.

Workspace Usage

# List workspaces
$ terraform workspace list
default
* dev
staging
prod

# Create new workspace
$ terraform workspace new staging

# Switch workspace
$ terraform workspace select prod

# Show current workspace
$ terraform workspace show
prod

# Delete workspace (must be empty)
$ terraform workspace delete dev

Workspace-Aware Configuration

# main.tf
locals {
  environment = terraform.workspace

  # Environment-specific configuration
  config = {
    dev = {
      instance_type    = "t3.small"
      replica_count    = 1
      backup_retention = 7
    }
    staging = {
      instance_type    = "t3.medium"
      replica_count    = 2
      backup_retention = 14
    }
    prod = {
      instance_type    = "t3.large"
      replica_count    = 3
      backup_retention = 30
    }
  }

  env_config = local.config[local.environment]
}

resource "aws_instance" "app" {
  count = local.env_config.replica_count

  ami           = data.aws_ami.ubuntu.id
  instance_type = local.env_config.instance_type

  tags = {
    Name        = "${local.environment}-app-${count.index + 1}"
    Environment = local.environment
  }
}

resource "aws_db_instance" "main" {
  identifier = "${local.environment}-database"

  instance_class = local.environment == "prod" ? "db.r6g.xlarge" : "db.t4g.medium"

  backup_retention_period = local.env_config.backup_retention

  # Production-specific features
  multi_az            = local.environment == "prod"
  deletion_protection = local.environment == "prod"

  tags = {
    Name        = "${local.environment}-database"
    Environment = local.environment
  }
}

# Backend configuration with workspace-specific state
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "banking-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # workspace_key_prefix creates separate state per workspace, at
    # <prefix>/<workspace>/<key>:
    # dev:     env:/dev/banking-platform/terraform.tfstate
    # staging: env:/staging/banking-platform/terraform.tfstate
    # prod:    env:/prod/banking-platform/terraform.tfstate
    workspace_key_prefix = "env:"
  }
}

Workspace-aware design: The terraform.workspace variable provides the current workspace name. The local.config map defines environment-specific values. Resources use local.env_config to access appropriate configuration. Production-specific features (multi-AZ, deletion protection) use conditionals (local.environment == "prod").

State isolation: The workspace_key_prefix backend configuration creates separate S3 keys per workspace, isolating state. Switching workspaces (terraform workspace select staging) automatically uses the correct state file.


Variable Management

Terraform variables parameterize configurations, enabling reuse across environments. Variables have types (string, number, bool, list, map, object), default values, descriptions, and validation rules.

Understanding Variable Precedence

Terraform loads variables from multiple sources with a specific precedence order (highest to lowest):

  1. Command-line flags (-var="instance_type=t3.large")
  2. *.auto.tfvars files (automatically loaded, alphabetical order)
  3. terraform.tfvars file
  4. Environment variables (TF_VAR_instance_type)
  5. Variable defaults in variables.tf

This precedence enables flexible configuration: defaults in variables.tf for common values, environment-specific values in terraform.tfvars, CI/CD overrides via environment variables or command-line flags.
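The precedence order can be observed by supplying the same variable at several levels; the command-line flag wins (the values here are illustrative):

```shell
# variables.tf default:   instance_type = "t3.medium"
# terraform.tfvars:       instance_type = "t3.small"

# Environment variables sit below tfvars files, so terraform.tfvars's
# t3.small would override this value
export TF_VAR_instance_type="t3.micro"

# The -var flag has the highest precedence, so this plan uses t3.large
terraform plan -var="instance_type=t3.large"
```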

Variable Definitions

# variables.tf
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"

  validation {
    condition     = can(regex("^t3\\.(nano|micro|small|medium|large|xlarge|2xlarge)$", var.instance_type))
    error_message = "Instance type must be a t3 family instance."
  }
}

variable "replica_count" {
  description = "Number of application replicas"
  type        = number
  default     = 2

  validation {
    condition     = var.replica_count >= 1 && var.replica_count <= 10
    error_message = "Replica count must be between 1 and 10."
  }
}

variable "enable_monitoring" {
  description = "Enable CloudWatch detailed monitoring"
  type        = bool
  default     = true
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}

variable "database_config" {
  description = "Database configuration"
  type = object({
    engine            = string
    engine_version    = string
    instance_class    = string
    allocated_storage = number
    multi_az          = bool
  })

  default = {
    engine            = "postgres"
    engine_version    = "16.1"
    instance_class    = "db.t4g.medium"
    allocated_storage = 20
    multi_az          = false
  }
}

variable "db_password" {
  description = "Database master password"
  type        = string
  sensitive   = true # Hides from console output

  validation {
    condition     = length(var.db_password) >= 16
    error_message = "Database password must be at least 16 characters."
  }
}

Type benefits: Typed variables catch errors early (passing a string when a number is expected fails during terraform plan). Complex types (object, map, list) structure related configuration (database config as single object rather than separate variables for engine, version, class).
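Since Terraform 1.3, object type attributes can also be marked optional() with defaults, which keeps complex variables ergonomic for callers. A sketch extending the database_config pattern above:

```hcl
variable "database_config" {
  description = "Database configuration with optional attributes"
  type = object({
    engine            = string
    engine_version    = string
    instance_class    = string
    allocated_storage = optional(number, 20)  # Defaults to 20 if omitted
    multi_az          = optional(bool, false) # Defaults to false if omitted
  })
}

# Callers may now supply only the required attributes:
#   database_config = {
#     engine         = "postgres"
#     engine_version = "16.1"
#     instance_class = "db.t4g.medium"
#   }
```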

Variable Values

# terraform.tfvars (not committed to Git for production)
environment = "prod"
aws_region  = "us-east-1"

instance_type = "t3.large"
replica_count = 3

enable_monitoring = true

availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

tags = {
  Environment = "production"
  ManagedBy   = "Terraform"
  Project     = "banking-platform"
  CostCenter  = "engineering"
}

database_config = {
  engine            = "postgres"
  engine_version    = "16.1"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
  multi_az          = true
}

# Sensitive values from environment variable or secret management
# db_password = "value-from-TF_VAR_db_password-env-var"

Variable file practices: Commit terraform.tfvars for dev/staging with non-sensitive values. Never commit prod terraform.tfvars with sensitive values. Use environment variables (TF_VAR_db_password) or secret management (Vault, AWS Secrets Manager) for sensitive values in CI/CD.

Environment Variables

# Set variables via environment variables
export TF_VAR_db_password="super-secret-password"
export TF_VAR_environment="prod"
export TF_VAR_replica_count=5

terraform apply

CI/CD usage: GitLab CI/CD or GitHub Actions set environment variables from secrets. Terraform automatically loads variables prefixed with TF_VAR_. This pattern keeps secrets out of version control while enabling automated deployments.

# .gitlab-ci.yml
deploy:prod:
  stage: deploy
  environment:
    name: production
  script:
    - terraform init
    - terraform workspace select prod
    - terraform plan -out=tfplan
    - terraform apply tfplan
  variables:
    TF_VAR_environment: "prod"
    TF_VAR_aws_region: "us-east-1"
  secrets:
    TF_VAR_db_password:
      vault: production/database/password@secret

Secret Handling

Terraform configurations often require sensitive values (database passwords, API keys, private keys). Secrets must never appear in version control, state files should be encrypted, and secret management systems should provide secrets dynamically.

Understanding Secret Management

Hardcoded secrets in .tf files or terraform.tfvars are security vulnerabilities. Anyone with repository access can read committed secrets, and even after revocation they remain in Git history forever. Hardcoded secrets also cannot rotate without code changes.

Environment variables (TF_VAR_*) improve security (secrets not in Git) but have limitations: secrets in CI/CD job logs, no audit trail, manual rotation.

External secret management (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault) provides centralized secret storage, automatic rotation, audit logging, and fine-grained access control. Terraform fetches secrets at runtime via data sources, never storing them in .tf files. For comprehensive secret management strategies across your entire stack, see Secrets Management.

Vault Integration

# Configure Vault provider
terraform {
  required_providers {
    vault = {
      source  = "hashicorp/vault"
      version = "~> 3.0"
    }
  }
}

provider "vault" {
  address = "https://vault.example.com"

  # Authenticate via Kubernetes ServiceAccount (when running in K8s)
  auth_login {
    path = "auth/kubernetes/login"

    parameters = {
      role = "terraform"
      jwt  = file("/var/run/secrets/kubernetes.io/serviceaccount/token")
    }
  }
}

# Fetch database password from Vault
data "vault_kv_secret_v2" "database" {
  mount = "secret"
  name  = "banking-platform/prod/database"
}

resource "aws_db_instance" "main" {
  identifier = "banking-prod-db"

  username = "dbadmin"
  password = data.vault_kv_secret_v2.database.data["password"]

  # ... other configuration
}

# Fetch API keys from Vault
data "vault_kv_secret_v2" "api_keys" {
  mount = "secret"
  name  = "banking-platform/prod/api-keys"
}

resource "aws_ssm_parameter" "payment_gateway_key" {
  name  = "/banking/prod/payment-gateway-api-key"
  type  = "SecureString"
  value = data.vault_kv_secret_v2.api_keys.data["payment_gateway"]

  tags = {
    ManagedBy = "Terraform"
  }
}

How it works: The Vault provider authenticates to Vault (via Kubernetes ServiceAccount JWT, AWS IAM, other methods). The vault_kv_secret_v2 data source fetches secrets from Vault's KV v2 secret engine. Secrets are used in resources (password = data.vault_kv_secret_v2.database.data["password"]) but never appear in .tf files. State files contain the secret values (another reason to encrypt state).
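A related detail: Terraform redacts sensitive provider attributes from plan output, and an output that exposes one of these values must be marked sensitive explicitly. A minimal sketch, assuming the data source above:

```hcl
# Expose the fetched secret to a parent module (value stays redacted in CLI output)
output "db_password" {
  value     = data.vault_kv_secret_v2.database.data["password"]
  sensitive = true # Required when the value derives from a sensitive attribute
}

# Reading the raw value later requires an explicit flag:
#   terraform output -raw db_password
```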

AWS Secrets Manager

# Fetch database credentials from AWS Secrets Manager
data "aws_secretsmanager_secret" "database" {
  name = "banking-platform/prod/database-credentials"
}

data "aws_secretsmanager_secret_version" "database" {
  secret_id = data.aws_secretsmanager_secret.database.id
}

locals {
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.database.secret_string)
}

resource "aws_db_instance" "main" {
  identifier = "banking-prod-db"

  username = local.db_credentials.username
  password = local.db_credentials.password

  # ... other configuration
}

# Create new secret in Secrets Manager
resource "aws_secretsmanager_secret" "api_key" {
  name                    = "banking-platform/prod/external-api-key"
  description             = "API key for external payment gateway"
  recovery_window_in_days = 30 # Deletion protection

  tags = {
    ManagedBy = "Terraform"
  }
}

resource "aws_secretsmanager_secret_version" "api_key" {
  secret_id = aws_secretsmanager_secret.api_key.id

  secret_string = jsonencode({
    api_key    = var.payment_gateway_api_key # From TF_VAR or other source
    api_secret = var.payment_gateway_api_secret
  })
}

# Enable automatic rotation
resource "aws_secretsmanager_secret_rotation" "database" {
  secret_id           = data.aws_secretsmanager_secret.database.id
  rotation_lambda_arn = aws_lambda_function.rotate_database_password.arn

  rotation_rules {
    automatically_after_days = 30
  }
}

Secrets Manager benefits: Automatic rotation (Lambda function periodically rotates secrets), versioning (retrieve previous secret versions), audit logging (CloudTrail logs all secret access), fine-grained IAM policies (restrict which IAM roles can read which secrets).
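The fine-grained IAM policies mentioned above can themselves be managed in Terraform. A sketch restricting a hypothetical application role to reading a single secret:

```hcl
# Allow one application role to read only the database credentials secret
resource "aws_iam_role_policy" "read_db_secret" {
  name = "read-database-secret"
  role = aws_iam_role.app.id # Hypothetical application role

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = [data.aws_secretsmanager_secret.database.arn]
    }]
  })
}
```

Scoping `Resource` to the specific secret ARN (rather than `*`) means a compromised application role cannot enumerate other secrets.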

Random Passwords

# Generate random password
resource "random_password" "database" {
  length  = 32
  special = true

  # Prevent recreation on every apply
  keepers = {
    database_identifier = "banking-prod-db"
  }
}

resource "aws_db_instance" "main" {
  identifier = "banking-prod-db"

  username = "dbadmin"
  password = random_password.database.result

  # ... other configuration
}

# Store generated password in Secrets Manager
resource "aws_secretsmanager_secret" "database_password" {
  name = "banking-platform/prod/database-admin-password"
}

resource "aws_secretsmanager_secret_version" "database_password" {
  secret_id = aws_secretsmanager_secret.database_password.id

  secret_string = jsonencode({
    username = "dbadmin"
    password = random_password.database.result
    host     = aws_db_instance.main.address
    port     = aws_db_instance.main.port
  })
}

Random password management: The random_password resource generates cryptographically random passwords. The keepers map ensures password stability - the password only changes if database_identifier changes (otherwise, every terraform apply generates a new password). Storing the generated password in Secrets Manager makes it available to applications (applications read from Secrets Manager, not from Terraform outputs).


CI/CD Integration

Integrating Terraform with CI/CD pipelines automates infrastructure changes, enforces review processes, and provides deployment consistency. GitLab CI/CD and GitHub Actions are common platforms for Terraform automation.

Understanding Terraform in CI/CD

Manual Terraform requires engineers to run terraform plan and terraform apply locally. This approach lacks consistency (different Terraform versions, different AWS credentials), traceability (no record of who applied what), and safety (no peer review, easy to apply to the wrong environment).

Automated Terraform runs in CI/CD pipelines: terraform plan runs on every pull request, showing proposed changes; terraform apply runs after PR approval, applying changes from a consistent environment; all runs logged and auditable.

GitOps workflow: Infrastructure changes follow the same process as code changes (branch, PR, review, merge, deploy). Pull requests show Terraform plans in comments, enabling reviewers to verify changes before approval. Merges to main trigger automated applies. This workflow aligns with our pull request best practices and code review guidelines, ensuring infrastructure changes receive the same scrutiny as application code.

For comprehensive CI/CD pipeline configuration, see GitLab CI/CD Pipelines.

GitLab CI/CD Pipeline

# .gitlab-ci.yml
stages:
  - validate
  - plan
  - apply

variables:
  TF_ROOT: ${CI_PROJECT_DIR}/environments/prod
  TF_VERSION: "1.6.0"

before_script:
  - cd ${TF_ROOT}
  - apk add --no-cache curl unzip
  - curl -o terraform.zip https://releases.hashicorp.com/terraform/${TF_VERSION}/terraform_${TF_VERSION}_linux_amd64.zip
  - unzip terraform.zip
  - mv terraform /usr/local/bin/
  - terraform --version

cache:
  paths:
    - ${TF_ROOT}/.terraform

validate:
  stage: validate
  script:
    - terraform init -backend=false
    - terraform fmt -check -recursive
    - terraform validate
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

plan:
  stage: plan
  script:
    - terraform init
    # Plan against the same workspace that apply will use, so the saved plan is valid there
    - terraform workspace select prod || terraform workspace new prod
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - ${TF_ROOT}/tfplan
    expire_in: 1 week
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

apply:
  stage: apply
  script:
    - terraform init
    - terraform workspace select prod
    - terraform apply tfplan # A saved plan applies without an approval prompt
  dependencies:
    - plan
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual # Require manual approval
  environment:
    name: production
    action: prepare

Pipeline flow: Merge requests trigger validate and plan stages. The validate stage checks Terraform formatting and configuration validity. The plan stage generates an execution plan showing proposed changes. Merges to main trigger apply stage (manual approval required), applying the previously-generated plan. The plan artifact ensures the applied changes match the reviewed plan (even if code changed between plan and apply).

Terraform Plan in MR Comments

# .gitlab-ci.yml
plan:comment:
  stage: plan
  image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest
  script:
    - cd ${TF_ROOT}
    - terraform init
    - terraform workspace select ${CI_MERGE_REQUEST_TARGET_BRANCH_NAME}
    - terraform plan -no-color | tee plan.txt
    - |
      cat <<EOF > comment.md
      ## Terraform Plan for \`${TF_ROOT}\`

      <details>
      <summary>Show plan</summary>

      \`\`\`hcl
      $(cat plan.txt)
      \`\`\`

      </details>
      EOF
    - |
      curl --request POST \
        --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
        --header "Content-Type: application/json" \
        --data "{\"body\": $(jq -Rs . < comment.md)}" \
        "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/merge_requests/${CI_MERGE_REQUEST_IID}/notes"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  allow_failure: true

Plan comments: The pipeline posts Terraform plans as merge request comments. Reviewers see exactly what resources will be created, modified, or destroyed without running Terraform locally. This visibility improves review quality and catches errors before apply.

Preventing Drift

# .gitlab-ci.yml
drift:detection:
  stage: validate
  script:
    - terraform init
    - terraform workspace select prod
    - terraform plan -detailed-exitcode
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule" # Run daily via pipeline schedule
  allow_failure: false # Fail if drift detected

Drift detection: Scheduled pipelines run terraform plan -detailed-exitcode daily. Exit code 2 indicates changes are needed (drift detected), causing the pipeline to fail and alerting the team. This catches manual changes (someone modified resources via console) or external changes (auto-scaling changed instance count).


Drift Detection

Drift occurs when real-world infrastructure diverges from Terraform state. Manual changes (via cloud console, CLI) cause drift. External systems (auto-scaling, automated remediation) cause drift. Detecting and reconciling drift maintains Terraform as the source of truth.

Understanding Drift

Why drift happens: Engineer hotfixes production issue via console (adds security group rule), automated system modifies resources (auto-scaling adjusts instance count), external process creates resources (monitoring system creates CloudWatch alarms).

Why drift matters: Terraform plans based on state don't account for manual changes, leading to unexpected results. The next terraform apply might revert necessary hotfixes (removes the manually-added security group rule). Infrastructure documentation (Terraform configs) becomes outdated, creating confusion.

Drift remediation strategies depend on the nature and impact of the drift:

  1. Import and codify (preferred for intentional changes): When drift represents a desired change that bypassed the normal process (emergency hotfix, manual remediation during incident), add the change to .tf files and run terraform apply to reconcile. This preserves the change while bringing it under Terraform management. For example, if an engineer added a security group rule during an incident to restore service, document that rule in the security group module, verify the configuration matches reality with terraform plan (should show no changes), and commit the updated code. This approach maintains Terraform as the source of truth.

  2. Revert to declared state (appropriate for unintended changes): When drift represents an error or unauthorized change (accidental console modification, misconfigured automation), run terraform apply to restore the declared state. This approach works when the current infrastructure state is incorrect and needs correction. Before reverting, ensure the reversion won't cause service disruption - a manually-added security group rule allowing traffic from a new service will break that service when reverted.

  3. Accept and ignore (for externally-managed attributes): When drift is expected and intentional (auto-scaling group desired capacity managed by Kubernetes Cluster Autoscaler, tags added by cloud governance automation), use lifecycle ignore_changes to tell Terraform to ignore specific attributes. This prevents Terraform from fighting with external systems. Document why the attribute is ignored in comments - future maintainers need to understand the decision.

Detecting Drift

# Manual drift detection
$ terraform plan -detailed-exitcode

# Exit codes:
# 0 = no changes needed (no drift)
# 1 = error
# 2 = changes needed (drift detected)

# Refresh state without applying changes (legacy alias for terraform apply -refresh-only)
$ terraform refresh

# Show current state
$ terraform show

# Compare state to reality
$ terraform plan -refresh-only

Refresh-only mode: The -refresh-only flag updates state to match reality without applying configuration changes. This is useful for reconciling Terraform state with manual changes before deciding whether to revert or incorporate them.

Automated Drift Detection

#!/bin/bash
# drift-detection.sh
set -e

ENVIRONMENTS=("dev" "staging" "prod")
DRIFT_DETECTED=false

for env in "${ENVIRONMENTS[@]}"; do
  echo "Checking drift in $env..."
  cd "environments/$env"

  terraform init -input=false

  if terraform plan -detailed-exitcode -input=false -no-color > "drift-$env.txt" 2>&1; then
    echo "No drift in $env"
  else
    # $? holds the plan exit code here: 2 means changes pending, anything else is an error
    DRIFT_EXIT_CODE=$?

    if [ $DRIFT_EXIT_CODE -eq 2 ]; then
      echo "Drift detected in $env environment!"
      echo "Changes required:"
      cat "drift-$env.txt"
      DRIFT_DETECTED=true
    else
      echo "Error running terraform plan in $env"
      exit 1
    fi
  fi

  cd ../..
done

if [ "$DRIFT_DETECTED" = true ]; then
  echo "Drift detected in one or more environments. Please review and remediate."
  exit 2
fi

echo "No drift detected in any environment."
exit 0

Automated workflow: Run this script daily via CI/CD schedule. When drift is detected, the script fails, triggering alerts (Slack notification, email, PagerDuty). Engineers investigate drift and decide on remediation strategy.

Importing Existing Resources

# Import manually-created security group
$ terraform import aws_security_group.manual_sg sg-0123456789abcdef

# Import requires resource definition in .tf files
# Add to main.tf:
resource "aws_security_group" "manual_sg" {
  name        = "manually-created-sg"
  description = "Security group created manually, now imported"
  vpc_id      = module.networking.vpc_id

  # Add ingress/egress rules to match actual state
}

# After import, run plan to verify
$ terraform plan # Should show no changes if definition matches reality

Import workflow: Identify manually-created resources (from drift detection output). Add resource definitions to .tf files matching actual configuration. Run terraform import with resource address and cloud resource ID. Verify with terraform plan (should show no changes). This brings manual resources under Terraform management.
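Terraform 1.5 and later also support declarative import blocks, which make the import itself reviewable in a plan rather than a one-off CLI command. A sketch using the same security group:

```hcl
# Plannable alternative to the CLI command (Terraform >= 1.5)
import {
  to = aws_security_group.manual_sg
  id = "sg-0123456789abcdef"
}

# terraform plan now shows the pending import; terraform apply records it in state.
# terraform plan -generate-config-out=generated.tf can even draft the matching
# resource block for you.
```

Because the import block lives in version control, reviewers see exactly which resource is being adopted before it touches state.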

Preventing Drift

# Use lifecycle rules to ignore external changes
resource "aws_autoscaling_group" "app" {
  name = "app-asg"

  min_size         = 2
  max_size         = 10
  desired_capacity = 3

  lifecycle {
    ignore_changes = [
      desired_capacity, # ASG controller modifies this
    ]
  }
}

# Protect critical resources from deletion
resource "aws_db_instance" "main" {
  identifier = "banking-prod-db"

  lifecycle {
    prevent_destroy = true # Terraform refuses to destroy this
  }
}

Lifecycle rules: The ignore_changes lifecycle rule tells Terraform to ignore changes to specific attributes. Use this for attributes managed by external systems (auto-scaling desired capacity, security group rules managed by AWS Config). The prevent_destroy rule protects critical resources from accidental deletion.
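A third lifecycle argument, create_before_destroy, supports the immutable-infrastructure principle: when a change forces replacement, Terraform builds the new resource before destroying the old one. A sketch with a security group (the resource name is illustrative):

```hcl
resource "aws_security_group" "app" {
  name_prefix = "app-" # name_prefix avoids a name collision during replacement
  vpc_id      = module.networking.vpc_id

  lifecycle {
    create_before_destroy = true # Build the replacement before destroying the original
  }
}
```

Note the pairing with name_prefix: with a fixed name, the replacement could not be created while the original still exists.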


Testing Infrastructure Code

Infrastructure code should be tested like application code. Tests validate module behavior, catch regressions, and document expected outcomes. Terratest provides Go-based testing for Terraform modules.

Understanding Infrastructure Testing

Unit tests validate individual modules in isolation (networking module creates correct number of subnets, security group module creates expected rules). Unit tests use temporary infrastructure (create, test, destroy) or mock cloud provider responses.

Unit testing infrastructure differs from application unit testing because infrastructure tests interact with real cloud providers. You cannot mock AWS - the test must create real VPCs, subnets, and security groups. This makes infrastructure unit tests slower and more expensive than application unit tests. Terratest addresses this by providing cleanup guarantees (defer terraform.Destroy) and parallelization (t.Parallel()), enabling multiple tests to run concurrently in isolated AWS accounts or regions.

Cost management for infrastructure tests is critical. Tests creating production-scale resources (large RDS instances, expensive EC2 types) quickly accumulate costs. Use the smallest viable resource sizes for tests (db.t4g.micro instead of db.r6g.xlarge, single-AZ instead of multi-AZ). Run tests in isolated AWS accounts with billing alerts. Clean up aggressively - failed tests should still destroy resources via defer clauses. Consider scheduled cleanup jobs that destroy resources tagged with test markers older than 24 hours, catching resources from crashed test runs.

Integration tests validate modules working together (application can connect to database through created networking, load balancer routes traffic to instances). Integration tests create real infrastructure, run assertions, then destroy.

Integration tests verify end-to-end workflows that span multiple modules. For example, an integration test might deploy networking, database, and application modules, then verify the application can connect to the database through private networking. These tests catch issues that unit tests miss - subnet routing misconfiguration, security group rule gaps, incompatible module versions.

Integration tests take longer than unit tests (deploying full stacks takes 10-30 minutes) and cost more (more resources running for longer). Reserve integration tests for critical workflows and run them less frequently - unit tests on every commit, integration tests on merges to main or scheduled overnight.

Contract tests validate module interfaces (networking module outputs expected values, database module accepts expected inputs). Contract tests ensure module consumers and producers agree on interfaces.

Contract testing prevents breaking changes to module interfaces. If the networking module removes the database_subnet_ids output that the database module depends on, contract tests catch this. Contract tests run quickly (no infrastructure creation) and should run on every commit. They verify output types match expected types, required outputs exist, and variable validation rules are present for required inputs.

Terratest Example

// tests/networking_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestNetworkingModule(t *testing.T) {
    t.Parallel()

    // Define test variables
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/networking",

        Vars: map[string]interface{}{
            "environment":        "test",
            "vpc_cidr":           "10.100.0.0/16",
            "availability_zones": []string{"us-east-1a", "us-east-1b"},
            "public_subnets":     []string{"10.100.1.0/24", "10.100.2.0/24"},
            "private_subnets":    []string{"10.100.101.0/24", "10.100.102.0/24"},
            "database_subnets":   []string{"10.100.201.0/24", "10.100.202.0/24"},
            "enable_nat_gateway": true,
            "single_nat_gateway": false,
        },
    })

    // Destroy infrastructure at end of test
    defer terraform.Destroy(t, terraformOptions)

    // Create infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Validate outputs
    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcID, "VPC ID should not be empty")

    publicSubnetIDs := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
    assert.Len(t, publicSubnetIDs, 2, "Should create 2 public subnets")

    privateSubnetIDs := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
    assert.Len(t, privateSubnetIDs, 2, "Should create 2 private subnets")

    // Validate VPC exists in AWS
    vpc := aws.GetVpcById(t, vpcID, "us-east-1")
    assert.Equal(t, "10.100.0.0/16", vpc.CidrBlock, "VPC CIDR should match")

    // Validate subnets are in correct AZs
    subnet1 := aws.GetSubnetById(t, publicSubnetIDs[0], "us-east-1")
    assert.Equal(t, "us-east-1a", subnet1.AvailabilityZone, "Subnet should be in us-east-1a")

    subnet2 := aws.GetSubnetById(t, publicSubnetIDs[1], "us-east-1")
    assert.Equal(t, "us-east-1b", subnet2.AvailabilityZone, "Subnet should be in us-east-1b")

    // Validate NAT gateways created
    natGatewayIPs := terraform.OutputList(t, terraformOptions, "nat_gateway_ips")
    assert.Len(t, natGatewayIPs, 2, "Should create 2 NAT gateways (one per AZ)")
}

func TestDatabaseModule(t *testing.T) {
    t.Parallel()

    // Create networking first (dependency)
    networkingOptions := &terraform.Options{
        TerraformDir: "../modules/networking",
        Vars: map[string]interface{}{
            "environment":        "test",
            "vpc_cidr":           "10.101.0.0/16",
            "availability_zones": []string{"us-east-1a", "us-east-1b"},
            "public_subnets":     []string{"10.101.1.0/24"},
            "private_subnets":    []string{"10.101.101.0/24"},
            "database_subnets":   []string{"10.101.201.0/24", "10.101.202.0/24"},
            "enable_nat_gateway": false,
        },
    }

    defer terraform.Destroy(t, networkingOptions)
    terraform.InitAndApply(t, networkingOptions)

    vpcID := terraform.Output(t, networkingOptions, "vpc_id")
    subnetIDs := terraform.OutputList(t, networkingOptions, "database_subnet_ids")

    // Create database using networking outputs
    databaseOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/database",

        Vars: map[string]interface{}{
            "environment":       "test",
            "identifier":        "test-db",
            "engine":            "postgres",
            "engine_version":    "16.1",
            "instance_class":    "db.t4g.micro",
            "allocated_storage": 20,
            "database_name":     "testdb",
            "username":          "testuser",
            "password":          "TestPassword123!",
            "vpc_id":            vpcID,
            "subnet_ids":        subnetIDs,
        },
    })

    defer terraform.Destroy(t, databaseOptions)
    terraform.InitAndApply(t, databaseOptions)

    // Validate database endpoint
    dbEndpoint := terraform.Output(t, databaseOptions, "endpoint")
    assert.NotEmpty(t, dbEndpoint, "Database endpoint should not be empty")
    assert.Contains(t, dbEndpoint, "rds.amazonaws.com", "Endpoint should be RDS hostname")
}

Test structure: Each test function creates infrastructure (terraform.InitAndApply), validates outputs and AWS resources (assertions), then destroys infrastructure (defer terraform.Destroy). The t.Parallel() enables running multiple tests concurrently, reducing test execution time. This testing approach follows the same principles as our general testing strategy, adapted for infrastructure code.

Running tests:

# Run all tests
$ cd tests
$ go test -v -timeout 30m

# Run specific test
$ go test -v -run TestNetworkingModule -timeout 30m

# Run tests in parallel
$ go test -v -parallel 5 -timeout 30m

Validation Without Creating Resources

// tests/validate_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestTerraformValidation(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/networking",
    }

    // Validate Terraform configuration syntax
    terraform.Validate(t, terraformOptions)
}

func TestVariableValidation(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/networking",

        Vars: map[string]interface{}{
            "environment": "invalid-env", // Should fail validation
        },
    }

    // This should fail due to the validation rule
    _, err := terraform.InitAndPlanE(t, terraformOptions)
    assert.Error(t, err, "Should fail validation for invalid environment")
}

Validation tests check Terraform configuration validity without creating resources. These tests run quickly (no infrastructure creation) and catch syntax errors, invalid variable values, and constraint violations.
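Terraform 1.6+ (the version pinned in the pipeline above) also ships a native test framework, terraform test, which covers this plan-only validation without Go. A sketch, assuming the networking module's variables; the resource name in the second run block is hypothetical:

```hcl
# modules/networking/tests/contract.tftest.hcl
run "rejects_invalid_environment" {
  command = plan

  variables {
    environment = "invalid-env"
  }

  expect_failures = [
    var.environment, # The validation rule should reject this value
  ]
}

run "plans_with_valid_inputs" {
  command = plan

  variables {
    environment = "dev"
    vpc_cidr    = "10.100.0.0/16"
  }

  assert {
    condition     = length(aws_subnet.public) >= 1 # Hypothetical resource name
    error_message = "At least one public subnet should be planned."
  }
}
```

Running terraform test from the module directory executes every *.tftest.hcl file; with command = plan, no real infrastructure is created.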


Further Reading

Internal Documentation

External Resources


Summary

Key Takeaways

  1. Modular structure - Reusable modules for networking, compute, database; environment-specific root modules
  2. Remote state - S3 + DynamoDB for state storage and locking; never use local state for teams
  3. State encryption - S3 SSE-KMS for state at rest; HTTPS for state in transit
  4. Variable validation - Catch errors early with validation rules; document constraints
  5. Secret management - Vault or Secrets Manager for sensitive values; never commit secrets
  6. Workspaces vs directories - Directories for strong isolation; workspaces for simple projects
  7. CI/CD integration - Automated plans on PRs, manual applies to production, drift detection
  8. Module design - Clear inputs/outputs, validation, documentation, testability
  9. Drift detection - Scheduled plans detect manual changes; import or revert as needed
  10. Testing - Terratest for module validation; integration tests for end-to-end flows

Next Steps: Review Kubernetes Best Practices for managing Kubernetes with Terraform, GitLab CI/CD Pipelines for pipeline configuration, and Secrets Management for comprehensive secret handling strategies.