AWS Cost Optimization
Cloud costs can spiral out of control without deliberate optimization. Unlike traditional infrastructure where costs are fixed (you own the hardware), cloud costs are variable - every resource provisioned, every API call made, every gigabyte transferred incurs charges. This creates opportunity (pay only for what you use) and risk (uncontrolled usage leads to unexpected bills).
Cost optimization is not a one-time activity but a continuous practice. It requires visibility into spending, accountability through cost allocation, architectural decisions that favor efficiency, and ongoing monitoring to catch waste. The most effective cost optimization combines technical improvements (right-sizing instances, using Reserved Instances) with organizational practices (cost ownership, budget alerts, architectural reviews).
The AWS Well-Architected Framework identifies cost optimization as one of its six pillars. The principle is simple: achieve business outcomes while minimizing costs. This means understanding spending patterns, eliminating waste, and selecting cost-effective resources without sacrificing performance, reliability, or security.
Cost optimization means spending efficiently to achieve business goals. Cost cutting means reducing spend at the expense of outcomes. Focus on optimizing - getting more value per dollar spent - rather than arbitrary budget reductions that harm service quality.
Core Principles
- Cost Visibility: Understand what you're spending and where
- Cost Allocation: Attribute costs to teams, projects, environments
- Right-Sizing: Match resources to actual workload requirements
- Waste Elimination: Identify and remove unused/idle resources
- Cost-Aware Architecture: Design applications with cost implications in mind
- Continuous Optimization: Regularly review and optimize spending
- Cost Ownership: Teams responsible for resources own their costs
Cost Visibility
You cannot optimize what you cannot measure. AWS provides multiple tools for understanding costs, but they require configuration to be useful.
AWS Cost Explorer
Cost Explorer provides interactive visualization of spending over time. It answers questions like "What did we spend last month?" and "Which services cost the most?"
Key features:
- Trends: Spending over time (daily, monthly, yearly)
- Service breakdown: Costs by AWS service (EC2, RDS, S3, data transfer, etc.)
- Filtering: By account, region, service, tag, instance type
- Forecasting: Predict future costs based on historical trends
Example analysis workflow:
Access Cost Explorer:
AWS Console → Billing → Cost Explorer
Use filters to drill down: "Show EC2 costs in us-east-1 for production environment tagged resources."
Cost Allocation Tags
Tags are key-value pairs attached to AWS resources. Cost allocation tags enable you to categorize and track costs by dimensions meaningful to your organization (team, project, environment, cost center).
Standard tagging strategy:
Environment: prod | staging | dev
Team: platform | payments | mobile
Application: api-gateway | user-service | notification-service
CostCenter: engineering | marketing | operations
Owner: [email protected]
Project: customer-onboarding | fraud-detection
Why this matters: Without tags, all costs appear as undifferentiated spending. With tags, you can answer:
- "How much does the payments team spend vs mobile team?"
- "What's the cost of our production environment vs staging?"
- "Which projects are most expensive to run?"
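Once tags exist, each of these questions is a group-by over billing line items. A minimal pure-Python sketch (the records are hypothetical stand-ins for CUR data):

```python
from collections import defaultdict

# Hypothetical line items: ({tag_key: tag_value}, unblended cost in USD)
line_items = [
    ({"Team": "payments", "Environment": "prod"}, 1200.0),
    ({"Team": "payments", "Environment": "staging"}, 150.0),
    ({"Team": "mobile", "Environment": "prod"}, 800.0),
    ({}, 75.0),  # untagged spend is surfaced rather than hidden
]

def cost_by_tag(items, tag_key):
    """Sum cost per value of a cost allocation tag."""
    totals = defaultdict(float)
    for tags, cost in items:
        totals[tags.get(tag_key, "(untagged)")] += cost
    return dict(totals)

print(cost_by_tag(line_items, "Team"))
print(cost_by_tag(line_items, "Environment"))
```

Note how untagged spend gets its own bucket: visible untagged cost is what motivates the enforcement policies below.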
Implementing Tagging with Terraform
# Enforce consistent tagging via Terraform provider defaults
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
      Team        = var.team
      Application = var.application
      CostCenter  = var.cost_center
      Owner       = var.owner_email
    }
  }
}

# All resources automatically inherit these tags
resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type

  # Additional resource-specific tags
  tags = {
    Name = "${var.environment}-app-server"
    Role = "application-server"
  }
}
Benefit: Every resource created by Terraform automatically has standard tags. This prevents untagged resources and ensures consistent cost allocation. See Terraform Best Practices for comprehensive tagging patterns.
Enforcing Tagging with IAM Policies
Prevent resource creation without required tags:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTagsOnResourceCreation",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotLike": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev"],
          "aws:RequestTag/Team": "*",
          "aws:RequestTag/CostCenter": "*"
        }
      }
    }
  ]
}
This policy is intended to deny EC2, RDS, and S3 resource creation unless the request includes Environment, Team, and CostCenter tags with valid values. One IAM subtlety to watch: multiple keys inside a single condition block are ANDed, so a combined Deny like this only fires when all three checks fail. Production-grade enforcement typically uses one Deny statement per required tag (or `ForAllValues` with `aws:TagKeys`).
Tagging Existing Untagged Resources
Find untagged resources:
# Find EC2 instances without Environment tag
aws ec2 describe-instances \
--query 'Reservations[].Instances[?!Tags[?Key==`Environment`]].[InstanceId,State.Name]' \
--output table
# Find RDS instances without Team tag
aws rds describe-db-instances \
--query 'DBInstances[?!TagList[?Key==`Team`]].[DBInstanceIdentifier,DBInstanceClass]' \
--output table
Tag resources in bulk:
# Tag multiple EC2 instances
aws ec2 create-tags \
--resources i-1234567890abcdef0 i-0987654321fedcba0 \
--tags Key=Environment,Value=prod Key=Team,Value=platform
Automated tagging with Lambda:
# lambda_auto_tagger.py
import boto3
import json

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """
    Automatically tag new EC2 instances based on creator's identity.
    Triggered by an EventBridge (CloudWatch Events) rule on the RunInstances API call.
    """
    detail = event['detail']
    instance_ids = []
    for item in detail['responseElements']['instancesSet']['items']:
        instance_ids.append(item['instanceId'])

    # Get creator information from the CloudTrail event
    user_identity = detail['userIdentity']
    creator = user_identity.get('principalId', 'unknown')

    # Apply default tags
    ec2.create_tags(
        Resources=instance_ids,
        Tags=[
            {'Key': 'CreatedBy', 'Value': creator},
            {'Key': 'AutoTagged', 'Value': 'true'},
            {'Key': 'CreatedAt', 'Value': detail['eventTime']},
        ]
    )
    return {
        'statusCode': 200,
        'body': json.dumps(f'Tagged {len(instance_ids)} instances')
    }
Deploy this Lambda with an EventBridge rule triggered on RunInstances API calls. New instances automatically receive creator tags for accountability.
AWS Cost and Usage Report (CUR)
Cost Explorer provides aggregated views, but Cost and Usage Report gives granular, line-item billing data. CUR delivers hourly usage data to S3 for analysis with tools like Athena, QuickSight, or third-party BI platforms.
CUR contains:
- Every resource used (instance ID, S3 bucket name)
- Usage amount (instance hours, GB transferred)
- Cost (exact dollar amount)
- Tags (all cost allocation tags)
- Pricing details (on-demand, reserved, spot)
Enable CUR:
AWS Console → Billing → Cost & Usage Reports → Create Report
Query CUR with Athena:
-- Find top 10 most expensive resources last month
SELECT
  line_item_resource_id,
  product_product_name,
  SUM(line_item_unblended_cost) AS total_cost
FROM
  cost_and_usage_report
WHERE
  line_item_usage_start_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1' MONTH)
  AND line_item_usage_start_date < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY
  line_item_resource_id,
  product_product_name
ORDER BY
  total_cost DESC
LIMIT 10;
This identifies your most expensive individual resources (specific EC2 instances, RDS databases, S3 buckets) for targeted optimization.
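For ad-hoc analysis without Athena, the same top-N aggregation over exported CUR rows is a few lines of Python (the rows below are made-up stand-ins for real line items):

```python
from collections import Counter

# Hypothetical (resource_id, service, cost) tuples from a CUR export
rows = [
    ("i-0abc", "Amazon EC2", 310.0),
    ("db-prod", "Amazon RDS", 290.0),
    ("i-0abc", "Amazon EC2", 45.0),   # same resource, another line item
    ("bucket-logs", "Amazon S3", 12.0),
]

def top_resources(rows, n=10):
    """Sum cost per resource and return the n most expensive."""
    totals = Counter()
    for resource_id, _service, cost in rows:
        totals[resource_id] += cost
    return totals.most_common(n)

print(top_resources(rows, n=2))  # i-0abc first at 355.0, then db-prod at 290.0
```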
Compute Cost Optimization
Compute (EC2, ECS, Lambda) typically represents the largest portion of AWS bills. Optimization focuses on right-sizing, selecting appropriate pricing models, and auto-scaling.
Right-Sizing EC2 Instances
Problem: Over-provisioned instances waste money. An application running on r6g.2xlarge (8 vCPU, 64 GB RAM) but only using 10% CPU and 20% memory is wasting ~$300/month.
Solution: Use AWS Compute Optimizer and CloudWatch metrics to identify underutilized instances.
AWS Compute Optimizer
Compute Optimizer analyzes historical utilization (CPU, memory, network, disk) and recommends optimal instance types.
Example recommendation:
Current: r6g.2xlarge (8 vCPU, 64 GB RAM) - $430/month
Recommendation: r6g.large (2 vCPU, 16 GB RAM) - $107/month
Savings: $323/month (75% reduction)
Confidence: High (based on 14 days of metrics)
Reasoning:
- Average CPU: 12%
- Peak CPU: 25%
- Average Memory: 18%
- r6g.large provides sufficient capacity with headroom
Access Compute Optimizer:
AWS Console → Compute Optimizer → EC2 Recommendations
Automated right-sizing script:
# right_size_ec2.py
import boto3

optimizer = boto3.client('compute-optimizer')

def get_recommendations():
    """Print EC2 right-sizing recommendations worth more than $50/month."""
    response = optimizer.get_ec2_instance_recommendations()
    for rec in response['instanceRecommendations']:
        instance_id = rec['instanceArn'].split('/')[-1]
        current_type = rec['currentInstanceType']
        # Check the top-ranked option for a savings opportunity
        for option in rec['recommendationOptions']:
            if option['rank'] == 1:  # Top recommendation
                recommended_type = option['instanceType']
                savings = option.get('savingsOpportunity', {})
                monthly = savings.get('estimatedMonthlySavings', {}).get('value', 0)
                if monthly > 50:
                    print(f"Instance: {instance_id}")
                    print(f"  Current: {current_type}")
                    print(f"  Recommended: {recommended_type}")
                    print(f"  Monthly Savings: ${monthly:.2f}")
                    print(f"  Finding: {rec['finding']}")  # e.g. OVER_PROVISIONED
                    print()

if __name__ == '__main__':
    get_recommendations()
Run this monthly to identify optimization opportunities.
Don't right-size too aggressively. Leave 20-30% headroom for traffic spikes. Monitor application performance after downsizing. If response times increase or error rates rise, the instance was too small.
Savings Plans and Reserved Instances
Problem: On-demand instances are flexible but expensive. If you run instances 24/7, you're overpaying.
Solution: Commit to usage with Savings Plans or Reserved Instances for 30-70% discounts.
Savings Plans vs Reserved Instances
| Feature | Savings Plans | Reserved Instances |
|---|---|---|
| Discount | Up to 72% | Up to 72% |
| Flexibility | Any instance family/size/region (Compute SP) | Specific instance type/region |
| Term | 1 or 3 years | 1 or 3 years |
| Payment | All upfront / Partial / No upfront | All upfront / Partial / No upfront |
| Applies to | EC2, Lambda, Fargate | EC2 only |
| Recommendation | Preferred (more flexible) | Legacy option |
Savings Plans are generally better: They automatically apply to any EC2 instance family, Lambda, or Fargate usage. You don't need to predict exact instance types.
Example:
Baseline spend: $10,000/month on EC2 (mix of m6i.large, r6g.xlarge, c6g.2xlarge)
Purchase Compute Savings Plan: $6,000/month commitment (1-year, no upfront)
Discount rate: 40%
Result:
- $6,000/month at 40% discount = $10,000/month worth of compute
- Covers all $10,000 of usage
- Savings: $4,000/month = $48,000/year
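The example generalizes: a commitment of C dollars per month at discount rate d covers C / (1 - d) worth of on-demand usage. A hedged helper (illustrative arithmetic, not a pricing API):

```python
def savings_plan_value(commitment, discount_rate, on_demand_spend):
    """Return (covered_usage, monthly_savings) for a Savings Plan commitment.

    commitment       -- monthly commitment in post-discount dollars
    discount_rate    -- e.g. 0.40 for a 40% discount
    on_demand_spend  -- what the workload would cost purely on-demand
    """
    covered = commitment / (1 - discount_rate)  # on-demand-equivalent coverage
    covered = min(covered, on_demand_spend)     # can't save on unused commitment
    savings = covered - commitment              # negative if over-committed
    return covered, savings

covered, savings = savings_plan_value(6000, 0.40, 10_000)
print(covered, savings)
```

The over-commitment branch matters: if actual usage falls below the coverage the commitment buys, savings shrink and can go negative, which is why AWS bases its recommendations on trailing usage.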
How to purchase:
AWS Console → Billing → Savings Plans → Recommendations
AWS analyzes your usage and recommends commitment amounts based on historical patterns.
Terraform for Reserved Instances (if needed):
Reserved Instance purchases cannot be made via Terraform (AWS API limitation), but you can track them:
# Document RI commitments (for reference)
locals {
  reserved_instances = {
    "payment-api-prod" = {
      instance_type = "r6g.xlarge"
      count         = 5
      term          = "1-year"
      payment       = "no-upfront"
      monthly_cost  = 285 # $57/instance * 5
    }
  }
}
See AWS Compute for detailed instance type selection and sizing strategies.
Spot Instances
Problem: On-demand and Reserved Instances are expensive for fault-tolerant, interruptible workloads.
Solution: Use Spot Instances (up to 90% discount) for workloads that can handle interruptions.
Suitable workloads:
- Batch processing jobs
- Data analysis (big data, ML training)
- CI/CD pipeline agents
- Development/test environments
- Stateless containerized applications with auto-scaling
Not suitable for:
- Databases (interruption causes downtime)
- Long-running stateful processes
- Production APIs without redundancy
Example: Spot Instances for ECS tasks:
resource "aws_ecs_capacity_provider" "spot" {
  name = "spot-capacity-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.ecs_spot.arn

    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 80 # Target 80% utilization
      minimum_scaling_step_size = 1
      maximum_scaling_step_size = 100
    }
  }
}

resource "aws_autoscaling_group" "ecs_spot" {
  name             = "ecs-spot-asg"
  min_size         = 1
  max_size         = 10
  desired_capacity = 3

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0 # All Spot
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.ecs.id
      }

      # Request multiple instance types for availability
      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.large"
      }
      override {
        instance_type = "t2.large"
      }
    }
  }
}
Key configuration: spot_allocation_strategy = "price-capacity-optimized" balances cost and interruption risk.
Handling Spot interruptions:
Spot Instances can be interrupted with a two-minute warning. Handle it gracefully:
# spot_interrupt_handler.py
import time

import requests

# AWS publishes the interruption notice via instance metadata;
# this endpoint returns 404 until an interruption is scheduled.
METADATA_URL = 'http://169.254.169.254/latest/meta-data/spot/instance-action'

def check_spot_interruption():
    """Check whether this Spot instance is scheduled for interruption.

    Uses IMDSv1 for brevity; instances that enforce IMDSv2 must first
    fetch a session token and pass it in the request headers.
    """
    try:
        response = requests.get(METADATA_URL, timeout=1)
        if response.status_code == 200:
            return True  # Interruption scheduled
    except requests.exceptions.RequestException:
        pass  # Metadata unreachable -- treat as no interruption
    return False

def graceful_shutdown():
    """Drain tasks and shut down gracefully."""
    print("Spot interruption detected. Draining tasks...")
    # Deregister from the load balancer
    # Stop accepting new work
    # Finish in-progress tasks (up to 120 seconds)
    time.sleep(5)  # Simulate work drain
    print("Shutdown complete")
    raise SystemExit(0)

if __name__ == '__main__':
    while True:
        if check_spot_interruption():
            graceful_shutdown()
        time.sleep(5)  # Poll every 5 seconds
Run this as a sidecar process on Spot instances. When interruption is detected, gracefully shut down.
Auto-Scaling for Cost Optimization
Problem: Running fixed capacity 24/7 wastes money during low-traffic periods.
Solution: Auto-scale based on demand. Scale down during nights/weekends, scale up during business hours or traffic spikes.
Example: Schedule-based scaling for dev environments:
# Scale down dev environment outside business hours
resource "aws_autoscaling_schedule" "scale_down_evening" {
scheduled_action_name = "scale-down-evening"
autoscaling_group_name = aws_autoscaling_group.dev.name
recurrence = "0 18 * * MON-FRI" # 6 PM weekdays
min_size = 0
max_size = 0
desired_capacity = 0
}
resource "aws_autoscaling_schedule" "scale_up_morning" {
scheduled_action_name = "scale-up-morning"
autoscaling_group_name = aws_autoscaling_group.dev.name
recurrence = "0 8 * * MON-FRI" # 8 AM weekdays
min_size = 1
max_size = 3
desired_capacity = 2
}
Savings: The dev environment runs 10 hours per weekday instead of 24/7. Counting weekdays alone that is a 58% reduction; since the schedule also leaves weekends scaled to zero (down Friday 6 PM, up Monday 8 AM), the actual reduction is about 70% (50 of 168 weekly hours).
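The savings figure is pure hours arithmetic; a quick sketch to sanity-check schedules like the one above:

```python
def schedule_savings(hours_on_per_day, days_per_week=5):
    """Fraction of a 24/7 bill saved by only running on a schedule."""
    hours_on = hours_on_per_day * days_per_week
    return 1 - hours_on / (24 * 7)

# 8 AM - 6 PM, weekdays only: 50 of 168 weekly hours
print(f"{schedule_savings(10):.0%} saved")  # → 70% saved
```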
Metric-based scaling for production:
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up-on-cpu"
  autoscaling_group_name = aws_autoscaling_group.prod.name
  policy_type            = "TargetTrackingScaling"

  # Target tracking manages capacity itself, so simple-scaling arguments
  # (adjustment_type, scaling_adjustment) do not apply here
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0 # Maintain 70% average CPU
  }
}
Auto-scaling ensures you pay only for capacity you need.
Storage Cost Optimization
Storage costs accumulate from S3 objects, EBS volumes, snapshots, and data transfer. Optimization focuses on lifecycle management and storage tiering.
S3 Lifecycle Policies
Problem: Objects stored in S3 Standard forever, even if rarely accessed.
Solution: Automatically transition objects to cheaper storage classes based on access patterns.
S3 storage class pricing (approximate, us-east-1):
| Storage Class | $/GB-month | Retrieval Cost | Use Case |
|---|---|---|---|
| Standard | $0.023 | None | Frequently accessed |
| Intelligent-Tiering | $0.023 + monitoring | None | Unknown access patterns |
| Standard-IA | $0.0125 | $0.01/GB | Infrequently accessed (monthly) |
| Glacier Instant | $0.004 | $0.03/GB | Archive with instant retrieval |
| Glacier Flexible | $0.0036 | $0.02/GB + 3-5 hour wait | Archive, rarely accessed |
| Glacier Deep Archive | $0.00099 | $0.02/GB + 12 hour wait | Long-term archive (yearly access) |
Lifecycle policy example:
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "archive-old-logs"
    status = "Enabled"

    # Transition to IA after 30 days
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    # Transition to Glacier after 90 days
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # Delete after 365 days
    expiration {
      days = 365
    }

    # Delete incomplete multipart uploads after 7 days
    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  rule {
    id     = "delete-old-versions"
    status = "Enabled"

    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "GLACIER"
    }

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
Savings example:
- 1 TB of logs in S3 Standard: $23.55/month
- After 30 days → Standard-IA: $12.80/month (46% savings)
- After 90 days → Glacier: $3.69/month (84% savings)
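These savings are plain GB-times-rate arithmetic. A sketch using the approximate us-east-1 prices from the table above (prices drift; verify before relying on them):

```python
# Approximate $/GB-month, us-east-1 -- illustrative, check current pricing
S3_PRICES = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_storage_cost(gb, storage_class):
    """Monthly storage cost in USD (retrieval and request fees excluded)."""
    return gb * S3_PRICES[storage_class]

tb = 1024  # 1 TB in GB
for cls in ("STANDARD", "STANDARD_IA", "GLACIER"):
    print(f"{cls}: ${monthly_storage_cost(tb, cls):.2f}/month")
```

Remember that the cheaper tiers add retrieval fees, so the storage rate alone understates the cost of data you still read often.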
S3 Intelligent-Tiering
Problem: You don't know access patterns (some objects frequently accessed, others rarely).
Solution: S3 Intelligent-Tiering automatically moves objects between access tiers.
How it works:
- Objects not accessed for 30 days → Infrequent Access tier (40% savings)
- Objects not accessed for 90 days → Archive Instant Access tier (68% savings)
- Opt-in archive tiers (enabled via configuration, thresholds adjustable): Archive Access after 90+ days and Deep Archive Access after 180+ days, for savings of up to 95%
Enable Intelligent-Tiering:
resource "aws_s3_bucket_intelligent_tiering_configuration" "user_uploads" {
  bucket = aws_s3_bucket.user_uploads.id
  name   = "entire-bucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
Cost: Small monitoring fee ($0.0025 per 1,000 objects), but savings outweigh cost for most workloads.
See File Storage (S3) for comprehensive S3 patterns.
EBS Snapshot Cleanup
Problem: EBS snapshots accumulate over time. Old snapshots (no longer needed) cost $0.05/GB-month.
Solution: Automate snapshot deletion.
Identify old snapshots:
# Find snapshots older than 90 days
aws ec2 describe-snapshots \
--owner-ids self \
--query "Snapshots[?StartTime<='$(date -u -d '90 days ago' +%Y-%m-%d)'].[SnapshotId,StartTime,VolumeSize]" \
--output table
Automated cleanup with Lambda:
# lambda_snapshot_cleanup.py
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """Delete EBS snapshots older than 90 days that no AMI references."""
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=90)

    # describe_snapshots is paginated; collect every page
    snapshots = []
    paginator = ec2.get_paginator('describe_snapshots')
    for page in paginator.paginate(OwnerIds=['self']):
        snapshots.extend(page['Snapshots'])

    deleted_count = 0
    freed_gb = 0
    for snapshot in snapshots:
        start_time = snapshot['StartTime']  # timezone-aware UTC
        if start_time < cutoff_date:
            snapshot_id = snapshot['SnapshotId']
            volume_size = snapshot['VolumeSize']
            # Skip snapshots still referenced by an AMI
            images = ec2.describe_images(
                Filters=[{'Name': 'block-device-mapping.snapshot-id',
                          'Values': [snapshot_id]}]
            )
            if not images['Images']:  # Not used by any AMI
                print(f"Deleting snapshot {snapshot_id} "
                      f"({volume_size} GB, created {start_time})")
                ec2.delete_snapshot(SnapshotId=snapshot_id)
                deleted_count += 1
                freed_gb += volume_size

    print(f"Deleted {deleted_count} snapshots, freed {freed_gb} GB")
    print(f"Estimated monthly savings: ${freed_gb * 0.05:.2f}")
    return {'deletedCount': deleted_count, 'freedGB': freed_gb}
Schedule this Lambda weekly with EventBridge.
Database Cost Optimization
Databases (RDS, Aurora, DynamoDB) are expensive. Optimization focuses on right-sizing, using appropriate instance types, and serverless options.
RDS Right-Sizing
Problem: Over-provisioned RDS instances (e.g., db.r6g.2xlarge when db.r6g.large is sufficient).
Solution: Monitor CloudWatch metrics and downsize.
Key metrics:
- CPUUtilization: should average 50-70% (leaving headroom for spikes)
- DatabaseConnections: track connection pool usage
- ReadLatency / WriteLatency: ensure queries stay fast after downsizing
- FreeableMemory: memory headroom (keep a buffer)
Query CloudWatch metrics:
# Get average CPU for RDS instance over last 7 days
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=prod-db \
--start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average
If average CPU < 30% consistently: Downsize instance.
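A rough way to encode that rule (the thresholds are this document's heuristics, not AWS guidance):

```python
def should_downsize(avg_cpu_pct, peak_cpu_pct, free_memory_pct):
    """Heuristic downsize check: sustained low average CPU, a peak that
    still leaves room one size down, and memory that isn't the bottleneck."""
    return avg_cpu_pct < 30 and peak_cpu_pct < 60 and free_memory_pct > 40

print(should_downsize(avg_cpu_pct=12, peak_cpu_pct=25, free_memory_pct=70))  # True
print(should_downsize(avg_cpu_pct=12, peak_cpu_pct=85, free_memory_pct=70))  # False
```

The peak check matters: an instance that averages 12% CPU but regularly spikes to 85% may still need its current size.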
Aurora Serverless v2
Problem: RDS/Aurora instances run 24/7 even during low-traffic periods (nights, weekends).
Solution: Aurora Serverless v2 automatically scales capacity based on load.
How it works:
- Define min/max Aurora Capacity Units (ACUs)
- Aurora scales up during traffic spikes (seconds)
- Aurora scales down during low traffic (saves cost)
- Pay only for ACUs consumed (per-second billing)
Example configuration:
resource "aws_rds_cluster" "serverless" {
  cluster_identifier = "prod-aurora-serverless"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned" # Serverless v2 uses provisioned mode
  engine_version     = "14.6"
  database_name      = "appdb"
  master_username    = "admin"
  master_password    = random_password.db_password.result

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # 0.5 ACU = ~1 GB RAM (minimum)
    max_capacity = 16  # 16 ACU = ~32 GB RAM (max for traffic spikes)
  }

  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"
  vpc_security_group_ids  = [aws_security_group.aurora.id]
  db_subnet_group_name    = aws_db_subnet_group.aurora.name
}

resource "aws_rds_cluster_instance" "serverless" {
  cluster_identifier = aws_rds_cluster.serverless.id
  instance_class     = "db.serverless" # Serverless instance class
  engine             = aws_rds_cluster.serverless.engine
}
Cost comparison (illustrative, assuming ~$0.12 per ACU-hour; verify current regional pricing):
Traditional RDS: db.r6g.large (24/7) ≈ $200/month
Aurora Serverless v2:
- Idle at the 0.5 ACU floor, 16 hours/day: 0.5 * 16 * 30 * $0.12 = $28.80
- Business hours averaging 2 ACU, 8 hours/day: 2 * 8 * 30 * $0.12 = $57.60
Total: ~$86/month (~57% savings)
Best for: Applications with variable load (e.g., business hours traffic, weekend downtime).
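Estimating the serverless bill is straightforward: ACU-hours times the rate. A sketch assuming ~$0.12 per ACU-hour (an assumed us-east-1 list price; verify for your region):

```python
ACU_HOUR_PRICE = 0.12  # USD per ACU-hour -- assumed rate, check current pricing

def aurora_serverless_monthly(usage_profile, days=30):
    """usage_profile: list of (average_acus, hours_per_day) segments."""
    acu_hours_per_day = sum(acus * hours for acus, hours in usage_profile)
    return acu_hours_per_day * days * ACU_HOUR_PRICE

# Idle at the 0.5 ACU floor for 16 h/day, ~2 ACUs over an 8-hour business day
cost = aurora_serverless_monthly([(0.5, 16), (2, 8)])
print(f"${cost:.2f}/month")  # → $86.40/month
```

The break-even question is simply whether your average ACU consumption, priced per hour, undercuts the equivalent always-on instance.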
See AWS Databases for Aurora configuration details.
DynamoDB On-Demand vs Provisioned
DynamoDB pricing models:
- Provisioned: Pre-allocate read/write capacity units (RCU/WCU). Pay for provisioned capacity regardless of usage.
- On-Demand: Pay per request (read/write). No capacity planning.
When to use each:
| Workload Pattern | Recommended Mode | Reason |
|---|---|---|
| Predictable, steady traffic | Provisioned | Cheaper per request |
| Unpredictable, spiky traffic | On-Demand | No over-provisioning |
| New application (unknown traffic) | On-Demand | No capacity planning |
| High, consistent throughput | Provisioned with auto-scaling | Most cost-effective |
Example cost comparison:
Workload: 10 million reads/month, 5 million writes/month
Provisioned (with auto-scaling):
- 10M reads / 2.5M seconds/month = 4 RCU average
- 5M writes / 2.5M seconds/month = 2 WCU average
- Provision 10 RCU, 5 WCU (headroom for spikes)
Cost: ((10 + 5) CU * $0.00065/CU-hour) * 730 hours ≈ $7.12/month
On-Demand:
- 10M reads * $0.25/million = $2.50
- 5M writes * $1.25/million = $6.25
Cost: $8.75/month
Conclusion: Provisioned cheaper for predictable workload
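The same comparison as code, using the illustrative rates from the example above ($0.00065 per provisioned capacity-unit-hour for both RCUs and WCUs, $0.25 per million on-demand reads, $1.25 per million writes; real RCU and WCU prices differ, so treat this as a sketch):

```python
CU_HOUR = 0.00065        # illustrative price per provisioned RCU- or WCU-hour
OD_READ_PER_M = 0.25     # on-demand reads, USD per million requests
OD_WRITE_PER_M = 1.25    # on-demand writes, USD per million requests
HOURS_PER_MONTH = 730

def provisioned_cost(rcu, wcu):
    """Monthly cost of provisioned capacity (headroom included in rcu/wcu)."""
    return (rcu + wcu) * CU_HOUR * HOURS_PER_MONTH

def on_demand_cost(reads, writes):
    """Monthly cost of on-demand request billing."""
    return reads / 1e6 * OD_READ_PER_M + writes / 1e6 * OD_WRITE_PER_M

print(f"Provisioned (10 RCU, 5 WCU): ${provisioned_cost(10, 5):.2f}")
print(f"On-demand (10M reads, 5M writes): ${on_demand_cost(10e6, 5e6):.2f}")
```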
Switch between modes:
resource "aws_dynamodb_table" "users" {
  name         = "users"
  billing_mode = "PAY_PER_REQUEST" # On-Demand

  # hash_key and a matching attribute block are required by this resource
  # (key name here is illustrative)
  hash_key = "user_id"

  attribute {
    name = "user_id"
    type = "S"
  }

  # Change to provisioned when traffic stabilizes:
  # billing_mode   = "PROVISIONED"
  # read_capacity  = 10
  # write_capacity = 5
}
Network Cost Optimization
Data transfer is a hidden cost driver. Traffic within a region is cheap; cross-region and internet traffic is expensive.
Data Transfer Costs (Approximate)
| Transfer Type | Cost | Notes |
|---|---|---|
| Within AZ | Free | Same AZ, private IP |
| Between AZs | $0.01/GB | Cross-AZ in same region |
| Between Regions | $0.02/GB | Cross-region transfer |
| To Internet | $0.09/GB (first 10 TB) | Egress charges |
| From Internet | Free | Ingress is free |
Key insight: Internet egress is expensive. Minimize by using CloudFront (CDN) for static assets.
VPC Endpoints (Avoid NAT Gateway Costs)
Problem: Instances in private subnets access S3/DynamoDB via NAT Gateway. NAT Gateway charges $0.045/GB processed + $0.045/hour.
Solution: VPC endpoints provide direct access to AWS services without NAT Gateway.
Cost savings:
Traffic: 1 TB/month to S3
Without VPC endpoint: 1000 GB * $0.045 = $45/month
With VPC endpoint: $0/month
Savings: $45/month = $540/year
Create VPC endpoints:
# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.aws_region}.s3"

  route_table_ids = [
    aws_route_table.private.id,
  ]
}

# DynamoDB Gateway Endpoint (free)
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.aws_region}.dynamodb"

  route_table_ids = [
    aws_route_table.private.id,
  ]
}
See AWS Networking for VPC endpoint configuration.
CloudFront for Static Assets
Problem: Serving static assets (images, JavaScript, CSS) directly from S3 incurs egress charges ($0.09/GB).
Solution: Use CloudFront (CDN) to cache content at edge locations. CloudFront egress is cheaper ($0.085/GB, with volume discounts).
Additional benefits:
- Faster load times (content served from edge locations near users)
- Reduced S3 requests (CloudFront caches reduce origin load)
Cost comparison:
Traffic: 10 TB/month static assets
Direct from S3:
- 10,000 GB * $0.09/GB = $900/month
Via CloudFront:
- 10,000 GB * $0.085/GB = $850/month (first 10 TB pricing)
- Plus reduced S3 requests (CloudFront cache hit ratio 80%+)
Savings: ~$50/month + request cost savings
See CloudFront and CDN for caching strategies and File Storage (S3) for S3 integration.
Serverless Cost Optimization
Lambda, API Gateway, and other serverless services bill per use. Optimization focuses on efficient code and configuration.
Lambda Memory Tuning
Lambda pricing:
- GB-second: Memory allocated × execution time
- Requests: $0.20 per 1 million requests
Memory affects both cost and performance: More memory = more CPU → faster execution. Optimal memory balances cost and speed.
Example (duration cost only, assuming ~$0.0000166667 per GB-second; the $0.20-per-million request charge is identical across configurations):
| Memory | Execution Time | Duration Cost per Invocation |
|---|---|---|
| 128 MB | 1000 ms | $0.00000208 |
| 256 MB | 500 ms | $0.00000208 |
| 512 MB | 300 ms | $0.00000250 |
| 1024 MB | 200 ms | $0.00000333 |
Observations:
- 256 MB costs the same as 128 MB (double the memory, half the duration) and finishes twice as fast
- 512 MB is slightly more expensive
- 1024 MB is the most expensive (diminishing returns)
Optimal: 256 MB for this workload.
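Duration cost follows directly from Lambda's GB-second billing. A sketch assuming ~$0.0000166667 per GB-second (x86 on-demand rate at the time of writing; request charges excluded):

```python
GB_SECOND = 0.0000166667  # USD per GB-second -- assumed rate, verify pricing

def duration_cost(memory_mb, duration_ms):
    """Duration cost of a single invocation (request charge excluded)."""
    return (memory_mb / 1024) * (duration_ms / 1000) * GB_SECOND

# Measured duration at each memory size (ms), as observed for this workload
profiles = {128: 1000, 256: 500, 512: 300, 1024: 200}
for mb, ms in profiles.items():
    print(f"{mb:>5} MB: ${duration_cost(mb, ms):.9f} per invocation")
# 128 MB and 256 MB tie on cost; prefer 256 MB -- same price, half the latency
```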
Tool: AWS Lambda Power Tuning
Automates testing different memory configurations:
# Deploy Lambda Power Tuning (one-time setup)
aws cloudformation deploy \
--template-file power-tuning.yaml \
--stack-name lambda-power-tuning \
--capabilities CAPABILITY_IAM
# Run power tuning for a function
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:powerTuning \
--input '{
"lambdaARN": "arn:aws:lambda:us-east-1:123456789:function:my-function",
"powerValues": [128, 256, 512, 1024, 1536, 3008],
"num": 50
}'
Returns optimal memory configuration for cost and performance.
Provisioned Concurrency vs On-Demand
Lambda cold starts: First invocation of a function takes longer (initialize runtime, load code). Subsequent invocations (warm starts) are fast.
Provisioned concurrency: Keep N instances warm at all times. Eliminates cold starts but costs money even when idle.
When to use:
- On-demand: Most workloads (cold starts acceptable, cost-sensitive)
- Provisioned concurrency: Latency-critical APIs where cold starts impact UX (e.g., sub-100ms response time requirements)
Cost comparison (illustrative; the on-demand figure counts request charges only, and duration charges apply under both modes):
Function: 100 requests/second during business hours (8 hours/day), 1024 MB
On-Demand:
- 100 req/s * 8 hours * 3600 s = 2.88M requests/day
- 2.88M * $0.20/million = $0.58/day ≈ $17.40/month
- Cold starts: a handful per hour (acceptable for many workloads)
Provisioned Concurrency (10 instances of 1 GB):
- Always-on: 10 instances * 1 GB * 730 hours = 7,300 GB-hours
- 7,300 GB-hours * ~$0.015/GB-hour ≈ $110/month
- Plus the same request costs: $17.40/month
Total: ≈ $127/month (roughly 7x the on-demand request bill)
Conclusion: Only use provisioned concurrency when cold starts are unacceptable.
See AWS Compute for Lambda optimization patterns.
Budgets and Alerts
Prevent surprise bills with proactive budget alerts.
AWS Budgets
Create budget:
AWS Console → Billing → Budgets → Create Budget
Budget types:
- Cost budget: Alert when spending exceeds threshold
- Usage budget: Alert when usage (e.g., EC2 hours) exceeds threshold
- Reservation budget: Alert when RI/Savings Plan utilization drops below target
Example budget configuration:
resource "aws_budgets_budget" "monthly_cost" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "10000" # $10,000/month
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80 # Alert at 80% ($8,000)
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [
      "[email protected]",
      "[email protected]",
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100 # Alert at 100% ($10,000)
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [
      "[email protected]",
      "[email protected]",
      "[email protected]",
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90 # Forecast alert
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED" # Based on trend
    subscriber_email_addresses = [
      "[email protected]",
    ]
  }
}
Budget actions (automated responses):
resource "aws_budgets_budget_action" "stop_ec2_on_overspend" {
  budget_name        = aws_budgets_budget.monthly_cost.name
  action_type        = "RUN_SSM_DOCUMENTS"
  approval_model     = "AUTOMATIC"
  notification_type  = "ACTUAL"
  execution_role_arn = aws_iam_role.budget_action.arn

  action_threshold {
    action_threshold_type  = "PERCENTAGE"
    action_threshold_value = 100
  }

  definition {
    ssm_action_definition {
      action_sub_type = "STOP_EC2_INSTANCES"
      region          = "us-east-1"
      instance_ids    = [aws_instance.dev.id] # Stop dev instances
    }
  }

  # A subscriber block is required by this resource
  subscriber {
    address           = "[email protected]"
    subscription_type = "EMAIL"
  }
}
This automatically stops dev instances when budget is exceeded (prevents runaway costs).
Cost Anomaly Detection
AWS uses machine learning to detect unusual spending patterns:
AWS Console → Billing → Cost Anomaly Detection → Create Monitor
Example: Your typical daily S3 spend is $50. One day it jumps to $500 (10x increase). Cost Anomaly Detection alerts you immediately.
Enable alerts:
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "service-cost-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE" # Monitor per AWS service
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "cost-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service_monitor.arn,
  ]

  subscriber {
    type    = "EMAIL"
    address = "[email protected]"
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"] # Alert if anomaly > $100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
FinOps Practices
FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending. It combines technology, business, and finance teams to optimize costs.
Cost Ownership
Principle: Teams that create cloud resources own their costs.
Implementation:
- Tag all resources with team/project/cost center
- Show teams their spending (dashboards, monthly reports)
- Set team budgets and hold teams accountable
- Incentivize optimization (reward teams for reducing costs while maintaining performance)
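Beyond dashboards, the same per-team numbers can be pulled programmatically. A minimal sketch against the Cost Explorer GetCostAndUsage API, assuming a "team" cost allocation tag has been activated in the billing console (the tag key and date range are illustrative):

```python
# team_spend.py - sketch: sum unblended cost per value of a "team" cost
# allocation tag. Tag key and date range are assumptions for illustration.

def summarize_by_team(response):
    """Flatten a GetCostAndUsage response into {team: total_cost}."""
    totals = {}
    for period in response['ResultsByTime']:
        for group in period['Groups']:
            # Group keys look like "team$platform"; strip the "team$" prefix.
            team = group['Keys'][0].split('$', 1)[-1] or 'untagged'
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            totals[team] = totals.get(team, 0.0) + cost
    return totals

def fetch_team_spend(start, end):
    """Query Cost Explorer for monthly cost grouped by the "team" tag."""
    import boto3  # imported lazily so summarize_by_team() is testable offline
    ce = boto3.client('ce')
    response = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},  # e.g. '2024-01-01' / '2024-07-01'
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'TAG', 'Key': 'team'}],
    )
    return summarize_by_team(response)
```

Pagination via NextPageToken is omitted for brevity; long date ranges with many tag values may need it.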
Cost dashboard example (QuickSight):
```sql
-- Query for team spending dashboard
SELECT
    resource_tags_user_team AS team,
    DATE_TRUNC('month', line_item_usage_start_date) AS month,
    SUM(line_item_unblended_cost) AS total_cost
FROM
    cost_and_usage_report
WHERE
    line_item_usage_start_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6' MONTH)
GROUP BY
    1, 2
ORDER BY
    month DESC,
    total_cost DESC;
```
Display this in a dashboard accessible to all teams. Visibility drives accountability.
Regular Optimization Reviews
Run a quarterly cost optimization review: pull the quarter's spend by service and team, walk through Trusted Advisor and Compute Optimizer findings, and track savings from the previous review's action items.
Assign ownership: the platform/DevOps team leads the review, but every engineering team participates.
Cost-Aware Architecture
Design principle: Consider cost implications during architectural decisions.
Examples:
- Choose appropriate storage: Use S3 Glacier Deep Archive for long-term archives, not S3 Standard ($0.00099/GB-month vs $0.023/GB-month)
- Minimize data transfer: Keep compute and data in same region/AZ
- Use serverless for variable workloads: Lambda for sporadic tasks, not EC2 running 24/7
- Cache aggressively: CloudFront, ElastiCache reduce origin load and data transfer
- Design for auto-scaling: Don't run fixed capacity; scale to demand
Include cost estimation in architectural design reviews. Ask: "What will this cost at 10x scale?"
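To make the "10x scale" question concrete, even a back-of-the-envelope script helps in a design review. A sketch using the simplified per-GB-month prices cited above (verify against current pricing; the archive size is a hypothetical example):

```python
# storage_cost_estimate.py - rough monthly S3 storage cost at 1x and 10x scale.
# Prices are the simplified US East figures cited above; verify before relying on them.
PRICE_PER_GB_MONTH = {
    's3_standard': 0.023,
    's3_glacier_deep_archive': 0.00099,
}

def monthly_cost(gb, storage_class):
    """Storage-only monthly cost; ignores requests, retrieval, and transfer."""
    return gb * PRICE_PER_GB_MONTH[storage_class]

archive_gb = 5_000  # hypothetical archive dataset size
for scale in (1, 10):
    gb = archive_gb * scale
    std = monthly_cost(gb, 's3_standard')
    deep = monthly_cost(gb, 's3_glacier_deep_archive')
    print(f"{gb:>7,} GB: Standard ${std:,.2f}/mo vs Deep Archive ${deep:,.2f}/mo "
          f"(saves ${std - deep:,.2f}/mo)")
```

Note that Deep Archive adds retrieval fees and 12+ hour restore times, so the comparison only holds for rarely accessed data.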
Tools for Cost Optimization
AWS Compute Optimizer
Recommends optimal instance types, EBS volumes, and Lambda configurations based on utilization metrics.
Access: AWS Console → Compute Optimizer
Automate with AWS CLI:
```bash
# Get EC2 recommendations
aws compute-optimizer get-ec2-instance-recommendations

# Get Lambda recommendations
aws compute-optimizer get-lambda-function-recommendations

# Get EBS recommendations
aws compute-optimizer get-ebs-volume-recommendations
```
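The same recommendations are available through boto3. A sketch that filters for over-provisioned instances; the finding value is normalized to tolerate both the "Overprovisioned" and "OVER_PROVISIONED" spellings seen in documentation and responses, and the parsing helper is pure so it can be tested without AWS access:

```python
# rightsizing_report.py - sketch: flag over-provisioned instances from
# Compute Optimizer with the top-ranked replacement suggestion.

def over_provisioned(recommendations):
    """Return (current_type, suggested_type) for over-provisioned findings."""
    flagged = []
    for rec in recommendations:
        # Normalize e.g. "OVER_PROVISIONED" / "Overprovisioned" to one form
        finding = rec.get('finding', '').replace('_', '').upper()
        if finding == 'OVERPROVISIONED':
            options = rec.get('recommendationOptions', [])
            suggested = options[0]['instanceType'] if options else None
            flagged.append((rec['currentInstanceType'], suggested))
    return flagged

def fetch_flagged():
    import boto3  # imported lazily so over_provisioned() is testable offline
    client = boto3.client('compute-optimizer')
    page = client.get_ec2_instance_recommendations()
    return over_provisioned(page['instanceRecommendations'])
```

Pagination via nextToken is omitted for brevity; accounts with many instances will need it.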
AWS Trusted Advisor
Provides checks across cost, performance, security, fault tolerance, and service limits.
Cost checks include:
- Idle RDS instances
- Underutilized EC2 instances
- Unassociated Elastic IPs
- Low-utilization EBS volumes
- Idle load balancers
Access: AWS Console → Trusted Advisor
Automate checks:
```python
# trusted_advisor_checks.py
import boto3

# The Support API (which backs Trusted Advisor) is only available in us-east-1
support = boto3.client('support', region_name='us-east-1')

def get_cost_optimization_checks():
    """Print Trusted Advisor cost optimization recommendations."""
    # Get all checks
    checks = support.describe_trusted_advisor_checks(language='en')

    # Filter for the cost optimization category
    cost_checks = [c for c in checks['checks'] if c['category'] == 'cost_optimizing']

    for check in cost_checks:
        result = support.describe_trusted_advisor_check_result(
            checkId=check['id'],
            language='en'
        )
        flagged_resources = result['result'].get('flaggedResources', [])
        if flagged_resources:
            print(f"\nCheck: {check['name']}")
            print(f"Description: {check['description']}")
            print(f"Flagged Resources: {len(flagged_resources)}")
            for resource in flagged_resources[:5]:  # Show first 5
                print(f"  - {resource.get('resourceId')}: {resource.get('status')}")

if __name__ == '__main__':
    get_cost_optimization_checks()
```
Full Trusted Advisor checks require a Business or Enterprise support plan; lower-tier plans include only a limited set of checks.
Third-Party Tools
- CloudHealth (VMware): Multi-cloud cost management, showback/chargeback
- Cloudability (Apptio): Cost analytics, budgeting, forecasting
- Spot.io: Auto-scaling optimization, Spot instance management
- Kubecost: Kubernetes-specific cost monitoring and optimization
Common Cost Pitfalls
Unused Resources
Problem: Resources created then forgotten.
Examples:
- Stopped EC2 instances (still pay for EBS volumes)
- Unused Elastic IPs ($0.005/hour ≈ $3.65/month per IP)
- Idle RDS instances (dev/test databases running 24/7)
- Orphaned EBS volumes (detached from deleted instances)
- Old snapshots (never cleaned up)
Detection:
```bash
# Find stopped instances with EBS volumes
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query "Reservations[].Instances[].[InstanceId,State.Name,BlockDeviceMappings[].Ebs.VolumeId]"

# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query "Volumes[].[VolumeId,Size,CreateTime]"

# Find idle Elastic IPs
aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].[PublicIp,AllocationId]"
```
Automated cleanup: Tag resources with AutoDelete: true and run Lambda to delete after N days inactive.
Over-Provisioning
Problem: "Better safe than sorry" mentality leads to massive over-provisioning.
Example: Provisioning db.r6g.4xlarge (16 vCPU, 128 GB RAM) for a database that uses 2 vCPU and 16 GB RAM. 8x over-provisioned → wasting ~$600/month.
Fix: Start small, monitor, scale up if needed. It's easier to scale up than justify downsizing.
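Grounding right-sizing decisions in utilization data makes the downsizing case much easier to argue. A sketch that flags instances whose recent average CPU sits below a threshold; the 10% bar and 14-day window are assumptions to tune per workload:

```python
# idle_check.py - sketch: flag an instance as a right-sizing candidate if its
# average CPU over the lookback window is below a threshold. The 10% threshold
# and 14-day window are assumptions; tune them per workload.
from datetime import datetime, timedelta

def is_underutilized(datapoints, threshold_pct=10.0):
    """True if the mean of CloudWatch 'Average' datapoints is below threshold."""
    if not datapoints:
        return False  # no data: don't flag (instance may be brand new)
    mean = sum(dp['Average'] for dp in datapoints) / len(datapoints)
    return mean < threshold_pct

def check_instance(instance_id):
    import boto3  # imported lazily so is_underutilized() is testable offline
    cw = boto3.client('cloudwatch')
    stats = cw.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=14),
        EndTime=datetime.utcnow(),
        Period=3600,  # hourly datapoints
        Statistics=['Average'],
    )
    return is_underutilized(stats['Datapoints'])
```

CPU alone can mislead (memory-bound workloads need the CloudWatch agent's memory metrics), so treat a flag as a prompt to investigate, not an automatic downsize.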
Ignoring Data Transfer
Problem: Not realizing data transfer costs until bill arrives.
Examples:
- Cross-region replication without understanding cost ($0.02/GB)
- Serving videos directly from S3 instead of CloudFront
- NAT Gateway for S3 access (use VPC endpoints)
Fix: Design data flows to minimize transfer. Keep compute and data co-located.
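The NAT Gateway pitfall is easy to quantify, since a gateway VPC endpoint for S3 carries no hourly or per-GB charge while a NAT Gateway bills both. A rough comparison using common US East list prices (assumptions; check current pricing):

```python
# nat_vs_endpoint.py - rough monthly cost of routing S3 traffic through a
# NAT Gateway vs a (free) S3 gateway VPC endpoint. Prices are assumed
# US East list prices; verify before relying on them.
NAT_HOURLY = 0.045        # $/hour per NAT Gateway
NAT_PER_GB = 0.045        # $/GB processed
HOURS_PER_MONTH = 730

def nat_monthly_cost(gb_per_month, gateways=1):
    """Hourly charge plus per-GB processing for traffic through NAT."""
    return gateways * NAT_HOURLY * HOURS_PER_MONTH + gb_per_month * NAT_PER_GB

for gb in (100, 1_000, 10_000):
    print(f"{gb:>6,} GB/mo via NAT Gateway: ${nat_monthly_cost(gb):,.2f}  "
          f"via S3 gateway endpoint: $0.00")
```

At 10 TB/month the NAT path costs hundreds of dollars for traffic a gateway endpoint would carry for free.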
Not Using Reserved Capacity
Problem: Running production workloads on-demand for years.
Example: $10,000/month EC2 on-demand → could be $6,000/month with Savings Plans → wasting $4,000/month = $48,000/year.
Fix: Review Cost Explorer, identify consistent workloads, purchase Savings Plans.
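The arithmetic above generalizes into a quick "money left on the table" check. A sketch, with the ~40% discount as an illustrative assumption (real Savings Plans rates vary by term, payment option, and instance family):

```python
# savings_gap.py - annualize the gap between on-demand spend and a committed
# rate. The 40% default discount is illustrative only; look up actual rates
# in the Savings Plans console before committing.
def annual_waste(monthly_on_demand, discount=0.40):
    """Dollars per year left unrealized by staying on-demand."""
    return monthly_on_demand * discount * 12

# The $10,000/month example above works out to roughly $48,000/year.
print(f"${annual_waste(10_000):,.0f}/year")
```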
Forgetting to Clean Up Test Environments
Problem: Test infrastructure left running after testing completes.
Example: Load testing creates 50 EC2 instances. Test finishes, instances forgotten. $2,000/month waste.
Fix: Tag test resources with Environment: test and TTL: 2024-01-15. Lambda deletes resources after TTL.
```python
# lambda_ttl_cleanup.py
import boto3
from datetime import datetime

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """Terminate instances whose TTL (Time To Live) tag has expired."""
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag-key', 'Values': ['TTL']},
            # Skip instances already shutting down or terminated
            {'Name': 'instance-state-name', 'Values': ['running', 'stopped']},
        ]
    )
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            ttl_tag = next(
                (t['Value'] for t in instance.get('Tags', []) if t['Key'] == 'TTL'),
                None
            )
            if ttl_tag:
                try:
                    ttl_date = datetime.fromisoformat(ttl_tag)
                except ValueError:
                    print(f"Skipping {instance_id}: unparseable TTL '{ttl_tag}'")
                    continue
                if datetime.now() > ttl_date:
                    print(f"Terminating {instance_id} (TTL expired: {ttl_tag})")
                    ec2.terminate_instances(InstanceIds=[instance_id])
```
Summary
AWS cost optimization is a continuous process combining visibility, accountability, and architectural decisions:
- Visibility: Use Cost Explorer, cost allocation tags, and Cost and Usage Reports to understand spending
- Compute: Right-size instances, use Savings Plans/Reserved Instances, leverage Spot instances, auto-scale
- Storage: Implement S3 lifecycle policies, use Intelligent-Tiering, clean up snapshots
- Database: Right-size RDS, use Aurora Serverless for variable workloads, optimize DynamoDB billing mode
- Network: Use VPC endpoints, CloudFront for static assets, minimize cross-region transfers
- Serverless: Tune Lambda memory, avoid provisioned concurrency unless necessary
- Budgets: Set up AWS Budgets and Cost Anomaly Detection for proactive alerts
- FinOps: Establish cost ownership, regular optimization reviews, cost-aware architecture
Cost optimization is not a destination but a practice. Set up quarterly reviews, automate cleanup of waste, and make cost a factor in architectural decisions.
Further Reading
- AWS Compute Services - EC2, ECS, Lambda right-sizing and optimization
- AWS Databases - RDS, Aurora, DynamoDB optimization strategies
- AWS Storage - EBS, EFS optimization
- File Storage (S3) - S3 lifecycle policies and storage classes
- AWS Networking - VPC endpoints, data transfer optimization
- CloudFront and CDN - CDN for reducing egress costs
- Terraform Best Practices - IaC for consistent tagging
- AWS Well-Architected Framework - Cost Optimization Pillar