AWS Cost Optimization
Cloud costs can spiral out of control without deliberate optimization. Unlike traditional infrastructure where costs are fixed (you own the hardware), cloud costs are variable - every resource provisioned, every API call made, every gigabyte transferred incurs charges. This creates opportunity (pay only for what you use) and risk (uncontrolled usage leads to unexpected bills).
Cost optimization is not a one-time activity but a continuous practice. It requires visibility into spending, accountability through cost allocation, architectural decisions that favor efficiency, and ongoing monitoring to catch waste. The most effective cost optimization combines technical improvements (right-sizing instances, using Reserved Instances) with organizational practices (cost ownership, budget alerts, architectural reviews).
The AWS Well-Architected Framework identifies cost optimization as one of its six pillars. The principle is simple: achieve business outcomes while minimizing costs. This means understanding spending patterns, eliminating waste, and selecting cost-effective resources without sacrificing performance, reliability, or security.
Cost optimization means spending efficiently to achieve business goals. Cost cutting means reducing spend at the expense of outcomes. Focus on optimizing - getting more value per dollar spent - rather than arbitrary budget reductions that harm service quality.
Core Principles
- Cost Visibility: Understand what you're spending and where
- Cost Allocation: Attribute costs to teams, projects, environments
- Right-Sizing: Match resources to actual workload requirements
- Waste Elimination: Identify and remove unused/idle resources
- Cost-Aware Architecture: Design applications with cost implications in mind
- Continuous Optimization: Regularly review and optimize spending
- Cost Ownership: Teams responsible for resources own their costs
Cost Visibility
You cannot optimize what you cannot measure. AWS provides multiple tools for understanding costs, but they require configuration to be useful.
AWS Cost Explorer
Cost Explorer provides interactive visualization of spending over time. It answers questions like "What did we spend last month?" and "Which services cost the most?"
Key features:
- Trends: Spending over time (daily, monthly, yearly)
- Service breakdown: Costs by AWS service (EC2, RDS, S3, data transfer, etc.)
- Filtering: By account, region, service, tag, instance type
- Forecasting: Predict future costs based on historical trends
Example analysis workflow:
Access Cost Explorer:
AWS Console → Billing → Cost Explorer
Use filters to drill down: "Show EC2 costs in us-east-1 for production environment tagged resources."
Cost Allocation Tags
Tags are key-value pairs attached to AWS resources. Cost allocation tags enable you to categorize and track costs by dimensions meaningful to your organization (team, project, environment, cost center).
Standard tagging strategy:
Environment: prod | staging | dev
Team: platform | payments | mobile
Application: api-gateway | user-service | notification-service
CostCenter: engineering | marketing | operations
Owner: [email protected]
Project: customer-onboarding | fraud-detection
Why this matters: Without tags, all costs appear as undifferentiated spending. With tags, you can answer:
- "How much does the payments team spend vs mobile team?"
- "What's the cost of our production environment vs staging?"
- "Which projects are most expensive to run?"
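Once tags exist, each of these questions is a group-by over billing line items. A minimal pure-Python sketch (the records are hypothetical stand-ins for CUR data):

```python
from collections import defaultdict

# Hypothetical line items: ({tag_key: tag_value}, unblended cost in USD)
line_items = [
    ({"Team": "payments", "Environment": "prod"}, 1200.0),
    ({"Team": "payments", "Environment": "staging"}, 150.0),
    ({"Team": "mobile", "Environment": "prod"}, 800.0),
    ({}, 75.0),  # untagged spend is surfaced rather than hidden
]

def cost_by_tag(items, tag_key):
    """Sum cost per value of a cost allocation tag."""
    totals = defaultdict(float)
    for tags, cost in items:
        totals[tags.get(tag_key, "(untagged)")] += cost
    return dict(totals)

print(cost_by_tag(line_items, "Team"))
print(cost_by_tag(line_items, "Environment"))
```

Note how untagged spend gets its own bucket: visible untagged cost is what motivates the enforcement policies below.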
Implementing Tagging with Terraform
# Enforce consistent tagging via Terraform provider defaults
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
      Team        = var.team
      Application = var.application
      CostCenter  = var.cost_center
      Owner       = var.owner_email
    }
  }
}

# All resources automatically inherit these tags
resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type

  # Additional resource-specific tags
  tags = {
    Name = "${var.environment}-app-server"
    Role = "application-server"
  }
}
Benefit: Every resource created by Terraform automatically has standard tags. This prevents untagged resources and ensures consistent cost allocation. See Terraform Best Practices for comprehensive tagging patterns.
Enforcing Tagging with IAM Policies
Prevent resource creation without required tags:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTagsOnResourceCreation",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotLike": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev"],
          "aws:RequestTag/Team": "*",
          "aws:RequestTag/CostCenter": "*"
        }
      }
    }
  ]
}
This policy is intended to deny EC2, RDS, and S3 resource creation unless the request includes Environment, Team, and CostCenter tags with valid values. One IAM subtlety to watch: multiple keys inside a single condition block are ANDed, so a combined Deny like this only fires when all three checks fail. Production-grade enforcement typically uses one Deny statement per required tag (or `ForAllValues` with `aws:TagKeys`).
Tagging Existing Untagged Resources
Find untagged resources:
# Find EC2 instances without Environment tag
aws ec2 describe-instances \
--query 'Reservations[].Instances[?!Tags[?Key==`Environment`]].[InstanceId,State.Name]' \
--output table
# Find RDS instances without Team tag
aws rds describe-db-instances \
--query 'DBInstances[?!TagList[?Key==`Team`]].[DBInstanceIdentifier,DBInstanceClass]' \
--output table
Tag resources in bulk:
# Tag multiple EC2 instances
aws ec2 create-tags \
--resources i-1234567890abcdef0 i-0987654321fedcba0 \
--tags Key=Environment,Value=prod Key=Team,Value=platform
Automated tagging with Lambda:
# lambda_auto_tagger.py
import boto3
import json

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """
    Automatically tag new EC2 instances based on creator's identity.
    Triggered by an EventBridge (CloudWatch Events) rule on the RunInstances API call.
    """
    detail = event['detail']
    instance_ids = []
    for item in detail['responseElements']['instancesSet']['items']:
        instance_ids.append(item['instanceId'])

    # Get creator information from the CloudTrail event
    user_identity = detail['userIdentity']
    creator = user_identity.get('principalId', 'unknown')

    # Apply default tags
    ec2.create_tags(
        Resources=instance_ids,
        Tags=[
            {'Key': 'CreatedBy', 'Value': creator},
            {'Key': 'AutoTagged', 'Value': 'true'},
            {'Key': 'CreatedAt', 'Value': detail['eventTime']},
        ]
    )
    return {
        'statusCode': 200,
        'body': json.dumps(f'Tagged {len(instance_ids)} instances')
    }
Deploy this Lambda with an EventBridge rule triggered on RunInstances API calls. New instances automatically receive creator tags for accountability.
AWS Cost and Usage Report (CUR)
Cost Explorer provides aggregated views, but Cost and Usage Report gives granular, line-item billing data. CUR delivers hourly usage data to S3 for analysis with tools like Athena, QuickSight, or third-party BI platforms.
CUR contains:
- Every resource used (instance ID, S3 bucket name)
- Usage amount (instance hours, GB transferred)
- Cost (exact dollar amount)
- Tags (all cost allocation tags)
- Pricing details (on-demand, reserved, spot)
Enable CUR:
AWS Console → Billing → Cost & Usage Reports → Create Report
Query CUR with Athena:
-- Find top 10 most expensive resources last month
SELECT
  line_item_resource_id,
  product_product_name,
  SUM(line_item_unblended_cost) AS total_cost
FROM
  cost_and_usage_report
WHERE
  line_item_usage_start_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1' MONTH)
  AND line_item_usage_start_date < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY
  line_item_resource_id,
  product_product_name
ORDER BY
  total_cost DESC
LIMIT 10;
This identifies your most expensive individual resources (specific EC2 instances, RDS databases, S3 buckets) for targeted optimization.
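For ad-hoc analysis without Athena, the same top-N aggregation over exported CUR rows is a few lines of Python (the rows below are made-up stand-ins for real line items):

```python
from collections import Counter

# Hypothetical (resource_id, service, cost) tuples from a CUR export
rows = [
    ("i-0abc", "Amazon EC2", 310.0),
    ("db-prod", "Amazon RDS", 290.0),
    ("i-0abc", "Amazon EC2", 45.0),   # same resource, another line item
    ("bucket-logs", "Amazon S3", 12.0),
]

def top_resources(rows, n=10):
    """Sum cost per resource and return the n most expensive."""
    totals = Counter()
    for resource_id, _service, cost in rows:
        totals[resource_id] += cost
    return totals.most_common(n)

print(top_resources(rows, n=2))  # i-0abc first at 355.0, then db-prod at 290.0
```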
Compute Cost Optimization
Compute (EC2, ECS, Lambda) typically represents the largest portion of AWS bills. Optimization focuses on right-sizing, selecting appropriate pricing models, and auto-scaling.
Right-Sizing EC2 Instances
Problem: Over-provisioned instances waste money. An application running on r6g.2xlarge (8 vCPU, 64 GB RAM) but only using 10% CPU and 20% memory is wasting ~$300/month.
Solution: Use AWS Compute Optimizer and CloudWatch metrics to identify underutilized instances.
AWS Compute Optimizer
Compute Optimizer analyzes historical utilization (CPU, memory, network, disk) and recommends optimal instance types.
Example recommendation:
Current: r6g.2xlarge (8 vCPU, 64 GB RAM) - $430/month
Recommendation: r6g.large (2 vCPU, 16 GB RAM) - $107/month
Savings: $323/month (75% reduction)
Confidence: High (based on 14 days of metrics)
Reasoning:
- Average CPU: 12%
- Peak CPU: 25%
- Average Memory: 18%
- r6g.large provides sufficient capacity with headroom
Access Compute Optimizer:
AWS Console → Compute Optimizer → EC2 Recommendations
Automated right-sizing script:
# right_size_ec2.py
import boto3

optimizer = boto3.client('compute-optimizer')

def get_recommendations():
    """Print EC2 right-sizing recommendations worth more than $50/month."""
    response = optimizer.get_ec2_instance_recommendations()
    for rec in response['instanceRecommendations']:
        instance_id = rec['instanceArn'].split('/')[-1]
        current_type = rec['currentInstanceType']
        # Check the top-ranked option for a savings opportunity
        for option in rec['recommendationOptions']:
            if option['rank'] == 1:  # Top recommendation
                recommended_type = option['instanceType']
                savings = option.get('savingsOpportunity', {})
                monthly = savings.get('estimatedMonthlySavings', {}).get('value', 0)
                if monthly > 50:
                    print(f"Instance: {instance_id}")
                    print(f"  Current: {current_type}")
                    print(f"  Recommended: {recommended_type}")
                    print(f"  Monthly Savings: ${monthly:.2f}")
                    print(f"  Finding: {rec['finding']}")  # e.g. OVER_PROVISIONED
                    print()

if __name__ == '__main__':
    get_recommendations()
Run this monthly to identify optimization opportunities.
Don't right-size too aggressively. Leave 20-30% headroom for traffic spikes. Monitor application performance after downsizing. If response times increase or error rates rise, the instance was too small.
Savings Plans and Reserved Instances
Problem: On-demand instances are flexible but expensive. If you run instances 24/7, you're overpaying.
Solution: Commit to usage with Savings Plans or Reserved Instances for 30-70% discounts.
Savings Plans vs Reserved Instances
| Feature | Savings Plans | Reserved Instances |
|---|---|---|
| Discount | Up to 72% | Up to 72% |
| Flexibility | Any instance family/size/region (Compute SP) | Specific instance type/region |
| Term | 1 or 3 years | 1 or 3 years |
| Payment | All upfront / Partial / No upfront | All upfront / Partial / No upfront |
| Applies to | EC2, Lambda, Fargate | EC2 only |
| Recommendation | Preferred (more flexible) | Legacy option |
Savings Plans are generally better: They automatically apply to any EC2 instance family, Lambda, or Fargate usage. You don't need to predict exact instance types.
Example:
Baseline spend: $10,000/month on EC2 (mix of m6i.large, r6g.xlarge, c6g.2xlarge)
Purchase Compute Savings Plan: $6,000/month commitment (1-year, no upfront)
Discount rate: 40%
Result:
- $6,000/month at 40% discount = $10,000/month worth of compute
- Covers all $10,000 of usage
- Savings: $4,000/month = $48,000/year
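The example generalizes: a commitment of C dollars per month at discount rate d covers C / (1 - d) worth of on-demand usage. A hedged helper (illustrative arithmetic, not a pricing API):

```python
def savings_plan_value(commitment, discount_rate, on_demand_spend):
    """Return (covered_usage, monthly_savings) for a Savings Plan commitment.

    commitment       -- monthly commitment in post-discount dollars
    discount_rate    -- e.g. 0.40 for a 40% discount
    on_demand_spend  -- what the workload would cost purely on-demand
    """
    covered = commitment / (1 - discount_rate)  # on-demand-equivalent coverage
    covered = min(covered, on_demand_spend)     # can't save on unused commitment
    savings = covered - commitment              # negative if over-committed
    return covered, savings

covered, savings = savings_plan_value(6000, 0.40, 10_000)
print(covered, savings)
```

The over-commitment branch matters: if actual usage falls below the coverage the commitment buys, savings shrink and can go negative, which is why AWS bases its recommendations on trailing usage.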
How to purchase:
AWS Console → Billing → Savings Plans → Recommendations
AWS analyzes your usage and recommends commitment amounts based on historical patterns.
Terraform for Reserved Instances (if needed):
Reserved Instance purchases cannot be made via Terraform (AWS API limitation), but you can track them:
# Document RI commitments (for reference)
locals {
  reserved_instances = {
    "payment-api-prod" = {
      instance_type = "r6g.xlarge"
      count         = 5
      term          = "1-year"
      payment       = "no-upfront"
      monthly_cost  = 285 # $57/instance * 5
    }
  }
}
See AWS Compute for detailed instance type selection and sizing strategies.
Spot Instances
Problem: On-demand and Reserved Instances are expensive for fault-tolerant, interruptible workloads.
Solution: Use Spot Instances (up to 90% discount) for workloads that can handle interruptions.
Suitable workloads:
- Batch processing jobs
- Data analysis (big data, ML training)
- CI/CD pipeline agents
- Development/test environments
- Stateless containerized applications with auto-scaling
Not suitable for:
- Databases (interruption causes downtime)
- Long-running stateful processes
- Production APIs without redundancy
Example: Spot Instances for ECS tasks:
resource "aws_ecs_capacity_provider" "spot" {
  name = "spot-capacity-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.ecs_spot.arn

    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 80 # Target 80% utilization
      minimum_scaling_step_size = 1
      maximum_scaling_step_size = 100
    }
  }
}

resource "aws_autoscaling_group" "ecs_spot" {
  name             = "ecs-spot-asg"
  min_size         = 1
  max_size         = 10
  desired_capacity = 3

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0 # All Spot
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.ecs.id
      }

      # Request multiple instance types for availability
      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.large"
      }
      override {
        instance_type = "t2.large"
      }
    }
  }
}
Key configuration: spot_allocation_strategy = "price-capacity-optimized" balances cost and interruption risk.
Handling Spot interruptions:
Spot Instances can be interrupted with a two-minute warning. Handle it gracefully:
# spot_interrupt_handler.py
import time

import requests

# AWS publishes the interruption notice via instance metadata;
# this endpoint returns 404 until an interruption is scheduled.
METADATA_URL = 'http://169.254.169.254/latest/meta-data/spot/instance-action'

def check_spot_interruption():
    """Check whether this Spot instance is scheduled for interruption.

    Uses IMDSv1 for brevity; instances that enforce IMDSv2 must first
    fetch a session token and pass it in the request headers.
    """
    try:
        response = requests.get(METADATA_URL, timeout=1)
        if response.status_code == 200:
            return True  # Interruption scheduled
    except requests.exceptions.RequestException:
        pass  # Metadata unreachable -- treat as no interruption
    return False

def graceful_shutdown():
    """Drain tasks and shut down gracefully."""
    print("Spot interruption detected. Draining tasks...")
    # Deregister from the load balancer
    # Stop accepting new work
    # Finish in-progress tasks (up to 120 seconds)
    time.sleep(5)  # Simulate work drain
    print("Shutdown complete")
    raise SystemExit(0)

if __name__ == '__main__':
    while True:
        if check_spot_interruption():
            graceful_shutdown()
        time.sleep(5)  # Poll every 5 seconds
Run this as a sidecar process on Spot instances. When interruption is detected, gracefully shut down.
Auto-Scaling for Cost Optimization
Problem: Running fixed capacity 24/7 wastes money during low-traffic periods.
Solution: Auto-scale based on demand. Scale down during nights/weekends, scale up during business hours or traffic spikes.
Example: Schedule-based scaling for dev environments:
# Scale down dev environment outside business hours
resource "aws_autoscaling_schedule" "scale_down_evening" {
scheduled_action_name = "scale-down-evening"
autoscaling_group_name = aws_autoscaling_group.dev.name
recurrence = "0 18 * * MON-FRI" # 6 PM weekdays
min_size = 0
max_size = 0
desired_capacity = 0
}
resource "aws_autoscaling_schedule" "scale_up_morning" {
scheduled_action_name = "scale-up-morning"
autoscaling_group_name = aws_autoscaling_group.dev.name
recurrence = "0 8 * * MON-FRI" # 8 AM weekdays
min_size = 1
max_size = 3
desired_capacity = 2
}
Savings: The dev environment runs 10 hours per weekday instead of 24/7. Counting weekdays alone that is a 58% reduction; since the schedule also leaves weekends scaled to zero (down Friday 6 PM, up Monday 8 AM), the actual reduction is about 70% (50 of 168 weekly hours).
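The savings figure is pure hours arithmetic; a quick sketch to sanity-check schedules like the one above:

```python
def schedule_savings(hours_on_per_day, days_per_week=5):
    """Fraction of a 24/7 bill saved by only running on a schedule."""
    hours_on = hours_on_per_day * days_per_week
    return 1 - hours_on / (24 * 7)

# 8 AM - 6 PM, weekdays only: 50 of 168 weekly hours
print(f"{schedule_savings(10):.0%} saved")  # → 70% saved
```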
Metric-based scaling for production:
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up-on-cpu"
  autoscaling_group_name = aws_autoscaling_group.prod.name
  policy_type            = "TargetTrackingScaling"

  # Target tracking manages capacity itself, so simple-scaling arguments
  # (adjustment_type, scaling_adjustment) do not apply here
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0 # Maintain 70% average CPU
  }
}
Auto-scaling ensures you pay only for capacity you need.
Storage Cost Optimization
Storage costs accumulate from S3 objects, EBS volumes, snapshots, and data transfer. Optimization focuses on lifecycle management and storage tiering.
S3 Lifecycle Policies
Problem: Objects stored in S3 Standard forever, even if rarely accessed.
Solution: Automatically transition objects to cheaper storage classes based on access patterns.
S3 storage class pricing (approximate, us-east-1):
| Storage Class | $/GB-month | Retrieval Cost | Use Case |
|---|---|---|---|
| Standard | $0.023 | None | Frequently accessed |
| Intelligent-Tiering | $0.023 + monitoring | None | Unknown access patterns |
| Standard-IA | $0.0125 | $0.01/GB | Infrequently accessed (monthly) |
| Glacier Instant | $0.004 | $0.03/GB | Archive with instant retrieval |
| Glacier Flexible | $0.0036 | $0.02/GB + 3-5 hour wait | Archive, rarely accessed |
| Glacier Deep Archive | $0.00099 | $0.02/GB + 12 hour wait | Long-term archive (yearly access) |
Lifecycle policy example:
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "archive-old-logs"
    status = "Enabled"

    # Transition to IA after 30 days
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    # Transition to Glacier after 90 days
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # Delete after 365 days
    expiration {
      days = 365
    }

    # Delete incomplete multipart uploads after 7 days
    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  rule {
    id     = "delete-old-versions"
    status = "Enabled"

    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "GLACIER"
    }

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
Savings example:
- 1 TB of logs in S3 Standard: $23.55/month
- After 30 days → Standard-IA: $12.80/month (46% savings)
- After 90 days → Glacier: $3.69/month (84% savings)
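These savings are plain GB-times-rate arithmetic. A sketch using the approximate us-east-1 prices from the table above (prices drift; verify before relying on them):

```python
# Approximate $/GB-month, us-east-1 -- illustrative, check current pricing
S3_PRICES = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_storage_cost(gb, storage_class):
    """Monthly storage cost in USD (retrieval and request fees excluded)."""
    return gb * S3_PRICES[storage_class]

tb = 1024  # 1 TB in GB
for cls in ("STANDARD", "STANDARD_IA", "GLACIER"):
    print(f"{cls}: ${monthly_storage_cost(tb, cls):.2f}/month")
```

Remember that the cheaper tiers add retrieval fees, so the storage rate alone understates the cost of data you still read often.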
S3 Intelligent-Tiering
Problem: You don't know access patterns (some objects frequently accessed, others rarely).
Solution: S3 Intelligent-Tiering automatically moves objects between access tiers.
How it works:
- Objects not accessed for 30 days → Infrequent Access tier (40% savings)
- Objects not accessed for 90 days → Archive Instant Access tier (68% savings)
- Opt-in archive tiers (enabled via configuration, thresholds adjustable): Archive Access after 90+ days and Deep Archive Access after 180+ days, for savings of up to 95%
Enable Intelligent-Tiering:
resource "aws_s3_bucket_intelligent_tiering_configuration" "user_uploads" {
  bucket = aws_s3_bucket.user_uploads.id
  name   = "entire-bucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
Cost: Small monitoring fee ($0.0025 per 1,000 objects), but savings outweigh cost for most workloads.
See File Storage (S3) for comprehensive S3 patterns.
EBS Snapshot Cleanup
Problem: EBS snapshots accumulate over time. Old snapshots (no longer needed) cost $0.05/GB-month.
Solution: Automate snapshot deletion.
Identify old snapshots:
# Find snapshots older than 90 days
aws ec2 describe-snapshots \
--owner-ids self \
--query "Snapshots[?StartTime<='$(date -u -d '90 days ago' +%Y-%m-%d)'].[SnapshotId,StartTime,VolumeSize]" \
--output table
Automated cleanup with Lambda:
# lambda_snapshot_cleanup.py
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """Delete EBS snapshots older than 90 days that no AMI references."""
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=90)

    # describe_snapshots is paginated; collect every page
    snapshots = []
    paginator = ec2.get_paginator('describe_snapshots')
    for page in paginator.paginate(OwnerIds=['self']):
        snapshots.extend(page['Snapshots'])

    deleted_count = 0
    freed_gb = 0
    for snapshot in snapshots:
        start_time = snapshot['StartTime']  # timezone-aware UTC
        if start_time < cutoff_date:
            snapshot_id = snapshot['SnapshotId']
            volume_size = snapshot['VolumeSize']
            # Skip snapshots still referenced by an AMI
            images = ec2.describe_images(
                Filters=[{'Name': 'block-device-mapping.snapshot-id',
                          'Values': [snapshot_id]}]
            )
            if not images['Images']:  # Not used by any AMI
                print(f"Deleting snapshot {snapshot_id} "
                      f"({volume_size} GB, created {start_time})")
                ec2.delete_snapshot(SnapshotId=snapshot_id)
                deleted_count += 1
                freed_gb += volume_size

    print(f"Deleted {deleted_count} snapshots, freed {freed_gb} GB")
    print(f"Estimated monthly savings: ${freed_gb * 0.05:.2f}")
    return {'deletedCount': deleted_count, 'freedGB': freed_gb}
Schedule this Lambda weekly with EventBridge.
Database Cost Optimization
Databases (RDS, Aurora, DynamoDB) are expensive. Optimization focuses on right-sizing, using appropriate instance types, and serverless options.
RDS Right-Sizing
Problem: Over-provisioned RDS instances (e.g., db.r6g.2xlarge when db.r6g.large is sufficient).
Solution: Monitor CloudWatch metrics and downsize.
Key metrics:
- CPUUtilization: should average 50-70% (leaving headroom for spikes)
- DatabaseConnections: track connection pool usage
- ReadLatency / WriteLatency: ensure queries stay fast after downsizing
- FreeableMemory: memory headroom (keep a buffer)
Query CloudWatch metrics:
# Get average CPU for RDS instance over last 7 days
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=prod-db \
--start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average
If average CPU < 30% consistently: Downsize instance.
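A rough way to encode that rule (the thresholds are this document's heuristics, not AWS guidance):

```python
def should_downsize(avg_cpu_pct, peak_cpu_pct, free_memory_pct):
    """Heuristic downsize check: sustained low average CPU, a peak that
    still leaves room one size down, and memory that isn't the bottleneck."""
    return avg_cpu_pct < 30 and peak_cpu_pct < 60 and free_memory_pct > 40

print(should_downsize(avg_cpu_pct=12, peak_cpu_pct=25, free_memory_pct=70))  # True
print(should_downsize(avg_cpu_pct=12, peak_cpu_pct=85, free_memory_pct=70))  # False
```

The peak check matters: an instance that averages 12% CPU but regularly spikes to 85% may still need its current size.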
Aurora Serverless v2
Problem: RDS/Aurora instances run 24/7 even during low-traffic periods (nights, weekends).
Solution: Aurora Serverless v2 automatically scales capacity based on load.
How it works:
- Define min/max Aurora Capacity Units (ACUs)
- Aurora scales up during traffic spikes (seconds)
- Aurora scales down during low traffic (saves cost)
- Pay only for ACUs consumed (per-second billing)
Example configuration:
resource "aws_rds_cluster" "serverless" {
  cluster_identifier = "prod-aurora-serverless"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned" # Serverless v2 uses provisioned mode
  engine_version     = "14.6"
  database_name      = "appdb"
  master_username    = "admin"
  master_password    = random_password.db_password.result

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # 0.5 ACU = ~1 GB RAM (minimum)
    max_capacity = 16  # 16 ACU = ~32 GB RAM (max for traffic spikes)
  }

  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"
  vpc_security_group_ids  = [aws_security_group.aurora.id]
  db_subnet_group_name    = aws_db_subnet_group.aurora.name
}

resource "aws_rds_cluster_instance" "serverless" {
  cluster_identifier = aws_rds_cluster.serverless.id
  instance_class     = "db.serverless" # Serverless instance class
  engine             = aws_rds_cluster.serverless.engine
}
Cost comparison (illustrative, assuming ~$0.12 per ACU-hour; verify current regional pricing):
Traditional RDS: db.r6g.large (24/7) ≈ $200/month
Aurora Serverless v2:
- Idle at the 0.5 ACU floor, 16 hours/day: 0.5 * 16 * 30 * $0.12 = $28.80
- Business hours averaging 2 ACU, 8 hours/day: 2 * 8 * 30 * $0.12 = $57.60
Total: ~$86/month (~57% savings)
Best for: Applications with variable load (e.g., business hours traffic, weekend downtime).
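Estimating the serverless bill is straightforward: ACU-hours times the rate. A sketch assuming ~$0.12 per ACU-hour (an assumed us-east-1 list price; verify for your region):

```python
ACU_HOUR_PRICE = 0.12  # USD per ACU-hour -- assumed rate, check current pricing

def aurora_serverless_monthly(usage_profile, days=30):
    """usage_profile: list of (average_acus, hours_per_day) segments."""
    acu_hours_per_day = sum(acus * hours for acus, hours in usage_profile)
    return acu_hours_per_day * days * ACU_HOUR_PRICE

# Idle at the 0.5 ACU floor for 16 h/day, ~2 ACUs over an 8-hour business day
cost = aurora_serverless_monthly([(0.5, 16), (2, 8)])
print(f"${cost:.2f}/month")  # → $86.40/month
```

The break-even question is simply whether your average ACU consumption, priced per hour, undercuts the equivalent always-on instance.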
See AWS Databases for Aurora configuration details.
DynamoDB On-Demand vs Provisioned
DynamoDB pricing models:
- Provisioned: Pre-allocate read/write capacity units (RCU/WCU). Pay for provisioned capacity regardless of usage.
- On-Demand: Pay per request (read/write). No capacity planning.
When to use each:
| Workload Pattern | Recommended Mode | Reason |
|---|---|---|
| Predictable, steady traffic | Provisioned | Cheaper per request |
| Unpredictable, spiky traffic | On-Demand | No over-provisioning |
| New application (unknown traffic) | On-Demand | No capacity planning |
| High, consistent throughput | Provisioned with auto-scaling | Most cost-effective |
Example cost comparison:
Workload: 10 million reads/month, 5 million writes/month
Provisioned (with auto-scaling):
- 10M reads / 2.5M seconds/month = 4 RCU average
- 5M writes / 2.5M seconds/month = 2 WCU average
- Provision 10 RCU, 5 WCU (headroom for spikes)
Cost: ((10 + 5) CU * $0.00065/CU-hour) * 730 hours ≈ $7.12/month
On-Demand:
- 10M reads * $0.25/million = $2.50
- 5M writes * $1.25/million = $6.25
Cost: $8.75/month
Conclusion: Provisioned cheaper for predictable workload
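The same comparison as code, using the illustrative rates from the example above ($0.00065 per provisioned capacity-unit-hour for both RCUs and WCUs, $0.25 per million on-demand reads, $1.25 per million writes; real RCU and WCU prices differ, so treat this as a sketch):

```python
CU_HOUR = 0.00065        # illustrative price per provisioned RCU- or WCU-hour
OD_READ_PER_M = 0.25     # on-demand reads, USD per million requests
OD_WRITE_PER_M = 1.25    # on-demand writes, USD per million requests
HOURS_PER_MONTH = 730

def provisioned_cost(rcu, wcu):
    """Monthly cost of provisioned capacity (headroom included in rcu/wcu)."""
    return (rcu + wcu) * CU_HOUR * HOURS_PER_MONTH

def on_demand_cost(reads, writes):
    """Monthly cost of on-demand request billing."""
    return reads / 1e6 * OD_READ_PER_M + writes / 1e6 * OD_WRITE_PER_M

print(f"Provisioned (10 RCU, 5 WCU): ${provisioned_cost(10, 5):.2f}")
print(f"On-demand (10M reads, 5M writes): ${on_demand_cost(10e6, 5e6):.2f}")
```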
Switch between modes:
resource "aws_dynamodb_table" "users" {
  name         = "users"
  billing_mode = "PAY_PER_REQUEST" # On-Demand

  # hash_key and a matching attribute block are required by this resource
  # (key name here is illustrative)
  hash_key = "user_id"

  attribute {
    name = "user_id"
    type = "S"
  }

  # Change to provisioned when traffic stabilizes:
  # billing_mode   = "PROVISIONED"
  # read_capacity  = 10
  # write_capacity = 5
}
Network Cost Optimization
Data transfer is a hidden cost driver. Traffic within a region is cheap; cross-region and internet traffic is expensive.
Data Transfer Costs (Approximate)
| Transfer Type | Cost | Notes |
|---|---|---|
| Within AZ | Free | Same AZ, private IP |
| Between AZs | $0.01/GB | Cross-AZ in same region |
| Between Regions | $0.02/GB | Cross-region transfer |
| To Internet | $0.09/GB (first 10 TB) | Egress charges |
| From Internet | Free | Ingress is free |
Key insight: Internet egress is expensive. Minimize by using CloudFront (CDN) for static assets.
VPC Endpoints (Avoid NAT Gateway Costs)
Problem: Instances in private subnets access S3/DynamoDB via NAT Gateway. NAT Gateway charges $0.045/GB processed + $0.045/hour.
Solution: VPC endpoints provide direct access to AWS services without NAT Gateway.
Cost savings:
Traffic: 1 TB/month to S3
Without VPC endpoint: 1000 GB * $0.045 = $45/month
With VPC endpoint: $0/month
Savings: $45/month = $540/year
Create VPC endpoints:
# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.aws_region}.s3"

  route_table_ids = [
    aws_route_table.private.id,
  ]
}

# DynamoDB Gateway Endpoint (free)
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.aws_region}.dynamodb"

  route_table_ids = [
    aws_route_table.private.id,
  ]
}
See AWS Networking for VPC endpoint configuration.
CloudFront for Static Assets
Problem: Serving static assets (images, JavaScript, CSS) directly from S3 incurs egress charges ($0.09/GB).
Solution: Use CloudFront (CDN) to cache content at edge locations. CloudFront egress is cheaper ($0.085/GB, with volume discounts).
Additional benefits:
- Faster load times (content served from edge locations near users)
- Reduced S3 requests (CloudFront caches reduce origin load)
Cost comparison:
Traffic: 10 TB/month static assets
Direct from S3:
- 10,000 GB * $0.09/GB = $900/month
Via CloudFront:
- 10,000 GB * $0.085/GB = $850/month (first 10 TB pricing)
- Plus reduced S3 requests (CloudFront cache hit ratio 80%+)
Savings: ~$50/month + request cost savings
See CloudFront and CDN for caching strategies and File Storage (S3) for S3 integration.
Serverless Cost Optimization
Lambda, API Gateway, and other serverless services bill per use. Optimization focuses on efficient code and configuration.
Lambda Memory Tuning
Lambda pricing:
- GB-second: Memory allocated × execution time
- Requests: $0.20 per 1 million requests
Memory affects both cost and performance: More memory = more CPU → faster execution. Optimal memory balances cost and speed.
Example (duration cost only, assuming ~$0.0000166667 per GB-second; the $0.20-per-million request charge is identical across configurations):
| Memory | Execution Time | Duration Cost per Invocation |
|---|---|---|
| 128 MB | 1000 ms | $0.00000208 |
| 256 MB | 500 ms | $0.00000208 |
| 512 MB | 300 ms | $0.00000250 |
| 1024 MB | 200 ms | $0.00000333 |
Observations:
- 256 MB costs the same as 128 MB (double the memory, half the duration) and finishes twice as fast
- 512 MB is slightly more expensive
- 1024 MB is the most expensive (diminishing returns)
Optimal: 256 MB for this workload.
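Duration cost follows directly from Lambda's GB-second billing. A sketch assuming ~$0.0000166667 per GB-second (x86 on-demand rate at the time of writing; request charges excluded):

```python
GB_SECOND = 0.0000166667  # USD per GB-second -- assumed rate, verify pricing

def duration_cost(memory_mb, duration_ms):
    """Duration cost of a single invocation (request charge excluded)."""
    return (memory_mb / 1024) * (duration_ms / 1000) * GB_SECOND

# Measured duration at each memory size (ms), as observed for this workload
profiles = {128: 1000, 256: 500, 512: 300, 1024: 200}
for mb, ms in profiles.items():
    print(f"{mb:>5} MB: ${duration_cost(mb, ms):.9f} per invocation")
# 128 MB and 256 MB tie on cost; prefer 256 MB -- same price, half the latency
```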
Tool: AWS Lambda Power Tuning
Automates testing different memory configurations:
# Deploy Lambda Power Tuning (one-time setup)
aws cloudformation deploy \
--template-file power-tuning.yaml \
--stack-name lambda-power-tuning \
--capabilities CAPABILITY_IAM
# Run power tuning for a function
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:powerTuning \
--input '{
"lambdaARN": "arn:aws:lambda:us-east-1:123456789:function:my-function",
"powerValues": [128, 256, 512, 1024, 1536, 3008],
"num": 50
}'
Returns optimal memory configuration for cost and performance.
Provisioned Concurrency vs On-Demand
Lambda cold starts: First invocation of a function takes longer (initialize runtime, load code). Subsequent invocations (warm starts) are fast.
Provisioned concurrency: Keep N instances warm at all times. Eliminates cold starts but costs money even when idle.
When to use:
- On-demand: Most workloads (cold starts acceptable, cost-sensitive)
- Provisioned concurrency: Latency-critical APIs where cold starts impact UX (e.g., sub-100ms response time requirements)
Cost comparison (illustrative; the on-demand figure counts request charges only, and duration charges apply under both modes):
Function: 100 requests/second during business hours (8 hours/day), 1024 MB
On-Demand:
- 100 req/s * 8 hours * 3600 s = 2.88M requests/day
- 2.88M * $0.20/million = $0.58/day ≈ $17.40/month
- Cold starts: a handful per hour (acceptable for many workloads)
Provisioned Concurrency (10 instances of 1 GB):
- Always-on: 10 instances * 1 GB * 730 hours = 7,300 GB-hours
- 7,300 GB-hours * ~$0.015/GB-hour ≈ $110/month
- Plus the same request costs: $17.40/month
Total: ≈ $127/month (roughly 7x the on-demand request bill)
Conclusion: Only use provisioned concurrency when cold starts are unacceptable.
See AWS Compute for Lambda optimization patterns.
Budgets and Alerts
Prevent surprise bills with proactive budget alerts.
AWS Budgets
Create budget:
AWS Console → Billing → Budgets → Create Budget
Budget types:
- Cost budget: Alert when spending exceeds threshold
- Usage budget: Alert when usage (e.g., EC2 hours) exceeds threshold
- Reservation budget: Alert when RI/Savings Plan utilization drops below target
Example budget configuration:
resource "aws_budgets_budget" "monthly_cost" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "10000" # $10,000/month
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80 # Alert at 80% ($8,000)
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [
      "[email protected]",
      "[email protected]",
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100 # Alert at 100% ($10,000)
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [
      "[email protected]",
      "[email protected]",
      "[email protected]",
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90 # Forecast alert
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED" # Based on trend
    subscriber_email_addresses = [
      "[email protected]",
    ]
  }
}
Budget actions (automated responses):
resource "aws_budgets_budget_action" "stop_ec2_on_overspend" {
  budget_name        = aws_budgets_budget.monthly_cost.name
  action_type        = "RUN_SSM_DOCUMENTS"
  approval_model     = "AUTOMATIC"
  notification_type  = "ACTUAL"
  execution_role_arn = aws_iam_role.budget_action.arn

  action_threshold {
    action_threshold_type  = "PERCENTAGE"
    action_threshold_value = 100
  }

  definition {
    ssm_action_definition {
      action_sub_type = "STOP_EC2_INSTANCES"
      region          = "us-east-1"
      instance_ids    = [aws_instance.dev.id] # Stop dev instances
    }
  }

  # A subscriber block is required by this resource
  subscriber {
    address           = "[email protected]"
    subscription_type = "EMAIL"
  }
}
This automatically stops dev instances when budget is exceeded (prevents runaway costs).
Cost Anomaly Detection
AWS uses machine learning to detect unusual spending patterns:
AWS Console → Billing → Cost Anomaly Detection → Create Monitor
Example: Your typical daily S3 spend is $50. One day it jumps to $500 (10x increase). Cost Anomaly Detection alerts you immediately.
Enable alerts:
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "service-cost-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE" # Monitor per AWS service
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "cost-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service_monitor.arn,
  ]

  subscriber {
    type    = "EMAIL"
    address = "[email protected]"
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"] # Alert if anomaly > $100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
FinOps Practices
FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending. It combines technology, business, and finance teams to optimize costs.
Cost Ownership
Principle: Teams that create cloud resources own their costs.
Implementation:
- Tag all resources with team/project/cost center
- Show teams their spending (dashboards, monthly reports)
- Set team budgets and hold teams accountable
- Incentivize optimization (reward teams for reducing costs while maintaining performance)
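Beyond dashboards, the same per-team numbers can be pulled programmatically. A minimal sketch against the Cost Explorer GetCostAndUsage API, assuming a "team" cost allocation tag has been activated in the billing console (the tag key and date range are illustrative):

```python
# team_spend.py - sketch: sum unblended cost per value of a "team" cost
# allocation tag. Tag key and date range are assumptions for illustration.

def summarize_by_team(response):
    """Flatten a GetCostAndUsage response into {team: total_cost}."""
    totals = {}
    for period in response['ResultsByTime']:
        for group in period['Groups']:
            # Group keys look like "team$platform"; strip the "team$" prefix.
            team = group['Keys'][0].split('$', 1)[-1] or 'untagged'
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            totals[team] = totals.get(team, 0.0) + cost
    return totals

def fetch_team_spend(start, end):
    """Query Cost Explorer for monthly cost grouped by the "team" tag."""
    import boto3  # imported lazily so summarize_by_team() is testable offline
    ce = boto3.client('ce')
    response = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},  # e.g. '2024-01-01' / '2024-07-01'
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'TAG', 'Key': 'team'}],
    )
    return summarize_by_team(response)
```

Pagination via NextPageToken is omitted for brevity; long date ranges with many tag values may need it.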
Cost dashboard example (QuickSight):
```sql
-- Query for team spending dashboard
SELECT
    resource_tags_user_team AS team,
    DATE_TRUNC('month', line_item_usage_start_date) AS month,
    SUM(line_item_unblended_cost) AS total_cost
FROM
    cost_and_usage_report
WHERE
    line_item_usage_start_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6' MONTH)
GROUP BY
    1, 2
ORDER BY
    month DESC,
    total_cost DESC;
```
Display this in a dashboard accessible to all teams. Visibility drives accountability.
Regular Optimization Reviews
Run a quarterly cost optimization review: pull the quarter's spend by service and team, walk through Trusted Advisor and Compute Optimizer findings, and track savings from the previous review's action items.
Assign ownership: the platform/DevOps team leads the review, but every engineering team participates.
Cost-Aware Architecture
Design principle: Consider cost implications during architectural decisions.
Examples:
- Choose appropriate storage: Use S3 Glacier Deep Archive for long-term archives, not S3 Standard ($0.00099/GB-month vs $0.023/GB-month)
- Minimize data transfer: Keep compute and data in same region/AZ
- Use serverless for variable workloads: Lambda for sporadic tasks, not EC2 running 24/7
- Cache aggressively: CloudFront, ElastiCache reduce origin load and data transfer
- Design for auto-scaling: Don't run fixed capacity; scale to demand
Include cost estimation in architectural design reviews. Ask: "What will this cost at 10x scale?"
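To make the "10x scale" question concrete, even a back-of-the-envelope script helps in a design review. A sketch using the simplified per-GB-month prices cited above (verify against current pricing; the archive size is a hypothetical example):

```python
# storage_cost_estimate.py - rough monthly S3 storage cost at 1x and 10x scale.
# Prices are the simplified US East figures cited above; verify before relying on them.
PRICE_PER_GB_MONTH = {
    's3_standard': 0.023,
    's3_glacier_deep_archive': 0.00099,
}

def monthly_cost(gb, storage_class):
    """Storage-only monthly cost; ignores requests, retrieval, and transfer."""
    return gb * PRICE_PER_GB_MONTH[storage_class]

archive_gb = 5_000  # hypothetical archive dataset size
for scale in (1, 10):
    gb = archive_gb * scale
    std = monthly_cost(gb, 's3_standard')
    deep = monthly_cost(gb, 's3_glacier_deep_archive')
    print(f"{gb:>7,} GB: Standard ${std:,.2f}/mo vs Deep Archive ${deep:,.2f}/mo "
          f"(saves ${std - deep:,.2f}/mo)")
```

Note that Deep Archive adds retrieval fees and 12+ hour restore times, so the comparison only holds for rarely accessed data.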
Tools for Cost Optimization
AWS Compute Optimizer
Recommends optimal instance types, EBS volumes, and Lambda configurations based on utilization metrics.
Access: AWS Console → Compute Optimizer
Automate with AWS CLI:
```bash
# Get EC2 recommendations
aws compute-optimizer get-ec2-instance-recommendations

# Get Lambda recommendations
aws compute-optimizer get-lambda-function-recommendations

# Get EBS recommendations
aws compute-optimizer get-ebs-volume-recommendations
```
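The same recommendations are available through boto3. A sketch that filters for over-provisioned instances; the finding value is normalized to tolerate both the "Overprovisioned" and "OVER_PROVISIONED" spellings seen in documentation and responses, and the parsing helper is pure so it can be tested without AWS access:

```python
# rightsizing_report.py - sketch: flag over-provisioned instances from
# Compute Optimizer with the top-ranked replacement suggestion.

def over_provisioned(recommendations):
    """Return (current_type, suggested_type) for over-provisioned findings."""
    flagged = []
    for rec in recommendations:
        # Normalize e.g. "OVER_PROVISIONED" / "Overprovisioned" to one form
        finding = rec.get('finding', '').replace('_', '').upper()
        if finding == 'OVERPROVISIONED':
            options = rec.get('recommendationOptions', [])
            suggested = options[0]['instanceType'] if options else None
            flagged.append((rec['currentInstanceType'], suggested))
    return flagged

def fetch_flagged():
    import boto3  # imported lazily so over_provisioned() is testable offline
    client = boto3.client('compute-optimizer')
    page = client.get_ec2_instance_recommendations()
    return over_provisioned(page['instanceRecommendations'])
```

Pagination via nextToken is omitted for brevity; accounts with many instances will need it.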
AWS Trusted Advisor
Provides checks across cost, performance, security, fault tolerance, and service limits.
Cost checks include:
- Idle RDS instances
- Underutilized EC2 instances
- Unassociated Elastic IPs
- Low-utilization EBS volumes
- Idle load balancers
Access: AWS Console → Trusted Advisor
Automate checks:
```python
# trusted_advisor_checks.py
import boto3

# The Support API (which backs Trusted Advisor) is only available in us-east-1
support = boto3.client('support', region_name='us-east-1')

def get_cost_optimization_checks():
    """Print Trusted Advisor cost optimization recommendations."""
    # Get all checks
    checks = support.describe_trusted_advisor_checks(language='en')

    # Filter for the cost optimization category
    cost_checks = [c for c in checks['checks'] if c['category'] == 'cost_optimizing']

    for check in cost_checks:
        result = support.describe_trusted_advisor_check_result(
            checkId=check['id'],
            language='en'
        )
        flagged_resources = result['result'].get('flaggedResources', [])
        if flagged_resources:
            print(f"\nCheck: {check['name']}")
            print(f"Description: {check['description']}")
            print(f"Flagged Resources: {len(flagged_resources)}")
            for resource in flagged_resources[:5]:  # Show first 5
                print(f"  - {resource.get('resourceId')}: {resource.get('status')}")

if __name__ == '__main__':
    get_cost_optimization_checks()
```
Full Trusted Advisor checks require a Business or Enterprise support plan; lower-tier plans include only a limited set of checks.
Third-Party Tools
- CloudHealth (VMware): Multi-cloud cost management, showback/chargeback
- Cloudability (Apptio): Cost analytics, budgeting, forecasting
- Spot.io: Auto-scaling optimization, Spot instance management
- Kubecost: Kubernetes-specific cost monitoring and optimization
Common Cost Pitfalls
Unused Resources
Problem: Resources created then forgotten.
Examples:
- Stopped EC2 instances (still pay for EBS volumes)
- Unused Elastic IPs ($0.005/hour ≈ $3.65/month per IP)
- Idle RDS instances (dev/test databases running 24/7)
- Orphaned EBS volumes (detached from deleted instances)
- Old snapshots (never cleaned up)
Detection:
```bash
# Find stopped instances with EBS volumes
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query "Reservations[].Instances[].[InstanceId,State.Name,BlockDeviceMappings[].Ebs.VolumeId]"

# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query "Volumes[].[VolumeId,Size,CreateTime]"

# Find idle Elastic IPs
aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].[PublicIp,AllocationId]"
```
Automated cleanup: Tag resources with AutoDelete: true and run Lambda to delete after N days inactive.
Over-Provisioning
Problem: "Better safe than sorry" mentality leads to massive over-provisioning.
Example: Provisioning db.r6g.4xlarge (16 vCPU, 128 GB RAM) for a database that uses 2 vCPU and 16 GB RAM. 8x over-provisioned → wasting ~$600/month.
Fix: Start small, monitor, scale up if needed. It's easier to scale up than justify downsizing.
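Grounding right-sizing decisions in utilization data makes the downsizing case much easier to argue. A sketch that flags instances whose recent average CPU sits below a threshold; the 10% bar and 14-day window are assumptions to tune per workload:

```python
# idle_check.py - sketch: flag an instance as a right-sizing candidate if its
# average CPU over the lookback window is below a threshold. The 10% threshold
# and 14-day window are assumptions; tune them per workload.
from datetime import datetime, timedelta

def is_underutilized(datapoints, threshold_pct=10.0):
    """True if the mean of CloudWatch 'Average' datapoints is below threshold."""
    if not datapoints:
        return False  # no data: don't flag (instance may be brand new)
    mean = sum(dp['Average'] for dp in datapoints) / len(datapoints)
    return mean < threshold_pct

def check_instance(instance_id):
    import boto3  # imported lazily so is_underutilized() is testable offline
    cw = boto3.client('cloudwatch')
    stats = cw.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=14),
        EndTime=datetime.utcnow(),
        Period=3600,  # hourly datapoints
        Statistics=['Average'],
    )
    return is_underutilized(stats['Datapoints'])
```

CPU alone can mislead (memory-bound workloads need the CloudWatch agent's memory metrics), so treat a flag as a prompt to investigate, not an automatic downsize.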
Ignoring Data Transfer
Problem: Not realizing data transfer costs until bill arrives.
Examples:
- Cross-region replication without understanding cost ($0.02/GB)
- Serving videos directly from S3 instead of CloudFront
- NAT Gateway for S3 access (use VPC endpoints)
Fix: Design data flows to minimize transfer. Keep compute and data co-located.
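The NAT Gateway pitfall is easy to quantify, since a gateway VPC endpoint for S3 carries no hourly or per-GB charge while a NAT Gateway bills both. A rough comparison using common US East list prices (assumptions; check current pricing):

```python
# nat_vs_endpoint.py - rough monthly cost of routing S3 traffic through a
# NAT Gateway vs a (free) S3 gateway VPC endpoint. Prices are assumed
# US East list prices; verify before relying on them.
NAT_HOURLY = 0.045        # $/hour per NAT Gateway
NAT_PER_GB = 0.045        # $/GB processed
HOURS_PER_MONTH = 730

def nat_monthly_cost(gb_per_month, gateways=1):
    """Hourly charge plus per-GB processing for traffic through NAT."""
    return gateways * NAT_HOURLY * HOURS_PER_MONTH + gb_per_month * NAT_PER_GB

for gb in (100, 1_000, 10_000):
    print(f"{gb:>6,} GB/mo via NAT Gateway: ${nat_monthly_cost(gb):,.2f}  "
          f"via S3 gateway endpoint: $0.00")
```

At 10 TB/month the NAT path costs hundreds of dollars for traffic a gateway endpoint would carry for free.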
Not Using Reserved Capacity
Problem: Running production workloads on-demand for years.
Example: $10,000/month EC2 on-demand → could be $6,000/month with Savings Plans → wasting $4,000/month = $48,000/year.
Fix: Review Cost Explorer, identify consistent workloads, purchase Savings Plans.
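The arithmetic above generalizes into a quick "money left on the table" check. A sketch, with the ~40% discount as an illustrative assumption (real Savings Plans rates vary by term, payment option, and instance family):

```python
# savings_gap.py - annualize the gap between on-demand spend and a committed
# rate. The 40% default discount is illustrative only; look up actual rates
# in the Savings Plans console before committing.
def annual_waste(monthly_on_demand, discount=0.40):
    """Dollars per year left unrealized by staying on-demand."""
    return monthly_on_demand * discount * 12

# The $10,000/month example above works out to roughly $48,000/year.
print(f"${annual_waste(10_000):,.0f}/year")
```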
Forgetting to Clean Up Test Environments
Problem: Test infrastructure left running after testing completes.
Example: Load testing creates 50 EC2 instances. Test finishes, instances forgotten. $2,000/month waste.
Fix: Tag test resources with Environment: test and TTL: 2024-01-15. Lambda deletes resources after TTL.
```python
# lambda_ttl_cleanup.py
import boto3
from datetime import datetime

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    """Terminate instances whose TTL (Time To Live) tag has expired."""
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag-key', 'Values': ['TTL']},
            # Skip instances already shutting down or terminated
            {'Name': 'instance-state-name', 'Values': ['running', 'stopped']},
        ]
    )
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            ttl_tag = next(
                (t['Value'] for t in instance.get('Tags', []) if t['Key'] == 'TTL'),
                None
            )
            if ttl_tag:
                try:
                    ttl_date = datetime.fromisoformat(ttl_tag)
                except ValueError:
                    print(f"Skipping {instance_id}: unparseable TTL '{ttl_tag}'")
                    continue
                if datetime.now() > ttl_date:
                    print(f"Terminating {instance_id} (TTL expired: {ttl_tag})")
                    ec2.terminate_instances(InstanceIds=[instance_id])
```
Summary
AWS cost optimization is a continuous process combining visibility, accountability, and architectural decisions:
- Visibility: Use Cost Explorer, cost allocation tags, and Cost and Usage Reports to understand spending
- Compute: Right-size instances, use Savings Plans/Reserved Instances, leverage Spot instances, auto-scale
- Storage: Implement S3 lifecycle policies, use Intelligent-Tiering, clean up snapshots
- Database: Right-size RDS, use Aurora Serverless for variable workloads, optimize DynamoDB billing mode
- Network: Use VPC endpoints, CloudFront for static assets, minimize cross-region transfers
- Serverless: Tune Lambda memory, avoid provisioned concurrency unless necessary
- Budgets: Set up AWS Budgets and Cost Anomaly Detection for proactive alerts
- FinOps: Establish cost ownership, regular optimization reviews, cost-aware architecture
Cost optimization is not a destination but a practice. Set up quarterly reviews, automate cleanup of waste, and make cost a factor in architectural decisions.
Further Reading
- AWS Compute Services - EC2, ECS, Lambda right-sizing and optimization
- AWS Databases - RDS, Aurora, DynamoDB optimization strategies
- AWS Storage - EBS, EFS optimization
- File Storage (S3) - S3 lifecycle policies and storage classes
- AWS Networking - VPC endpoints, data transfer optimization
- CloudFront and CDN - CDN for reducing egress costs
- Terraform Best Practices - IaC for consistent tagging
- AWS Well-Architected Framework - Cost Optimization Pillar