AWS Overview and Foundation

Overview

Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. This guide provides foundational knowledge for building production applications on AWS, covering infrastructure concepts, account organization, cost management, and architectural best practices.

Understanding AWS fundamentals is essential before implementing specific services. This document establishes core concepts that apply across all AWS services - global infrastructure, account structure, the shared responsibility model, and operational principles that guide decision-making.

AWS's breadth can be overwhelming. Start with foundational services (compute, storage, networking, databases) and expand to specialized services as needs arise. This guide focuses on concepts relevant to enterprise application development rather than attempting comprehensive service coverage.

Core Principles

  • Global infrastructure: Deploy applications in multiple geographic regions with built-in redundancy through availability zones
  • Well-Architected Framework: Design systems using six pillars - operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability
  • Shared responsibility: AWS secures infrastructure; you secure your applications, data, and access controls
  • API-driven automation: Every AWS service exposes APIs enabling Infrastructure as Code and programmatic management
  • Pay-for-use pricing: Pay only for resources consumed with no upfront commitments, enabling cost optimization through right-sizing

AWS Global Infrastructure

AWS operates the largest global cloud infrastructure, purpose-built for high availability, fault tolerance, and low latency. Understanding this infrastructure is critical for architecting resilient applications.

Regions

A region is a physical location around the world where AWS clusters data centers. Each region is completely independent and isolated from other regions.

Key characteristics:

  • Geographic isolation: Regions are separated by significant distances (typically hundreds of miles) to protect against natural disasters, power outages, or regional events
  • Data sovereignty: Data stored in a region stays in that region unless you explicitly replicate it elsewhere, supporting compliance with data residency laws
  • Service availability: Not all AWS services are available in all regions; newer services typically launch in larger regions first
  • Pricing variation: Costs differ by region based on local infrastructure costs, energy prices, and market conditions
  • Independent control planes: Each region has separate API endpoints and management infrastructure

Region naming convention:

  • Format: <geographic-area>-<sub-region>-<number>
  • Examples: us-east-1 (Northern Virginia), eu-west-1 (Ireland), ap-southeast-2 (Sydney)
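As an illustration of this convention, a small helper (hypothetical, not part of any AWS SDK) can split a region code into its components:

```python
# Hypothetical helper illustrating the <geographic-area>-<sub-region>-<number>
# region naming convention; not part of any AWS SDK.
def parse_region(region: str) -> dict:
    """Split an AWS region code like 'ap-southeast-2' into its components."""
    parts = region.split("-")
    return {
        "geography": parts[0],                # e.g. 'us', 'eu', 'ap'
        "sub_region": "-".join(parts[1:-1]),  # e.g. 'east', 'southeast'
        "number": int(parts[-1]),             # e.g. 1, 2
    }

print(parse_region("ap-southeast-2"))
# {'geography': 'ap', 'sub_region': 'southeast', 'number': 2}
```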

Selecting regions:

When choosing regions, consider:

  1. Latency to end users: Deploy close to your user base for low latency. A user in Tokyo will experience significantly faster response times from ap-northeast-1 than from us-east-1
  2. Compliance requirements: Regulations may mandate data remain in specific jurisdictions (GDPR in EU, data localization laws in China, financial regulations)
  3. Service availability: Ensure required services are available in target regions
  4. Cost: Prices vary by region; us-east-1 is typically cheapest but may not be optimal for your users
  5. Disaster recovery: Choose secondary regions for failover that are geographically distant but still accessible for operations

Multi-region architecture:

Most applications start in a single region. Multi-region deployment adds significant complexity - data replication, routing, failover logic, and operational overhead. Adopt multi-region only when:

  • You have a global user base requiring low latency worldwide
  • Business continuity requirements demand protection against regional outages
  • Compliance requires data in multiple geographies

See Cloud Fundamentals for general multi-region patterns and our AWS-specific cell-based architecture guide for advanced multi-region designs.

Availability Zones (AZs)

An availability zone is one or more discrete data centers within a region, each with redundant power, networking, and connectivity. AZs are the primary mechanism for achieving high availability within AWS.

Key characteristics:

  • Physical separation: Each AZ is housed in a separate facility with independent infrastructure to prevent single points of failure
  • Low-latency connectivity: AZs within a region are connected via high-bandwidth, low-latency private fiber optic networking (typically < 2ms round-trip)
  • Synchronous replication: Low enough latency for synchronous database replication without performance degradation
  • Separate failure domains: Designed so failures (power, cooling, network) in one AZ don't cascade to others
  • Redundant connectivity: Multiple network paths to the internet and AWS backbone

Practical implications:

Most regions have at least three AZs (some have six or more); a few older regions expose fewer to new accounts. When you launch resources, you specify which AZ to use or let AWS distribute them automatically.

High availability design pattern:

The standard pattern for high availability is:

  1. Deploy resources across at least three AZs
  2. Use load balancers to distribute traffic across AZs
  3. Configure databases for multi-AZ automatic failover
  4. Design applications to handle AZ failures gracefully

In this architecture, if one AZ fails completely:

  • The load balancer stops routing traffic to instances in the failed AZ
  • The database automatically promotes a standby in a healthy AZ to primary
  • Users experience no downtime beyond brief connection resets

Cost considerations:

Deploying across AZs is essentially free - compute and storage prices are the same in every AZ. However, data transfer between AZs incurs a small charge (about $0.01/GB in each direction, so roughly $0.02/GB for a round trip). For most applications, this cost is negligible compared to the availability benefits.
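For a rough sense of scale, a back-of-envelope calculation (assuming roughly $0.01/GB charged in each direction, consistent with the range above):

```python
# Back-of-envelope cross-AZ data transfer cost, assuming $0.01/GB
# charged in each direction (so $0.02/GB for a round trip).
def cross_az_monthly_cost(gb_per_month: float, rate_per_gb_each_way: float = 0.01) -> float:
    """Cost of replicating gb_per_month across AZs, billed in both directions."""
    return gb_per_month * rate_per_gb_each_way * 2

# 1 TB/month of synchronous replication traffic:
print(f"${cross_az_monthly_cost(1024):.2f}")  # $20.48
```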

AZ naming inconsistency:

AZ names (us-east-1a, us-east-1b) map to different physical locations for different AWS accounts. This prevents all customers from clustering in "AZ-A." Use AZ IDs (use1-az1) when coordinating across accounts, though this is rarely necessary.

Edge Locations and Content Delivery

Edge locations are AWS points of presence (PoPs) distributed globally for content delivery and network optimization.

Services using edge locations:

  • CloudFront: Content delivery network (CDN) caching static and dynamic content close to users
  • Route 53: DNS service with global distribution for low-latency name resolution
  • AWS Global Accelerator: Network optimization routing traffic through AWS's private network
  • AWS WAF: Web application firewall deployed at edge for DDoS protection

AWS operates over 450 edge locations across 90+ cities worldwide - far more than the 33+ regions. This enables low-latency content delivery even for applications deployed in a single region.

How edge locations work:

When a user requests content served through CloudFront:

  1. Request goes to the nearest edge location
  2. If content is cached (cache hit), edge serves it directly with minimal latency
  3. If not cached (cache miss), edge fetches from origin (your application/storage in a region)
  4. Edge caches content for subsequent requests
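The hit/miss flow above can be sketched as a minimal cache in front of an origin (purely illustrative; not how CloudFront is implemented internally):

```python
# Minimal sketch of the edge-cache flow: serve from cache on a hit,
# fetch from origin and populate the cache on a miss.
class EdgeCache:
    def __init__(self, fetch_from_origin):
        self.cache = {}
        self.fetch_from_origin = fetch_from_origin  # callable: path -> content
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        if path in self.cache:            # cache hit: serve directly
            self.hits += 1
            return self.cache[path]
        self.misses += 1                  # cache miss: go to origin
        content = self.fetch_from_origin(path)
        self.cache[path] = content        # populate for subsequent requests
        return content

edge = EdgeCache(lambda path: f"origin-content-for-{path}")
edge.get("/logo.png")          # miss: fetched from origin
edge.get("/logo.png")          # hit: served from the edge
print(edge.hits, edge.misses)  # 1 1
```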

This pattern dramatically reduces latency for static assets (images, videos, JavaScript, CSS) and reduces load on origin servers. See our CloudFront guide for implementation details.

Wavelength Zones and Local Zones

AWS offers additional infrastructure types for specialized use cases.

Wavelength Zones: Wavelength embeds AWS compute and storage inside telecommunications providers' 5G networks, enabling ultra-low latency for mobile edge applications.

Use cases:

  • AR/VR applications requiring < 10ms latency
  • Real-time gaming
  • Live video streaming with low lag
  • Industrial IoT with immediate response requirements

Local Zones: Local Zones extend AWS regions into metropolitan areas, providing single-digit millisecond latency to nearby users.

Use cases:

  • Media and entertainment workloads requiring low latency rendering
  • Live video streaming and production
  • Real-time multiplayer gaming
  • Hybrid cloud extending on-premises data centers

Most applications don't need Wavelength or Local Zones. Standard multi-AZ deployments in regions provide sufficient availability and performance. Explore these only if you have specific ultra-low-latency requirements.


AWS Account Structure and Organizations

Proper account structure is fundamental to security, cost management, and organizational governance. AWS Organizations enables centralized management of multiple AWS accounts.

Single Account vs Multi-Account Strategy

Single account: Simple to manage but creates challenges as organizations grow:

  • Difficult to isolate environments (dev, staging, production)
  • No blast radius containment - compromised credentials can access everything
  • Complex IAM policies trying to segregate access
  • Mixed billing makes cost allocation difficult

Multi-account strategy: The recommended approach uses separate AWS accounts for different purposes:

  • Improved security: Account boundaries provide strong isolation; compromised credentials in dev can't access production
  • Clear cost allocation: Each account has separate billing, making costs transparent per environment/team
  • Blast radius containment: Resource limits and failures are contained within accounts
  • Regulatory compliance: Separate accounts for PCI/HIPAA workloads simplify audit scope
  • Simplified IAM: Policies within accounts are simpler when you don't need to prevent access to other environments

AWS Organizations

AWS Organizations is a free service that enables centralized management of multiple AWS accounts.

Key features:

  1. Consolidated billing: All accounts roll up to a single bill, enabling volume discounts and unified cost tracking
  2. Service Control Policies (SCPs): Define guardrails that apply across accounts (e.g., "no one can disable CloudTrail logging")
  3. Organizational units (OUs): Group accounts logically (by environment, business unit, or compliance zone) and apply policies hierarchically
  4. Account creation automation: Programmatically create accounts with consistent baselines
  5. Centralized logging: Forward CloudTrail logs and AWS Config data to a central security account

Best practices:

  • Management account minimalism: Use the organization's management account only for billing and account management - don't run workloads there
  • Separate security account: Centralize logs and security tooling in a dedicated account that workload accounts can't modify
  • Environment separation: At minimum, separate dev, staging, and production into different accounts
  • Team-based accounts: For larger organizations, give each team their own AWS account for autonomy
  • Automated provisioning: Use Infrastructure as Code (see Terraform) to create accounts with consistent networking, security, and observability configurations

Service Control Policies (SCPs)

SCPs are permission boundaries applied at the organization or OU level. They define the maximum permissions available in an account - even account administrators can't exceed SCP limits.

Common SCP use cases:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail"
      ],
      "Resource": "*"
    }
  ]
}
```

This SCP prevents anyone in the account from disabling CloudTrail audit logging, even with administrator permissions. This creates a security guardrail that can't be circumvented.

Other valuable SCPs:

  • Restrict deployments to approved regions (compliance)
  • Prevent deletion of encryption keys
  • Require encryption for S3 buckets
  • Block public internet access to sensitive services
  • Enforce tagging standards for cost allocation
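As a sketch of the first item (restricting deployments to approved regions), an SCP along these lines is a commonly documented pattern. Note that global services must be exempted (the `NotAction` list here is illustrative, not exhaustive), so treat this as a starting point rather than a drop-in policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "route53:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
        }
      }
    }
  ]
}
```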

SCPs are a powerful security control. Start with permissive policies and add restrictions as you identify risks. See our IAM guide for detailed policy management.

Multi-Account Cost Allocation

With consolidated billing, you can:

  • Track costs per account (automatically)
  • Apply cost allocation tags across accounts
  • Share reserved instance and savings plan discounts
  • Set budgets per account or organizational unit

Tag resources consistently across accounts with:

  • Environment: dev, staging, production
  • Team: team-name
  • Project: project-identifier
  • CostCenter: for chargeback to business units

See Cost Optimization below for detailed cost management strategies.


AWS Shared Responsibility Model

The shared responsibility model defines which security and compliance tasks AWS handles and which you must implement. This model applies universally but varies by service type.

What AWS Manages

AWS is responsible for securing the underlying infrastructure that runs all services:

Physical and environmental:

  • Data center physical security (access control, surveillance, biometrics)
  • Power and cooling redundancy
  • Hardware lifecycle and disposal (drives destroyed before leaving facilities)
  • Physical network infrastructure

Infrastructure software:

  • Hypervisor security and isolation between customer instances
  • Managed service infrastructure (RDS host OS, Lambda execution environment)
  • Foundation services (EC2, S3, VPC networking hardware)
  • Global network backbone

Compliance infrastructure: AWS maintains certifications (SOC 1/2/3, ISO 27001, PCI-DSS Level 1, FedRAMP, HIPAA) for infrastructure. You inherit these certifications but must configure services correctly to maintain compliance.

What You Manage

You are responsible for securing everything you deploy and configure on AWS infrastructure:

Identity and access management:

  • Creating and managing IAM users, roles, and policies
  • Implementing least privilege access controls
  • Enabling and enforcing MFA for privileged accounts
  • Rotating credentials and access keys regularly
  • Federating identities from corporate directories

See Authorization for access control patterns and our IAM guide for AWS-specific implementation.

Data protection:

  • Classifying data based on sensitivity
  • Encrypting data at rest using KMS or customer-managed keys
  • Encrypting data in transit using TLS
  • Managing encryption keys and rotation policies
  • Implementing backup and retention strategies
  • Secure data disposal when no longer needed

See Data Protection for encryption strategies and Secrets Management for credential handling.

Application security:

  • Writing secure code without vulnerabilities (SQL injection, XSS, CSRF)
  • Managing application dependencies and patching CVEs
  • Implementing input validation and output encoding
  • Securing APIs with authentication and rate limiting
  • Logging security events for audit and incident response

See Security Testing and Input Validation for application security practices.

Network configuration:

  • Designing VPC architecture with proper subnet isolation
  • Configuring security groups and network ACLs correctly
  • Not exposing resources publicly unless necessary
  • Implementing VPC endpoints for private service access
  • Monitoring network traffic with VPC Flow Logs

Our AWS networking guide covers these patterns in detail.

Operating system (for IaaS like EC2):

  • Installing OS security patches
  • Configuring host firewalls
  • Hardening OS configurations
  • Managing local users and SSH keys
  • Installing antivirus/intrusion detection if required

Compliance verification:

  • Ensuring your AWS configuration meets regulatory requirements
  • Generating compliance reports and evidence
  • Responding to audits
  • Implementing compensating controls as needed

Why This Matters

A common mistake is assuming AWS secures your applications and data automatically. AWS secures the infrastructure, but misconfigurations are your responsibility.

Real-world examples of customer responsibility failures:

  • Public S3 buckets: AWS provides granular access controls, but customers configure buckets as publicly accessible, exposing sensitive data
  • Overprivileged IAM policies: Granting AdministratorAccess when least privilege requires much narrower permissions
  • Unencrypted databases: AWS offers RDS encryption at rest, but it's not enabled by default - you must enable it
  • No MFA on privileged accounts: AWS supports MFA, but you must enforce it through policies

Security breaches due to customer misconfiguration are common and entirely preventable through proper configuration and governance. Treat AWS security as an active responsibility, not a passive benefit.


AWS Well-Architected Framework

The AWS Well-Architected Framework provides best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. This framework codifies lessons from thousands of customer architectures.

Operational Excellence

Operational excellence focuses on running and monitoring systems to deliver business value and continually improving processes and procedures.

Design principles:

  • Perform operations as code: Define infrastructure, configuration, and runbooks as code to enable versioning, testing, and automation (see Terraform)
  • Make frequent, small, reversible changes: Deploy changes incrementally to reduce blast radius and enable fast rollback
  • Refine operations procedures frequently: Regularly review and improve runbooks, automation, and incident response
  • Anticipate failure: Conduct pre-mortems, game days, and chaos engineering to identify weaknesses (see Chaos Engineering)
  • Learn from operational events: Conduct blameless post-mortems and share lessons across teams (see Incident Post-Mortems)

Key practices:

  • Use Infrastructure as Code for all resources - never click through console for production changes
  • Implement comprehensive observability with logs, metrics, and traces (see Observability)
  • Automate responses to common events (auto-scaling, automated remediation)
  • Test operational procedures regularly - don't wait for production incidents to discover gaps
  • Create dashboards showing business and technical metrics

Operational excellence is about treating operations with the same rigor as software development - versioned, tested, and continuously improved.

Security

The security pillar focuses on protecting information, systems, and assets while delivering business value.

Design principles:

  • Implement a strong identity foundation: Enforce least privilege access with centralized identity management and eliminate long-lived credentials (see Authorization)
  • Enable traceability: Monitor and audit all actions and changes to enable investigation and compliance
  • Apply security at all layers: Defense in depth - secure network, application, data, and infrastructure layers
  • Automate security best practices: Implement software-defined security controls that can be versioned and tested
  • Protect data in transit and at rest: Classify data and use encryption, tokenization, and access control
  • Keep people away from data: Eliminate direct access to data, using automated tools instead
  • Prepare for security events: Have incident response plans, tools, and practiced procedures

Key practices:

  • Enable MFA for all privileged accounts
  • Use IAM roles rather than long-lived access keys
  • Encrypt sensitive data at rest and in transit
  • Enable CloudTrail for audit logging across all accounts
  • Use AWS Config to monitor resource configurations
  • Implement automated compliance checking (AWS Security Hub, custom rules)
  • Conduct regular security assessments and penetration testing

See Security Overview for comprehensive security patterns.

Reliability

Reliability ensures a workload performs its intended function correctly and consistently when expected.

Design principles:

  • Automatically recover from failure: Monitor systems for KPI thresholds and trigger automated recovery
  • Test recovery procedures: Regularly test failover, backup restoration, and disaster recovery plans
  • Scale horizontally: Distribute requests across multiple small resources rather than one large resource
  • Stop guessing capacity: Use auto-scaling based on actual demand
  • Manage change through automation: Use Infrastructure as Code to manage changes and enable rollback

Key practices:

  • Deploy across multiple Availability Zones for high availability
  • Use load balancers to distribute traffic and detect failures
  • Implement auto-scaling for compute resources based on demand
  • Back up data regularly and test restoration procedures
  • Design for graceful degradation when dependencies fail (see Resilience)
  • Use exponential backoff and jitter for retries
  • Implement circuit breakers for external dependencies
  • Set appropriate resource quotas and limits
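The retry guidance above (exponential backoff with jitter) can be sketched as "full jitter" backoff: each retry waits a random duration between zero and an exponentially growing, capped ceiling. Parameters here are illustrative defaults, not AWS SDK values:

```python
import random

# "Full jitter" exponential backoff: each retry waits a random amount
# between 0 and an exponentially growing cap. Illustrative defaults.
def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 20.0,
                   rng=random) -> list:
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))     # full jitter
    return delays

delays = backoff_delays(6)
assert all(0 <= d <= 20.0 for d in delays)
```

Jitter spreads retries out so that many clients recovering from the same failure don't retry in lockstep and re-overwhelm the dependency.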

Availability targets:

| Availability | Downtime per Year | Downtime per Month | Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Internal tools |
| 99.9% | 8.76 hours | 43.8 minutes | Standard applications |
| 99.95% | 4.38 hours | 21.9 minutes | Business-critical |
| 99.99% | 52.6 minutes | 4.38 minutes | Mission-critical |

Design for the availability your business requires - achieving 99.99% is significantly more expensive than 99.9%.
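The downtime figures above follow directly from the availability percentage, as this small conversion shows:

```python
# Convert an availability target into allowed downtime per year.
def downtime_per_year_hours(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * 365 * 24

print(round(downtime_per_year_hours(99.9), 2))        # 8.76 hours/year
print(round(downtime_per_year_hours(99.99) * 60, 1))  # 52.6 minutes/year
```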

Performance Efficiency

Performance efficiency focuses on using computing resources efficiently to meet requirements and maintaining efficiency as demand changes.

Design principles:

  • Democratize advanced technologies: Use managed services for complex capabilities (ML, analytics, databases)
  • Go global in minutes: Deploy in multiple regions to reduce latency for global users
  • Use serverless architectures: Eliminate operational burden and pay only for value delivered
  • Experiment more often: Virtual resources make experimentation cost-effective
  • Consider mechanical sympathy: Understand how services work to use them appropriately

Key practices:

  • Choose appropriate compute types for workload characteristics (CPU-optimized, memory-optimized, GPU)
  • Use caching at multiple layers to reduce redundant processing (see Caching)
  • Implement CDNs (CloudFront) for global content delivery
  • Use managed databases that handle scaling, replication, and optimization
  • Monitor performance metrics and set alarms for degradation
  • Regularly review and benchmark performance under realistic load (see Performance Testing)

See Performance Optimization for detailed optimization strategies.

Cost Optimization

Cost optimization focuses on avoiding unnecessary costs while achieving business outcomes.

Design principles:

  • Implement cloud financial management: Establish cost awareness, control, and optimization practices
  • Adopt a consumption model: Pay only for resources you consume
  • Measure overall efficiency: Monitor business outcomes per dollar spent
  • Stop spending on undifferentiated heavy lifting: Use managed services to reduce operational costs
  • Analyze and attribute expenditure: Understand where money is spent and who is responsible

Key practices:

  • Tag resources for cost allocation by project, team, environment
  • Use auto-scaling to match capacity to demand
  • Choose appropriate instance types and sizes
  • Use Spot instances for fault-tolerant workloads
  • Purchase reserved instances or savings plans for baseline capacity
  • Implement lifecycle policies for data storage (S3 Glacier, automated deletion)
  • Review Cost Explorer regularly and set budgets with alerts
  • Shut down non-production resources outside business hours

See Cost Optimization section below for detailed strategies.

Sustainability

The sustainability pillar focuses on minimizing environmental impact of cloud workloads.

Design principles:

  • Understand your impact: Measure and monitor cloud resource efficiency
  • Establish sustainability goals: Set long-term objectives for each workload
  • Maximize utilization: Right-size resources and minimize idle capacity
  • Anticipate and adopt efficient offerings: Use new, more efficient services and instances
  • Use managed services: Leverage AWS's efficiency at scale
  • Reduce downstream impact: Reduce data transfer and storage requirements

Key practices:

  • Right-size instances to avoid over-provisioning
  • Use auto-scaling to shut down unused capacity
  • Choose newer instance types with better performance-per-watt
  • Use Graviton (ARM-based) instances for better efficiency
  • Implement data lifecycle policies to delete unnecessary data
  • Use efficient architectures (serverless, containers) rather than always-on VMs
  • Optimize code and queries to reduce CPU/memory consumption

AWS's infrastructure is already more efficient than typical on-premises data centers due to economies of scale and renewable energy investments. Optimization benefits both cost and sustainability.


Service Quotas and Limits

AWS imposes limits on resources to protect the platform and prevent runaway costs from misconfigurations. Understanding and managing these limits prevents deployment failures.

Types of Limits

Soft limits (service quotas): Can be increased by requesting quota increases through the console or API. Most operational limits fall into this category.

Examples:

  • EC2 On-Demand instances per region (limits are now vCPU-based quotas; defaults vary by instance family)
  • RDS database instances (default: 40 per region)
  • VPCs per region (default: 5)
  • Load balancers per region (default: 50)

Hard limits: Fixed limits that cannot be increased. These are fundamental to service design.

Examples:

  • S3 bucket names must be globally unique
  • Maximum object size in S3: 5TB
  • Lambda function timeout: 15 minutes maximum
  • Maximum VPC CIDR size: /16 (65,536 IPs)

Managing Quotas

Monitor usage proactively: AWS Service Quotas service provides a dashboard showing usage against limits. Set CloudWatch alarms when approaching quotas (e.g., alert at 80% of limit).

Request increases before you need them: Quota increase requests can take hours or days. Don't wait until you hit limits during a critical deployment.

Design for limits: Understand service limits during architecture design. For example, if you need 200 EC2 instances but the default limit is 20, plan quota increases early or consider alternatives (ECS, Lambda, auto-scaling groups).

Common quota issues:

  • Elastic IPs: Default limit is 5 per region. Use load balancers instead of assigning EIPs to every instance
  • VPC peering connections: Limited per VPC. Consider Transit Gateway for complex networking
  • API rate limits: Throttling occurs if you exceed API call rates. Implement exponential backoff and batch operations where possible

AWS Support Tiers

AWS offers multiple support tiers with different response times, access to support engineers, and additional services.

| Tier | Cost | Use Case | TAM | Response Time (Critical) |
|---|---|---|---|---|
| Basic | Free | Learning, testing | No | None |
| Developer | $29+/month | Development environments | No | 12 business hours |
| Business | $100+/month or 10% of spend | Production workloads | No | 1 hour |
| Enterprise On-Ramp | $5,500+/month | Business-critical | Pool | 30 minutes |
| Enterprise | $15,000+/month | Mission-critical | Dedicated | 15 minutes |

Basic Support: Included with all accounts. Provides:

  • 24/7 access to customer service, documentation, whitepapers
  • AWS Personal Health Dashboard
  • Trusted Advisor (limited checks)

Adequate for learning and non-production experimentation.

Developer Support: For development and testing:

  • Business hours email access to Cloud Support Associates
  • General guidance on service usage
  • Limited architecture support

Not suitable for production - no 24/7 support or fast response times.

Business Support: Minimum recommended for production workloads:

  • 24/7 phone, web, and chat access to Cloud Support Engineers
  • Full Trusted Advisor checks (cost optimization, security, performance)
  • Infrastructure Event Management for additional cost
  • Third-party software support
  • API support for automation

Enterprise Support: For mission-critical workloads:

  • Technical Account Manager (TAM) for proactive guidance
  • Concierge Support Team for billing and account assistance
  • Infrastructure Event Management included
  • Operational reviews and recommendations
  • White-glove case routing to senior engineers

When to escalate:

Escalate support cases when:

  • Production outage with user impact (use "production system down" severity)
  • Security incidents requiring immediate attention
  • Quota increases needed urgently
  • Guidance needed on complex architectural decisions

Provide detailed information in tickets: error messages, timestamps, affected resources, troubleshooting already performed. Better context leads to faster resolution.


AWS Cost Management

Effective cost management is critical to preventing cloud spending from spiraling out of control. AWS provides extensive tools for visibility, budgeting, and optimization.

Cost Explorer

Cost Explorer provides visualization and analysis of your AWS spending.

Key features:

  • Historical spending analysis (up to 12 months)
  • Forecasting based on trends
  • Filtering by service, region, account, tags, instance type
  • Grouping and pivot tables for multi-dimensional analysis
  • Savings recommendations (reserved instances, right-sizing)

Best practices:

  • Review Cost Explorer weekly during initial deployments
  • Set up saved reports for common analysis (spend by environment, team, project)
  • Use monthly budgets to track spending trends
  • Identify cost anomalies early before they accumulate

Cost Allocation Tags

Tags enable granular cost tracking by applying metadata to resources.

Recommended tagging strategy:

Environment: production | staging | development
Project: project-name
Team: team-name
CostCenter: business-unit-identifier
Owner: engineer-email
CreatedBy: automation | manual
Purpose: description of resource purpose

Tag enforcement:

Use Service Control Policies or AWS Config rules to require tags on new resources:

  • Block resource creation without required tags
  • Automated tagging via Lambda for certain resources
  • Tag compliance dashboards showing untagged resources

Tags must be applied at creation for some resources (EC2, RDS). Use Infrastructure as Code to ensure consistent tagging (see Terraform).
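A tag-compliance check mirroring the strategy above can be as simple as a set difference (the required-tag set and resource shape here are assumptions for illustration, not an AWS API):

```python
# Hypothetical tag-compliance check; required-tag set is an example.
REQUIRED_TAGS = {"Environment", "Project", "Team", "CostCenter"}

def missing_tags(resource_tags: dict) -> set:
    """Return required tag keys absent from a resource's tags."""
    return REQUIRED_TAGS - resource_tags.keys()

tags = {"Environment": "production", "Project": "checkout", "Team": "payments"}
print(missing_tags(tags))  # {'CostCenter'}
```

The same logic can back an AWS Config custom rule or a pre-deployment check in CI.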

AWS Budgets

AWS Budgets allows you to set custom cost and usage budgets with alerts when thresholds are exceeded.

Budget types:

  • Cost budgets: Alert when spending exceeds amount
  • Usage budgets: Alert when usage (hours, GB, requests) exceeds threshold
  • Reservation budgets: Track reserved instance utilization
  • Savings Plans budgets: Monitor savings plan utilization

Budget best practices:

  • Set account-level budgets for overall spending control
  • Set project/team-specific budgets using tags
  • Create alerts at 50%, 80%, and 100% of budget
  • Alert multiple recipients (engineers, managers, finance)
  • Use forecasted spend alerts to catch trends before month end

Automated actions:

Budgets can trigger automated responses:

  • Stop EC2 instances when budget is exceeded
  • Apply restrictive IAM policies to limit spending
  • Send notifications to SNS topics for custom workflows

Use automated actions cautiously - accidentally stopping production resources is worse than cost overruns.

Cost Optimization Strategies

Right-sizing: Analyze CloudWatch metrics to identify over-provisioned resources:

  • EC2 instances with low CPU utilization
  • RDS databases with excessive IOPS provisioned
  • Elastic IPs not attached to instances (charged when unused)

AWS Compute Optimizer provides right-sizing recommendations based on actual utilization patterns.

Reserved capacity: For steady-state workloads, reserved instances and savings plans offer 30-70% discounts:

  • Commit to 1 or 3 years
  • All upfront, partial upfront, or no upfront payment options
  • Apply automatically to matching usage

Start with on-demand to establish baseline usage, then reserve capacity for the stable baseline. Keep variable workloads on-demand or Spot.
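The savings from reserving a stable baseline can be estimated with straightforward arithmetic. The hourly rates below are illustrative, not current AWS pricing; 730 is the conventional average of hours per month.

```python
def reserved_savings(on_demand_hourly, reserved_hourly, hours_per_month=730):
    """Monthly dollar savings and discount percentage for a workload
    that runs continuously at the reserved rate instead of on-demand."""
    od = on_demand_hourly * hours_per_month
    ri = reserved_hourly * hours_per_month
    discount = 100 * (1 - reserved_hourly / on_demand_hourly)
    return round(od - ri, 2), round(discount, 1)

# Example: $0.10/hr on-demand vs $0.06/hr reserved (illustrative rates).
savings, pct = reserved_savings(0.10, 0.06)
print(savings, pct)  # → 29.2 40.0
```

Multiplied across a fleet, even a mid-range discount like this compounds quickly, which is why establishing the baseline with on-demand usage first matters: reserving above the baseline wastes the commitment.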

Spot instances: Use Spot instances (up to 90% discount) for fault-tolerant workloads:

  • CI/CD build workers (see Pipelines)
  • Batch processing and data analysis
  • Development and test environments

Storage lifecycle policies: Implement S3 lifecycle policies to transition data to cheaper storage classes:

  • Frequent access: S3 Standard
  • Infrequent access (monthly): S3 Standard-IA or S3 One Zone-IA
  • Archival (yearly): S3 Glacier or S3 Glacier Deep Archive
  • Automated deletion of temporary data

Similarly, delete old EBS snapshots and clean up unused volumes.
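The tiering described above maps onto an S3 lifecycle configuration in the shape accepted by `put_bucket_lifecycle_configuration`. The prefix, day counts, and rule ID below are examples; tune them to your access patterns.

```python
import json

# Illustrative lifecycle rule: move objects under logs/ to
# Standard-IA after 30 days, Deep Archive after a year, and
# delete them after three years.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 1095},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```

One caveat worth checking against pricing pages: Standard-IA and the Glacier classes carry minimum storage durations and retrieval fees, so transitioning very short-lived or frequently read data can cost more than leaving it in Standard.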

Shutdown non-production resources: Stop or terminate resources outside business hours:

  • Development and staging environments (evenings, weekends)
  • Scheduled scaling down to zero instances overnight
  • Lambda-based automation to start/stop based on schedules

This can reduce non-production costs by 65-75% with minimal effort.
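The scheduling logic behind Lambda-based start/stop automation is small. A scheduled function could evaluate a check like the one below and call EC2 start/stop accordingly; the business-hours window here (Mon-Fri, 08:00-18:00 local time) is an assumption to adjust per team.

```python
from datetime import datetime

def should_run(now, start_hour=8, stop_hour=18):
    """True during assumed business hours: weekdays between
    start_hour (inclusive) and stop_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < stop_hour

print(should_run(datetime(2024, 6, 3, 10, 0)))   # Monday 10:00 → True
print(should_run(datetime(2024, 6, 8, 10, 0)))   # Saturday 10:00 → False
print(should_run(datetime(2024, 6, 3, 22, 0)))   # Monday 22:00 → False
```

Tag-based opt-out (e.g. an `AlwaysOn` tag on exempt instances) is worth adding so on-call or demo environments survive the schedule.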

See our cost optimization guide for AWS-specific strategies and implementation details.


Getting Started with AWS

Account Setup Best Practices

When creating a new AWS account:

  1. Secure root account immediately:

    • Enable MFA on root account
    • Don't create access keys for root account
    • Use root account only for initial setup and billing - never for daily operations
    • Store root account credentials securely (password manager, secure vault)
  2. Create administrative IAM users:

    • Create IAM users for day-to-day administration
    • Assign appropriate permissions (least privilege, not full admin if possible)
    • Enable MFA on all privileged users
    • Use IAM roles for programmatic access, not long-lived keys
  3. Enable CloudTrail:

    • Create a trail logging all management events
    • Store logs in S3 with encryption and lifecycle policies
    • Enable log file validation
    • This provides audit trail for security and compliance
  4. Set up billing alerts:

    • Enable billing alerts in billing preferences
    • Create CloudWatch alarm for estimated charges
    • Set budget with alerts at reasonable thresholds
  5. Enable AWS Config:

    • Track resource configuration changes
    • Detect non-compliant resources
    • Maintain configuration history for audit
  6. Plan account structure:

    • Decide whether you need AWS Organizations and multiple accounts
    • For production usage, implement a multi-account strategy from the start
    • Use Infrastructure as Code for consistent account baseline

Learning Path

Foundational services (learn first):

  1. IAM: Identity, roles, policies, security
  2. VPC: Networking, subnets, security groups
  3. EC2: Compute instances, auto-scaling
  4. S3: Object storage, lifecycle policies
  5. RDS: Managed relational databases

Intermediate services (after foundations):

  6. ECS/Fargate or EKS: Container orchestration
  7. Lambda: Serverless functions
  8. API Gateway: API management
  9. CloudWatch: Monitoring and logging
  10. CloudFormation or Terraform: Infrastructure as Code

Advanced services (specialized needs):

  11. SQS/SNS/EventBridge: Event-driven architecture
  12. DynamoDB: NoSQL database
  13. ElastiCache: In-memory caching
  14. CloudFront: CDN and edge services
  15. Step Functions: Workflow orchestration

Focus on understanding core services deeply before exploring specialty services. A well-architected application uses foundational services effectively rather than every service available.

Official Resources

  • AWS Documentation: Comprehensive service documentation with tutorials
  • AWS Training and Certification: Free digital training and paid courses
  • AWS Well-Architected Labs: Hands-on exercises implementing best practices
  • AWS Samples on GitHub: Example architectures and code
  • AWS Blog: Service announcements and best practices
  • AWS re:Invent Videos: Conference talks on advanced topics

Anti-Patterns and Common Mistakes

Using Root Account for Daily Operations

Problem: Using root account credentials for regular tasks creates security risks - root has unlimited access and cannot be restricted.

Solution: Create IAM users/roles for daily operations. Secure root account with MFA and use only for initial setup and specific tasks requiring root (changing support plan, closing account).

Single-AZ Deployments for Production

Problem: Deploying all resources in a single AZ creates vulnerability to AZ failures, eliminating AWS's built-in redundancy.

Solution: Always deploy production workloads across multiple AZs. The marginal cost is negligible compared to availability benefits.

No Cost Monitoring or Budgets

Problem: Not monitoring costs until receiving an unexpected bill allows spending to spiral out of control.

Solution: Enable billing alerts, create budgets, and review Cost Explorer weekly during initial deployments. Tag resources for cost allocation. Implement automated alerts for anomalies.

Over-Reliance on Single Region

Problem: Deploying only in us-east-1 because it's the cheapest region ignores latency for global users and creates single point of failure.

Solution: Deploy in regions close to users. For global applications, consider multi-region architecture with traffic routing (Route 53). For most applications, multi-AZ within one region is sufficient - don't prematurely adopt multi-region complexity.

Ignoring Service Limits

Problem: Hitting service quotas during critical deployments because limits weren't considered during design.

Solution: Review service quotas during architecture design. Request increases proactively. Monitor quota utilization and set alerts before reaching limits.
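Monitoring quota utilization reduces to a ratio check once current usage and limits are known (AWS exposes limits via the Service Quotas API). The sketch below flags quotas at or above an alert threshold; the quota names and numbers are illustrative.

```python
def quota_alerts(usage_by_quota, alert_pct=80.0):
    """Return quota names whose utilization meets or exceeds the
    alert threshold. usage_by_quota maps name -> (used, limit)."""
    return sorted(
        name
        for name, (used, limit) in usage_by_quota.items()
        if limit and 100.0 * used / limit >= alert_pct
    )

# Illustrative quotas: 4 of 5 VPCs used (80%), 100 of 512 vCPUs (~20%).
quotas = {
    "vpc/VPCs per Region": (4, 5),
    "ec2/Running On-Demand Standard vCPUs": (100, 512),
}
print(quota_alerts(quotas))  # → ['vpc/VPCs per Region']
```

Wiring this into a periodic check that notifies before a deployment consumes the remaining headroom implements the "request increases proactively" advice above.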

No Infrastructure as Code

Problem: Creating resources through the console leads to configuration drift, lack of version control, and an inability to reproduce environments.

Solution: Use Terraform or CloudFormation for all infrastructure (see Terraform). Treat infrastructure as code - versioned, reviewed, tested. Reserve console for read-only troubleshooting.


Further Reading

AWS Documentation