AWS Overview and Foundation

Overview

Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. This guide provides foundational knowledge for building production applications on AWS, covering infrastructure concepts, account organization, cost management, and architectural best practices.

Understanding AWS fundamentals is essential before implementing specific services. This document establishes core concepts that apply across all AWS services - global infrastructure, account structure, the shared responsibility model, and operational principles that guide decision-making.

AWS's breadth can be overwhelming. Start with foundational services (compute, storage, networking, databases) and expand to specialized services as needs arise. This guide focuses on concepts relevant to enterprise application development rather than attempting comprehensive service coverage.

Core Principles

  • Global infrastructure: Deploy applications in multiple geographic regions with built-in redundancy through availability zones
  • Well-Architected Framework: Design systems using six pillars - operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability
  • Shared responsibility: AWS secures infrastructure; you secure your applications, data, and access controls
  • API-driven automation: Every AWS service exposes APIs enabling Infrastructure as Code and programmatic management
  • Pay-for-use pricing: Pay only for resources consumed with no upfront commitments, enabling cost optimization through right-sizing

AWS Global Infrastructure

AWS operates the largest global cloud infrastructure, purpose-built for high availability, fault tolerance, and low latency. Understanding this infrastructure is critical for architecting resilient applications.

Regions

A region is a physical location around the world where AWS clusters data centers. Each region is completely independent and isolated from other regions.

Key characteristics:

  • Geographic isolation: Regions are separated by significant distances (typically hundreds of miles) to protect against natural disasters, power outages, or regional events
  • Data sovereignty: Data stored in a region stays in that region unless you explicitly replicate it elsewhere, supporting compliance with data residency laws
  • Service availability: Not all AWS services are available in all regions; newer services typically launch in larger regions first
  • Pricing variation: Costs differ by region based on local infrastructure costs, energy prices, and market conditions
  • Independent control planes: Each region has separate API endpoints and management infrastructure

Region naming convention:

  • Format: <geographic-area>-<sub-region>-<number>
  • Examples: us-east-1 (Northern Virginia), eu-west-1 (Ireland), ap-southeast-2 (Sydney)
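As an illustration of this convention, a small helper (hypothetical, not part of any AWS SDK) can split a region code into its components:

```python
# Hypothetical helper illustrating the <geographic-area>-<sub-region>-<number>
# region naming convention; not part of any AWS SDK.
def parse_region(region: str) -> dict:
    """Split an AWS region code like 'ap-southeast-2' into its components."""
    parts = region.split("-")
    return {
        "geography": parts[0],                # e.g. 'us', 'eu', 'ap'
        "sub_region": "-".join(parts[1:-1]),  # e.g. 'east', 'southeast'
        "number": int(parts[-1]),             # e.g. 1, 2
    }

print(parse_region("ap-southeast-2"))
# {'geography': 'ap', 'sub_region': 'southeast', 'number': 2}
```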

Selecting regions:

When choosing regions, consider:

  1. Latency to end users: Deploy close to your user base for low latency. A user in Tokyo will experience significantly faster response times from ap-northeast-1 than from us-east-1
  2. Compliance requirements: Regulations may mandate data remain in specific jurisdictions (GDPR in EU, data localization laws in China, financial regulations)
  3. Service availability: Ensure required services are available in target regions
  4. Cost: Prices vary by region; us-east-1 is typically cheapest but may not be optimal for your users
  5. Disaster recovery: Choose secondary regions for failover that are geographically distant but still accessible for operations

Multi-region architecture:

Most applications start in a single region. Multi-region deployment adds significant complexity - data replication, routing, failover logic, and operational overhead. Adopt multi-region only when:

  • You have a global user base requiring low latency worldwide
  • Business continuity requirements demand protection against regional outages
  • Compliance requires data in multiple geographies

See Cloud Fundamentals for general multi-region patterns and our AWS-specific cell-based architecture guide for advanced multi-region designs.

Availability Zones (AZs)

An availability zone is one or more discrete data centers within a region, each with redundant power, networking, and connectivity. AZs are the primary mechanism for achieving high availability within AWS.

Key characteristics:

  • Physical separation: Each AZ is housed in a separate facility with independent infrastructure to prevent single points of failure
  • Low-latency connectivity: AZs within a region are connected via high-bandwidth, low-latency private fiber optic networking (typically < 2ms round-trip)
  • Synchronous replication: Low enough latency for synchronous database replication without performance degradation
  • Separate failure domains: Designed so failures (power, cooling, network) in one AZ don't cascade to others
  • Redundant connectivity: Multiple network paths to the internet and AWS backbone

Practical implications:

Most regions have at least three AZs (some have six or more); a few older regions expose fewer to new accounts. When you launch resources, you specify which AZ to use or let AWS distribute them automatically.

High availability design pattern:

The standard pattern for high availability is:

  1. Deploy resources across at least three AZs
  2. Use load balancers to distribute traffic across AZs
  3. Configure databases for multi-AZ automatic failover
  4. Design applications to handle AZ failures gracefully

In this architecture, if one AZ fails completely:

  • The load balancer stops routing traffic to instances in the failed AZ
  • The database automatically promotes a standby in a healthy AZ to primary
  • Users experience no downtime beyond brief connection resets

Cost considerations:

Deploying across AZs is essentially free - compute and storage prices are the same in every AZ. However, data transfer between AZs incurs a small charge (about $0.01/GB in each direction, so roughly $0.02/GB for a round trip). For most applications, this cost is negligible compared to the availability benefits.
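For a rough sense of scale, a back-of-envelope calculation (assuming roughly $0.01/GB charged in each direction, consistent with the range above):

```python
# Back-of-envelope cross-AZ data transfer cost, assuming $0.01/GB
# charged in each direction (so $0.02/GB for a round trip).
def cross_az_monthly_cost(gb_per_month: float, rate_per_gb_each_way: float = 0.01) -> float:
    """Cost of replicating gb_per_month across AZs, billed in both directions."""
    return gb_per_month * rate_per_gb_each_way * 2

# 1 TB/month of synchronous replication traffic:
print(f"${cross_az_monthly_cost(1024):.2f}")  # $20.48
```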

AZ naming inconsistency:

AZ names (us-east-1a, us-east-1b) map to different physical locations for different AWS accounts. This prevents all customers from clustering in "AZ-A." Use AZ IDs (use1-az1) when coordinating across accounts, though this is rarely necessary.

Edge Locations and Content Delivery

Edge locations are AWS points of presence (PoPs) distributed globally for content delivery and network optimization.

Services using edge locations:

  • CloudFront: Content delivery network (CDN) caching static and dynamic content close to users
  • Route 53: DNS service with global distribution for low-latency name resolution
  • AWS Global Accelerator: Network optimization routing traffic through AWS's private network
  • AWS WAF: Web application firewall deployed at edge for DDoS protection

AWS operates over 450 edge locations across 90+ cities worldwide - far more than the 33+ regions. This enables low-latency content delivery even for applications deployed in a single region.

How edge locations work:

When a user requests content served through CloudFront:

  1. Request goes to the nearest edge location
  2. If content is cached (cache hit), edge serves it directly with minimal latency
  3. If not cached (cache miss), edge fetches from origin (your application/storage in a region)
  4. Edge caches content for subsequent requests
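The hit/miss flow above can be sketched as a minimal cache in front of an origin (purely illustrative; not how CloudFront is implemented internally):

```python
# Minimal sketch of the edge-cache flow: serve from cache on a hit,
# fetch from origin and populate the cache on a miss.
class EdgeCache:
    def __init__(self, fetch_from_origin):
        self.cache = {}
        self.fetch_from_origin = fetch_from_origin  # callable: path -> content
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        if path in self.cache:            # cache hit: serve directly
            self.hits += 1
            return self.cache[path]
        self.misses += 1                  # cache miss: go to origin
        content = self.fetch_from_origin(path)
        self.cache[path] = content        # populate for subsequent requests
        return content

edge = EdgeCache(lambda path: f"origin-content-for-{path}")
edge.get("/logo.png")          # miss: fetched from origin
edge.get("/logo.png")          # hit: served from the edge
print(edge.hits, edge.misses)  # 1 1
```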

This pattern dramatically reduces latency for static assets (images, videos, JavaScript, CSS) and reduces load on origin servers. See our CloudFront guide for implementation details.

Wavelength Zones and Local Zones

AWS offers additional infrastructure types for specialized use cases.

Wavelength Zones: Wavelength embeds AWS compute and storage inside telecommunications providers' 5G networks, enabling ultra-low latency for mobile edge applications.

Use cases:

  • AR/VR applications requiring < 10ms latency
  • Real-time gaming
  • Live video streaming with low lag
  • Industrial IoT with immediate response requirements

Local Zones: Local Zones extend AWS regions into metropolitan areas, providing single-digit millisecond latency to nearby users.

Use cases:

  • Media and entertainment workloads requiring low latency rendering
  • Live video streaming and production
  • Real-time multiplayer gaming
  • Hybrid cloud extending on-premises data centers

Most applications don't need Wavelength or Local Zones. Standard multi-AZ deployments in regions provide sufficient availability and performance. Explore these only if you have specific ultra-low-latency requirements.


AWS Account Structure and Organizations

Proper account structure is fundamental to security, cost management, and organizational governance. AWS Organizations enables centralized management of multiple AWS accounts.

Single Account vs Multi-Account Strategy

Single account: Simple to manage but creates challenges as organizations grow:

  • Difficult to isolate environments (dev, staging, production)
  • No blast radius containment - compromised credentials can access everything
  • Complex IAM policies trying to segregate access
  • Mixed billing makes cost allocation difficult

Multi-account strategy: The recommended approach uses separate AWS accounts for different purposes:

  • Improved security: Account boundaries provide strong isolation; compromised credentials in dev can't access production
  • Clear cost allocation: Each account has separate billing, making costs transparent per environment/team
  • Blast radius containment: Resource limits and failures are contained within accounts
  • Regulatory compliance: Separate accounts for PCI/HIPAA workloads simplify audit scope
  • Simplified IAM: Policies within accounts are simpler when you don't need to prevent access to other environments

AWS Organizations

AWS Organizations is a free service that enables centralized management of multiple AWS accounts.

Key features:

  1. Consolidated billing: All accounts roll up to a single bill, enabling volume discounts and unified cost tracking
  2. Service Control Policies (SCPs): Define guardrails that apply across accounts (e.g., "no one can disable CloudTrail logging")
  3. Organizational units (OUs): Group accounts logically (by environment, business unit, or compliance zone) and apply policies hierarchically
  4. Account creation automation: Programmatically create accounts with consistent baselines
  5. Centralized logging: Forward CloudTrail logs and AWS Config data to a central security account

Best practices:

  • Management account minimalism: Use the organization's management account only for billing and account management - don't run workloads there
  • Separate security account: Centralize logs and security tooling in a dedicated account that workload accounts can't modify
  • Environment separation: At minimum, separate dev, staging, and production into different accounts
  • Team-based accounts: For larger organizations, give each team their own AWS account for autonomy
  • Automated provisioning: Use Infrastructure as Code (see Terraform) to create accounts with consistent networking, security, and observability configurations

Service Control Policies (SCPs)

SCPs are permission boundaries applied at the organization or OU level. They define the maximum permissions available in an account - even account administrators can't exceed SCP limits.

Common SCP use cases:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail"
      ],
      "Resource": "*"
    }
  ]
}
```

This SCP prevents anyone in the account from disabling CloudTrail audit logging, even with administrator permissions. This creates a security guardrail that can't be circumvented.

Other valuable SCPs:

  • Restrict deployments to approved regions (compliance)
  • Prevent deletion of encryption keys
  • Require encryption for S3 buckets
  • Block public internet access to sensitive services
  • Enforce tagging standards for cost allocation
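As a sketch of the first item (restricting deployments to approved regions), an SCP along these lines is a commonly documented pattern. Note that global services must be exempted (the `NotAction` list here is illustrative, not exhaustive), so treat this as a starting point rather than a drop-in policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "route53:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
        }
      }
    }
  ]
}
```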

SCPs are a powerful security control. Start with permissive policies and add restrictions as you identify risks. See our IAM guide for detailed policy management.

Multi-Account Cost Allocation

With consolidated billing, you can:

  • Track costs per account (automatically)
  • Apply cost allocation tags across accounts
  • Share reserved instance and savings plan discounts
  • Set budgets per account or organizational unit

Tag resources consistently across accounts with:

  • Environment: dev, staging, production
  • Team: team-name
  • Project: project-identifier
  • CostCenter: for chargeback to business units

See Cost Optimization below for detailed cost management strategies.


AWS Shared Responsibility Model

The shared responsibility model defines which security and compliance tasks AWS handles and which you must implement. This model applies universally but varies by service type.

What AWS Manages

AWS is responsible for securing the underlying infrastructure that runs all services:

Physical and environmental:

  • Data center physical security (access control, surveillance, biometrics)
  • Power and cooling redundancy
  • Hardware lifecycle and disposal (drives destroyed before leaving facilities)
  • Physical network infrastructure

Infrastructure software:

  • Hypervisor security and isolation between customer instances
  • Managed service infrastructure (RDS host OS, Lambda execution environment)
  • Foundation services (EC2, S3, VPC networking hardware)
  • Global network backbone

Compliance infrastructure: AWS maintains certifications (SOC 1/2/3, ISO 27001, PCI-DSS Level 1, FedRAMP, HIPAA) for infrastructure. You inherit these certifications but must configure services correctly to maintain compliance.

What You Manage

You are responsible for securing everything you deploy and configure on AWS infrastructure:

Identity and access management:

  • Creating and managing IAM users, roles, and policies
  • Implementing least privilege access controls
  • Enabling and enforcing MFA for privileged accounts
  • Rotating credentials and access keys regularly
  • Federating identities from corporate directories

See Authorization for access control patterns and our IAM guide for AWS-specific implementation.

Data protection:

  • Classifying data based on sensitivity
  • Encrypting data at rest using KMS or customer-managed keys
  • Encrypting data in transit using TLS
  • Managing encryption keys and rotation policies
  • Implementing backup and retention strategies
  • Secure data disposal when no longer needed

See Data Protection for encryption strategies and Secrets Management for credential handling.

Application security:

  • Writing secure code without vulnerabilities (SQL injection, XSS, CSRF)
  • Managing application dependencies and patching CVEs
  • Implementing input validation and output encoding
  • Securing APIs with authentication and rate limiting
  • Logging security events for audit and incident response

See Security Testing and Input Validation for application security practices.

Network configuration:

  • Designing VPC architecture with proper subnet isolation
  • Configuring security groups and network ACLs correctly
  • Not exposing resources publicly unless necessary
  • Implementing VPC endpoints for private service access
  • Monitoring network traffic with VPC Flow Logs

Our AWS networking guide covers these patterns in detail.

Operating system (for IaaS like EC2):

  • Installing OS security patches
  • Configuring host firewalls
  • Hardening OS configurations
  • Managing local users and SSH keys
  • Installing antivirus/intrusion detection if required

Compliance verification:

  • Ensuring your AWS configuration meets regulatory requirements
  • Generating compliance reports and evidence
  • Responding to audits
  • Implementing compensating controls as needed

Why This Matters

A common mistake is assuming AWS secures your applications and data automatically. AWS secures the infrastructure, but misconfigurations are your responsibility.

Real-world examples of customer responsibility failures:

  • Public S3 buckets: AWS provides granular access controls, but customers configure buckets as publicly accessible, exposing sensitive data
  • Overprivileged IAM policies: Granting AdministratorAccess when least privilege requires much narrower permissions
  • Unencrypted databases: AWS offers RDS encryption at rest, but it's not enabled by default - you must enable it
  • No MFA on privileged accounts: AWS supports MFA, but you must enforce it through policies

Security breaches due to customer misconfiguration are common and entirely preventable through proper configuration and governance. Treat AWS security as an active responsibility, not a passive benefit.


AWS Well-Architected Framework

The AWS Well-Architected Framework provides best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. This framework codifies lessons from thousands of customer architectures.

Operational Excellence

Operational excellence focuses on running and monitoring systems to deliver business value and continually improving processes and procedures.

Design principles:

  • Perform operations as code: Define infrastructure, configuration, and runbooks as code to enable versioning, testing, and automation (see Terraform)
  • Make frequent, small, reversible changes: Deploy changes incrementally to reduce blast radius and enable fast rollback
  • Refine operations procedures frequently: Regularly review and improve runbooks, automation, and incident response
  • Anticipate failure: Conduct pre-mortems, game days, and chaos engineering to identify weaknesses (see Chaos Engineering)
  • Learn from operational events: Conduct blameless post-mortems and share lessons across teams (see Incident Post-Mortems)

Key practices:

  • Use Infrastructure as Code for all resources - never click through console for production changes
  • Implement comprehensive observability with logs, metrics, and traces (see Observability)
  • Automate responses to common events (auto-scaling, automated remediation)
  • Test operational procedures regularly - don't wait for production incidents to discover gaps
  • Create dashboards showing business and technical metrics

Operational excellence is about treating operations with the same rigor as software development - versioned, tested, and continuously improved.

Security

The security pillar focuses on protecting information, systems, and assets while delivering business value.

Design principles:

  • Implement a strong identity foundation: Enforce least privilege access with centralized identity management and eliminate long-lived credentials (see Authorization)
  • Enable traceability: Monitor and audit all actions and changes to enable investigation and compliance
  • Apply security at all layers: Defense in depth - secure network, application, data, and infrastructure layers
  • Automate security best practices: Implement software-defined security controls that can be versioned and tested
  • Protect data in transit and at rest: Classify data and use encryption, tokenization, and access control
  • Keep people away from data: Eliminate direct access to data, using automated tools instead
  • Prepare for security events: Have incident response plans, tools, and practiced procedures

Key practices:

  • Enable MFA for all privileged accounts
  • Use IAM roles rather than long-lived access keys
  • Encrypt sensitive data at rest and in transit
  • Enable CloudTrail for audit logging across all accounts
  • Use AWS Config to monitor resource configurations
  • Implement automated compliance checking (AWS Security Hub, custom rules)
  • Conduct regular security assessments and penetration testing

See Security Overview for comprehensive security patterns.

Reliability

Reliability ensures a workload performs its intended function correctly and consistently when expected.

Design principles:

  • Automatically recover from failure: Monitor systems for KPI thresholds and trigger automated recovery
  • Test recovery procedures: Regularly test failover, backup restoration, and disaster recovery plans
  • Scale horizontally: Distribute requests across multiple small resources rather than one large resource
  • Stop guessing capacity: Use auto-scaling based on actual demand
  • Manage change through automation: Use Infrastructure as Code to manage changes and enable rollback

Key practices:

  • Deploy across multiple Availability Zones for high availability
  • Use load balancers to distribute traffic and detect failures
  • Implement auto-scaling for compute resources based on demand
  • Back up data regularly and test restoration procedures
  • Design for graceful degradation when dependencies fail (see Resilience)
  • Use exponential backoff and jitter for retries
  • Implement circuit breakers for external dependencies
  • Set appropriate resource quotas and limits
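The retry guidance above (exponential backoff with jitter) can be sketched as "full jitter" backoff: each retry waits a random duration between zero and an exponentially growing, capped ceiling. Parameters here are illustrative defaults, not AWS SDK values:

```python
import random

# "Full jitter" exponential backoff: each retry waits a random amount
# between 0 and an exponentially growing cap. Illustrative defaults.
def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 20.0,
                   rng=random) -> list:
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))     # full jitter
    return delays

delays = backoff_delays(6)
assert all(0 <= d <= 20.0 for d in delays)
```

Jitter spreads retries out so that many clients recovering from the same failure don't retry in lockstep and re-overwhelm the dependency.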

Availability targets:

| Availability | Downtime per Year | Downtime per Month | Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Internal tools |
| 99.9% | 8.76 hours | 43.8 minutes | Standard applications |
| 99.95% | 4.38 hours | 21.9 minutes | Business-critical |
| 99.99% | 52.6 minutes | 4.38 minutes | Mission-critical |

Design for the availability your business requires - achieving 99.99% is significantly more expensive than 99.9%.
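The downtime figures above follow directly from the availability percentage, as this small conversion shows:

```python
# Convert an availability target into allowed downtime per year.
def downtime_per_year_hours(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * 365 * 24

print(round(downtime_per_year_hours(99.9), 2))        # 8.76 hours/year
print(round(downtime_per_year_hours(99.99) * 60, 1))  # 52.6 minutes/year
```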

Performance Efficiency

Performance efficiency focuses on using computing resources efficiently to meet requirements and maintaining efficiency as demand changes.

Design principles:

  • Democratize advanced technologies: Use managed services for complex capabilities (ML, analytics, databases)
  • Go global in minutes: Deploy in multiple regions to reduce latency for global users
  • Use serverless architectures: Eliminate operational burden and pay only for value delivered
  • Experiment more often: Virtual resources make experimentation cost-effective
  • Consider mechanical sympathy: Understand how services work to use them appropriately

Key practices:

  • Choose appropriate compute types for workload characteristics (CPU-optimized, memory-optimized, GPU)
  • Use caching at multiple layers to reduce redundant processing (see Caching)
  • Implement CDNs (CloudFront) for global content delivery
  • Use managed databases that handle scaling, replication, and optimization
  • Monitor performance metrics and set alarms for degradation
  • Regularly review and benchmark performance under realistic load (see Performance Testing)

See Performance Optimization for detailed optimization strategies.

Cost Optimization

Cost optimization focuses on avoiding unnecessary costs while achieving business outcomes.

Design principles:

  • Implement cloud financial management: Establish cost awareness, control, and optimization practices
  • Adopt a consumption model: Pay only for resources you consume
  • Measure overall efficiency: Monitor business outcomes per dollar spent
  • Stop spending on undifferentiated heavy lifting: Use managed services to reduce operational costs
  • Analyze and attribute expenditure: Understand where money is spent and who is responsible

Key practices:

  • Tag resources for cost allocation by project, team, environment
  • Use auto-scaling to match capacity to demand
  • Choose appropriate instance types and sizes
  • Use Spot instances for fault-tolerant workloads
  • Purchase reserved instances or savings plans for baseline capacity
  • Implement lifecycle policies for data storage (S3 Glacier, automated deletion)
  • Review Cost Explorer regularly and set budgets with alerts
  • Shut down non-production resources outside business hours

See Cost Optimization section below for detailed strategies.

Sustainability

The sustainability pillar focuses on minimizing environmental impact of cloud workloads.

Design principles:

  • Understand your impact: Measure and monitor cloud resource efficiency
  • Establish sustainability goals: Set long-term objectives for each workload
  • Maximize utilization: Right-size resources and minimize idle capacity
  • Anticipate and adopt efficient offerings: Use new, more efficient services and instances
  • Use managed services: Leverage AWS's efficiency at scale
  • Reduce downstream impact: Reduce data transfer and storage requirements

Key practices:

  • Right-size instances to avoid over-provisioning
  • Use auto-scaling to shut down unused capacity
  • Choose newer instance types with better performance-per-watt
  • Use Graviton (ARM-based) instances for better efficiency
  • Implement data lifecycle policies to delete unnecessary data
  • Use efficient architectures (serverless, containers) rather than always-on VMs
  • Optimize code and queries to reduce CPU/memory consumption

AWS's infrastructure is already more efficient than typical on-premises data centers due to economies of scale and renewable energy investments. Optimization benefits both cost and sustainability.


Service Quotas and Limits

AWS imposes limits on resources to protect the platform and prevent runaway costs from misconfigurations. Understanding and managing these limits prevents deployment failures.

Types of Limits

Soft limits (service quotas): Can be increased by requesting quota increases through the console or API. Most operational limits fall into this category.

Examples:

  • EC2 On-Demand instances per region (limits are now vCPU-based quotas; defaults vary by instance family)
  • RDS database instances (default: 40 per region)
  • VPCs per region (default: 5)
  • Load balancers per region (default: 50)

Hard limits: Fixed limits that cannot be increased. These are fundamental to service design.

Examples:

  • S3 bucket names must be globally unique
  • Maximum object size in S3: 5TB
  • Lambda function timeout: 15 minutes maximum
  • Maximum VPC CIDR size: /16 (65,536 IPs)

Managing Quotas

Monitor usage proactively: AWS Service Quotas service provides a dashboard showing usage against limits. Set CloudWatch alarms when approaching quotas (e.g., alert at 80% of limit).

Request increases before you need them: Quota increase requests can take hours or days. Don't wait until you hit limits during a critical deployment.

Design for limits: Understand service limits during architecture design. For example, if you need 200 EC2 instances but the default limit is 20, plan quota increases early or consider alternatives (ECS, Lambda, auto-scaling groups).

Common quota issues:

  • Elastic IPs: Default limit is 5 per region. Use load balancers instead of assigning EIPs to every instance
  • VPC peering connections: Limited per VPC. Consider Transit Gateway for complex networking
  • API rate limits: Throttling occurs if you exceed API call rates. Implement exponential backoff and batch operations where possible

AWS Support Tiers

AWS offers multiple support tiers with different response times, access to support engineers, and additional services.

| Tier | Cost | Use Case | TAM | Response Time (Critical) |
|---|---|---|---|---|
| Basic | Free | Learning, testing | No | None |
| Developer | $29+/month | Development environments | No | 12 business hours |
| Business | $100+/month or 10% of spend | Production workloads | No | 1 hour |
| Enterprise On-Ramp | $5,500+/month | Business-critical | Pool | 30 minutes |
| Enterprise | $15,000+/month | Mission-critical | Dedicated | 15 minutes |

Basic Support: Included with all accounts. Provides:

  • 24/7 access to customer service, documentation, whitepapers
  • AWS Personal Health Dashboard
  • Trusted Advisor (limited checks)

Adequate for learning and non-production experimentation.

Developer Support: For development and testing:

  • Business hours email access to Cloud Support Associates
  • General guidance on service usage
  • Limited architecture support

Not suitable for production - no 24/7 support or fast response times.

Business Support: Minimum recommended for production workloads:

  • 24/7 phone, web, and chat access to Cloud Support Engineers
  • Full Trusted Advisor checks (cost optimization, security, performance)
  • Infrastructure Event Management for additional cost
  • Third-party software support
  • API support for automation

Enterprise Support: For mission-critical workloads:

  • Technical Account Manager (TAM) for proactive guidance
  • Concierge Support Team for billing and account assistance
  • Infrastructure Event Management included
  • Operational reviews and recommendations
  • White-glove case routing to senior engineers

When to escalate:

Escalate support cases when:

  • Production outage with user impact (use "production system down" severity)
  • Security incidents requiring immediate attention
  • Quota increases needed urgently
  • Guidance needed on complex architectural decisions

Provide detailed information in tickets: error messages, timestamps, affected resources, troubleshooting already performed. Better context leads to faster resolution.


AWS Cost Management

Effective cost management is critical to preventing cloud spending from spiraling out of control. AWS provides extensive tools for visibility, budgeting, and optimization.

Cost Explorer

Cost Explorer provides visualization and analysis of your AWS spending.

Key features:

  • Historical spending analysis (up to 12 months)
  • Forecasting based on trends
  • Filtering by service, region, account, tags, instance type
  • Grouping and pivot tables for multi-dimensional analysis
  • Savings recommendations (reserved instances, right-sizing)

Best practices:

  • Review Cost Explorer weekly during initial deployments
  • Set up saved reports for common analysis (spend by environment, team, project)
  • Use monthly budgets to track spending trends
  • Identify cost anomalies early before they accumulate

Cost Allocation Tags

Tags enable granular cost tracking by applying metadata to resources.

Recommended tagging strategy:

Environment: production | staging | development
Project: project-name
Team: team-name
CostCenter: business-unit-identifier
Owner: engineer-email
CreatedBy: automation | manual
Purpose: description of resource purpose

Tag enforcement:

Use Service Control Policies or AWS Config rules to require tags on new resources:

  • Block resource creation without required tags
  • Automated tagging via Lambda for certain resources
  • Tag compliance dashboards showing untagged resources

Tags must be applied at creation for some resources (EC2, RDS). Use Infrastructure as Code to ensure consistent tagging (see Terraform).
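A tag-compliance check mirroring the strategy above can be as simple as a set difference (the required-tag set and resource shape here are assumptions for illustration, not an AWS API):

```python
# Hypothetical tag-compliance check; required-tag set is an example.
REQUIRED_TAGS = {"Environment", "Project", "Team", "CostCenter"}

def missing_tags(resource_tags: dict) -> set:
    """Return required tag keys absent from a resource's tags."""
    return REQUIRED_TAGS - resource_tags.keys()

tags = {"Environment": "production", "Project": "checkout", "Team": "payments"}
print(missing_tags(tags))  # {'CostCenter'}
```

The same logic can back an AWS Config custom rule or a pre-deployment check in CI.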

AWS Budgets

AWS Budgets allows you to set custom cost and usage budgets with alerts when thresholds are exceeded.

Budget types:

  • Cost budgets: Alert when spending exceeds amount
  • Usage budgets: Alert when usage (hours, GB, requests) exceeds threshold
  • Reservation budgets: Track reserved instance utilization
  • Savings Plans budgets: Monitor savings plan utilization

Budget best practices:

  • Set account-level budgets for overall spending control
  • Set project/team-specific budgets using tags
  • Create alerts at 50%, 80%, and 100% of budget
  • Alert multiple recipients (engineers, managers, finance)
  • Use forecasted spend alerts to catch trends before month end

Automated actions:

Budgets can trigger automated responses:

  • Stop EC2 instances when budget is exceeded
  • Apply restrictive IAM policies to limit spending
  • Send notifications to SNS topics for custom workflows

Use automated actions cautiously - accidentally stopping production resources is worse than cost overruns.

Cost Optimization Strategies

Right-sizing: Analyze CloudWatch metrics to identify over-provisioned resources:

  • EC2 instances with low CPU utilization
  • RDS databases with excessive IOPS provisioned
  • Elastic IPs not attached to instances (charged when unused)

AWS Compute Optimizer provides right-sizing recommendations based on actual utilization patterns.

Reserved capacity: For steady-state workloads, reserved instances and savings plans offer 30-70% discounts:

  • Commit to 1 or 3 years
  • All upfront, partial upfront, or no upfront payment options
  • Apply automatically to matching usage

Start with on-demand to establish baseline usage, then reserve capacity for the stable baseline. Keep variable workloads on-demand or Spot.
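The savings from reserving a stable baseline can be estimated with straightforward arithmetic. The hourly rates below are illustrative, not current AWS pricing; 730 is the conventional average of hours per month.

```python
def reserved_savings(on_demand_hourly, reserved_hourly, hours_per_month=730):
    """Monthly dollar savings and discount percentage for a workload
    that runs continuously at the reserved rate instead of on-demand."""
    od = on_demand_hourly * hours_per_month
    ri = reserved_hourly * hours_per_month
    discount = 100 * (1 - reserved_hourly / on_demand_hourly)
    return round(od - ri, 2), round(discount, 1)

# Example: $0.10/hr on-demand vs $0.06/hr reserved (illustrative rates).
savings, pct = reserved_savings(0.10, 0.06)
print(savings, pct)  # → 29.2 40.0
```

Multiplied across a fleet, even a mid-range discount like this compounds quickly, which is why establishing the baseline with on-demand usage first matters: reserving above the baseline wastes the commitment.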

Spot instances: Use Spot instances (up to 90% discount) for fault-tolerant workloads:

  • CI/CD build workers (see Pipelines)
  • Batch processing and data analysis
  • Development and test environments

Storage lifecycle policies: Implement S3 lifecycle policies to transition data to cheaper storage classes:

  • Frequent access: S3 Standard
  • Infrequent access (monthly): S3 Standard-IA or S3 One Zone-IA
  • Archival (yearly): S3 Glacier or S3 Glacier Deep Archive
  • Automated deletion of temporary data

Similarly, delete old EBS snapshots and clean up unused volumes.
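The tiering described above maps onto an S3 lifecycle configuration in the shape accepted by `put_bucket_lifecycle_configuration`. The prefix, day counts, and rule ID below are examples; tune them to your access patterns.

```python
import json

# Illustrative lifecycle rule: move objects under logs/ to
# Standard-IA after 30 days, Deep Archive after a year, and
# delete them after three years.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 1095},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```

One caveat worth checking against pricing pages: Standard-IA and the Glacier classes carry minimum storage durations and retrieval fees, so transitioning very short-lived or frequently read data can cost more than leaving it in Standard.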

Shutdown non-production resources: Stop or terminate resources outside business hours:

  • Development and staging environments (evenings, weekends)
  • Scheduled scaling down to zero instances overnight
  • Lambda-based automation to start/stop based on schedules

This can reduce non-production costs by 65-75% with minimal effort.
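The scheduling logic behind Lambda-based start/stop automation is small. A scheduled function could evaluate a check like the one below and call EC2 start/stop accordingly; the business-hours window here (Mon-Fri, 08:00-18:00 local time) is an assumption to adjust per team.

```python
from datetime import datetime

def should_run(now, start_hour=8, stop_hour=18):
    """True during assumed business hours: weekdays between
    start_hour (inclusive) and stop_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < stop_hour

print(should_run(datetime(2024, 6, 3, 10, 0)))   # Monday 10:00 → True
print(should_run(datetime(2024, 6, 8, 10, 0)))   # Saturday 10:00 → False
print(should_run(datetime(2024, 6, 3, 22, 0)))   # Monday 22:00 → False
```

Tag-based opt-out (e.g. an `AlwaysOn` tag on exempt instances) is worth adding so on-call or demo environments survive the schedule.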

See our cost optimization guide for AWS-specific strategies and implementation details.


Getting Started with AWS

Account Setup Best Practices

When creating a new AWS account:

  1. Secure root account immediately:

    • Enable MFA on root account
    • Don't create access keys for root account
    • Use root account only for initial setup and billing - never for daily operations
    • Store root account credentials securely (password manager, secure vault)
  2. Create administrative IAM users:

    • Create IAM users for day-to-day administration
    • Assign appropriate permissions (least privilege, not full admin if possible)
    • Enable MFA on all privileged users
    • Use IAM roles for programmatic access, not long-lived keys
  3. Enable CloudTrail:

    • Create a trail logging all management events
    • Store logs in S3 with encryption and lifecycle policies
    • Enable log file validation
    • This provides audit trail for security and compliance
  4. Set up billing alerts:

    • Enable billing alerts in billing preferences
    • Create CloudWatch alarm for estimated charges
    • Set budget with alerts at reasonable thresholds
  5. Enable AWS Config:

    • Track resource configuration changes
    • Detect non-compliant resources
    • Maintain configuration history for audit
  6. Plan account structure:

    • Decide whether you need AWS Organizations and multiple accounts
    • For production usage, implement a multi-account strategy from the start
    • Use Infrastructure as Code for consistent account baseline

Learning Path

Foundational services (learn first):

  1. IAM: Identity, roles, policies, security
  2. VPC: Networking, subnets, security groups
  3. EC2: Compute instances, auto-scaling
  4. S3: Object storage, lifecycle policies
  5. RDS: Managed relational databases

Intermediate services (after foundations):

  6. ECS/Fargate or EKS: Container orchestration
  7. Lambda: Serverless functions
  8. API Gateway: API management
  9. CloudWatch: Monitoring and logging
  10. CloudFormation or Terraform: Infrastructure as Code

Advanced services (specialized needs):

  11. SQS/SNS/EventBridge: Event-driven architecture
  12. DynamoDB: NoSQL database
  13. ElastiCache: In-memory caching
  14. CloudFront: CDN and edge services
  15. Step Functions: Workflow orchestration

Focus on understanding core services deeply before exploring specialty services. A well-architected application uses foundational services effectively rather than every service available.

Official Resources

  • AWS Documentation: Comprehensive service documentation with tutorials
  • AWS Training and Certification: Free digital training and paid courses
  • AWS Well-Architected Labs: Hands-on exercises implementing best practices
  • AWS Samples on GitHub: Example architectures and code
  • AWS Blog: Service announcements and best practices
  • AWS re:Invent Videos: Conference talks on advanced topics

Anti-Patterns and Common Mistakes

Using Root Account for Daily Operations

Problem: Using root account credentials for regular tasks creates security risks - root has unlimited access and cannot be restricted.

Solution: Create IAM users/roles for daily operations. Secure root account with MFA and use only for initial setup and specific tasks requiring root (changing support plan, closing account).

Single-AZ Deployments for Production

Problem: Deploying all resources in a single AZ creates vulnerability to AZ failures, eliminating AWS's built-in redundancy.

Solution: Always deploy production workloads across multiple AZs. The marginal cost is negligible compared to availability benefits.

No Cost Monitoring or Budgets

Problem: Not monitoring costs until receiving an unexpected bill allows spending to spiral out of control.

Solution: Enable billing alerts, create budgets, and review Cost Explorer weekly during initial deployments. Tag resources for cost allocation. Implement automated alerts for anomalies.

Over-Reliance on Single Region

Problem: Deploying only in us-east-1 because it's the cheapest region ignores latency for global users and creates single point of failure.

Solution: Deploy in regions close to users. For global applications, consider multi-region architecture with traffic routing (Route 53). For most applications, multi-AZ within one region is sufficient - don't prematurely adopt multi-region complexity.

Ignoring Service Limits

Problem: Hitting service quotas during critical deployments because limits weren't considered during design.

Solution: Review service quotas during architecture design. Request increases proactively. Monitor quota utilization and set alerts before reaching limits.
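Monitoring quota utilization reduces to a ratio check once current usage and limits are known (AWS exposes limits via the Service Quotas API). The sketch below flags quotas at or above an alert threshold; the quota names and numbers are illustrative.

```python
def quota_alerts(usage_by_quota, alert_pct=80.0):
    """Return quota names whose utilization meets or exceeds the
    alert threshold. usage_by_quota maps name -> (used, limit)."""
    return sorted(
        name
        for name, (used, limit) in usage_by_quota.items()
        if limit and 100.0 * used / limit >= alert_pct
    )

# Illustrative quotas: 4 of 5 VPCs used (80%), 100 of 512 vCPUs (~20%).
quotas = {
    "vpc/VPCs per Region": (4, 5),
    "ec2/Running On-Demand Standard vCPUs": (100, 512),
}
print(quota_alerts(quotas))  # → ['vpc/VPCs per Region']
```

Wiring this into a periodic check that notifies before a deployment consumes the remaining headroom implements the "request increases proactively" advice above.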

No Infrastructure as Code

Problem: Creating resources through the console leads to configuration drift, lack of version control, and an inability to reproduce environments.

Solution: Use Terraform or CloudFormation for all infrastructure (see Terraform). Treat infrastructure as code - versioned, reviewed, tested. Reserve console for read-only troubleshooting.


Further Reading

AWS Documentation