
Cloud Computing Fundamentals

Overview

Cloud computing represents a fundamental shift from traditional on-premises infrastructure to on-demand, scalable computing resources delivered over the internet. This guide covers platform-agnostic cloud principles that apply regardless of whether you're using AWS, Azure, Google Cloud, or other providers. Understanding these concepts is essential before diving into provider-specific implementations.

Cloud computing is not just about moving servers to someone else's data center - it's about fundamentally rethinking how we build, deploy, and scale software systems. The cloud enables new architectural patterns, development workflows, and operational models that weren't practical or possible with traditional infrastructure.

Core Principles

  • On-demand self-service: Provision resources programmatically without human intervention, enabling automation and rapid iteration
  • Broad network access: Access resources from anywhere using standard protocols, supporting distributed teams and global users
  • Resource pooling: Share physical infrastructure efficiently through virtualization while maintaining isolation and security boundaries
  • Rapid elasticity: Scale resources up or down automatically based on demand, paying only for what you use
  • Measured service: Monitor and meter resource usage transparently, enabling cost optimization and chargeback models

Cloud Service Models

Cloud services are typically categorized into four main models, each providing different levels of abstraction and control. Understanding when to use each model is crucial for making effective architectural decisions.

Infrastructure as a Service (IaaS)

IaaS provides virtualized computing resources over the internet. You get raw compute, storage, and networking primitives that you configure and manage yourself.

What you get:

  • Virtual machines (EC2, Compute Engine, Azure VMs)
  • Block storage (EBS, Persistent Disks)
  • Virtual networks (VPC, VNet)
  • Load balancers

What you manage:

  • Operating system installation, patching, and security
  • Runtime environment (Java, Node.js, Python)
  • Application deployment and configuration
  • Scaling and availability architecture
  • Security groups, firewalls, access controls

When to use IaaS:

  • You need full control over the operating system and installed software
  • You're migrating legacy applications with specific OS requirements
  • You need to run software that requires specific kernel modules or system configurations
  • You want maximum flexibility to optimize performance and cost at the infrastructure level

Trade-offs: IaaS gives you the most control but requires the most operational overhead. You're responsible for patching vulnerabilities, managing capacity, and handling failures. This model works well when you have specialized requirements that higher-level abstractions can't accommodate, but it comes with significant operational complexity.

The line between IaaS and PaaS is increasingly blurred - managed Kubernetes (like EKS or GKE) provides some platform capabilities while still giving you control over the underlying containers and orchestration.

Platform as a Service (PaaS)

PaaS abstracts away infrastructure concerns, letting you focus on application code. The platform handles operating systems, runtime environments, middleware, and often scaling.

What you get:

  • Managed runtime environments (App Engine, Elastic Beanstalk, Cloud Run)
  • Automatic scaling and load balancing
  • Integrated monitoring and logging
  • Built-in security patching
  • Development tools and CI/CD integration

What you manage:

  • Application code and dependencies
  • Configuration and environment variables
  • Data and database schemas
  • Application-level security and authentication

When to use PaaS:

  • You want to focus on business logic rather than infrastructure
  • You're building modern web applications or APIs with standard technology stacks
  • You need rapid deployment and iteration cycles
  • Your team lacks deep infrastructure expertise
  • You want built-in scalability without manual configuration

Trade-offs: PaaS reduces operational burden but limits configuration options. You're constrained by the platform's supported languages, frameworks, and deployment patterns. For many modern applications, these constraints are acceptable and even beneficial - they enforce best practices and prevent configuration drift.

PaaS is particularly powerful when combined with managed databases and message queues. For example, deploying a Spring Boot application to a PaaS with managed PostgreSQL and Redis means you're only responsible for application code - the platform handles everything else.

Software as a Service (SaaS)

SaaS delivers fully managed applications over the internet. Users access software through a web browser or API without managing any underlying infrastructure or application code.

Examples:

  • Gmail, Microsoft 365, Salesforce (user-facing applications)
  • Auth0, Okta (authentication services)
  • SendGrid, Twilio (developer services)
  • Stripe, PayPal (payment processing)

What you manage:

  • User access and permissions
  • Configuration and customization within the application
  • Data you input into the system
  • Integration with other systems

When to use SaaS:

  • You need common functionality that isn't a core differentiator (email, CRM, authentication)
  • You want zero operational overhead for specific capabilities
  • You need enterprise features (compliance, audit logs, SSO) without building them yourself
  • You're optimizing for speed to market over customization

Trade-offs: SaaS offers minimal operational burden but the strongest vendor lock-in. You're entirely dependent on the vendor's features, pricing, reliability, and roadmap. For non-differentiating capabilities like email or authentication, this trade-off is often worthwhile - building and maintaining these systems yourself is expensive and doesn't add business value.

When evaluating SaaS solutions, consider data portability, API access for integration, compliance certifications, and the vendor's financial stability and track record.

Function as a Service (FaaS) / Serverless

FaaS takes PaaS a step further by eliminating long-running servers entirely. You write functions that execute in response to events, and the platform handles everything else - provisioning, scaling, patching, and shutting down idle resources.

What you get:

  • Event-driven function execution (Lambda, Cloud Functions, Azure Functions)
  • Automatic scaling from zero to thousands of concurrent executions
  • Pay-per-invocation billing (no cost when idle)
  • Built-in fault tolerance and availability
  • Integrated event sources (HTTP, queues, databases, streams, schedules)

What you manage:

  • Function code (typically stateless and short-lived)
  • Event configuration and triggers
  • Memory allocation and timeout settings
  • Environment variables and secrets

When to use FaaS:

  • You're building event-driven architectures (see Event-Driven Architecture)
  • You have variable or unpredictable traffic patterns
  • You want to minimize costs for low-traffic services
  • You're building API backends, data processing pipelines, or automation tasks
  • You need rapid scaling without capacity planning

Trade-offs: FaaS introduces cold start latency (the delay when a function hasn't been invoked recently and must be initialized). For latency-sensitive applications, this can be problematic. You're also constrained by execution time limits (typically 15 minutes maximum) and statelessness - you can't maintain long-lived connections or in-memory state between invocations.

Despite these constraints, serverless is powerful for many use cases. For example, processing uploaded files, handling webhook callbacks, or scheduled data transformations are all excellent fits. The key is understanding the execution model: functions are short-lived, stateless, and event-driven.
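To make the execution model concrete, here is a minimal sketch of a FaaS handler in the style of AWS Lambda's Python interface (a `handler(event, context)` function). The event field names are illustrative, not part of any provider's contract:

```python
import json

def handler(event, context=None):
    """Illustrative Lambda-style handler: stateless, short-lived, event-driven.

    `event` carries the JSON-like payload from the trigger (HTTP request,
    queue message, ...); `context` carries runtime metadata. The field
    names used here ("name", "statusCode", "body") are hypothetical.
    """
    # Everything the function needs arrives in the event - no local state,
    # no assumption that this process handled the previous invocation.
    name = event.get("name", "world")
    body = {"message": f"hello, {name}"}
    # The return value is handed back to the platform, which serializes it
    # into an HTTP response, queue message, or other event sink.
    return {"statusCode": 200, "body": json.dumps(body)}
```

Because the function holds no state between invocations, the platform can run zero or thousands of copies of it concurrently without coordination.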

Serverless beyond FaaS: "Serverless" has evolved beyond just functions. Managed databases (Aurora Serverless), container platforms (Fargate), and API gateways are all "serverless" in that they auto-scale and charge based on usage. The common thread is eliminating capacity planning and paying only for actual consumption.


Cloud Deployment Models

The deployment model defines where your infrastructure runs and who manages it. This decision impacts security, compliance, cost, and operational complexity.

Public Cloud

Public cloud providers operate massive shared infrastructure accessible to any customer over the internet. This is what most people mean when they say "the cloud."

Characteristics:

  • Shared physical infrastructure with logical isolation
  • Provider manages all hardware, networking, and data centers
  • Pay-as-you-go pricing with economies of scale
  • Global availability with dozens of regional data centers
  • Rapid innovation with new services released constantly

Advantages:

  • Cost efficiency: No capital expenditure for hardware; pay only for consumption
  • Scale: Effectively unlimited capacity that scales up or down on demand
  • Global reach: Deploy close to users worldwide with minimal effort
  • Innovation velocity: Access cutting-edge services without building them yourself
  • Reliability: Provider-managed redundancy and disaster recovery

Disadvantages:

  • Shared infrastructure may not meet certain compliance requirements
  • Limited control over physical security and hardware location
  • Potential for vendor lock-in (mitigated by using portable abstractions)
  • Internet connectivity is a single point of failure

Public cloud is the default choice for most modern applications unless specific constraints require alternatives.

Private Cloud

Private cloud runs on dedicated infrastructure, either in your own data center or in a hosted facility. You get cloud-like APIs and automation but with exclusive access to hardware.

Characteristics:

  • Dedicated physical infrastructure
  • You control hardware placement, network topology, and security boundaries
  • Cloud-like APIs for provisioning and management (OpenStack, VMware, etc.)
  • Typically more expensive than public cloud due to capital costs and operational overhead

When to use private cloud:

  • Regulatory requirements mandate data cannot leave specific geographic boundaries
  • You need control over physical hardware for security or compliance
  • You have sustained, predictable workloads where dedicated capacity is cost-effective
  • You're migrating from traditional data centers and need time to adopt cloud patterns

Reality check: Building and operating a private cloud is expensive and complex. You need expertise in networking, storage, virtualization, and automation. Many organizations overestimate the benefits and underestimate the costs. Unless you have compelling regulatory or scale reasons, public cloud is usually more practical.

Hybrid Cloud

Hybrid cloud integrates public and private infrastructure, allowing workloads to move between environments based on cost, performance, or compliance requirements.

Characteristics:

  • Seamless integration between on-premises and cloud environments
  • Workload portability across environments
  • Consistent management and security policies
  • Hybrid connectivity (VPN, Direct Connect, ExpressRoute)

When to use hybrid cloud:

  • You're migrating from on-premises to cloud gradually
  • Some data must remain on-premises due to regulation or latency
  • You want to burst to public cloud during demand spikes
  • You need disaster recovery with on-premises as primary or backup

Common patterns:

  • Development in cloud, production on-premises: Test and staging in public cloud for cost efficiency, production on-premises for compliance
  • Data processing in cloud, data storage on-premises: Leverage cloud compute for analytics while keeping sensitive data local
  • Disaster recovery: Primary workloads on-premises, fail over to cloud in disaster scenarios

Challenges: Hybrid cloud adds significant complexity - network connectivity, identity federation, security policy synchronization, and operational overhead across environments. Ensure the benefits justify this complexity. For more on connectivity patterns, see cloud provider-specific networking documentation.

Multi-Cloud

Multi-cloud means using services from multiple public cloud providers simultaneously. This is distinct from hybrid cloud (which involves on-premises infrastructure).

Motivations:

  • Avoid vendor lock-in: Don't depend on a single provider's pricing, features, or availability
  • Best-of-breed services: Use each provider's strengths (e.g., AWS for breadth, GCP for data analytics, Azure for Microsoft integration)
  • Regulatory compliance: Meet data residency requirements by distributing workloads geographically
  • Resilience: Reduce risk of provider-wide outages
  • Negotiating leverage: Credible threat of switching providers can improve pricing

Reality check: Multi-cloud sounds appealing but introduces enormous complexity. Each provider has different APIs, security models, networking paradigms, and operational tools. Your team must maintain expertise across multiple platforms. Portability also comes at a cost - staying portable means forgoing provider-specific managed services, which are often the cloud's biggest draw.

When multi-cloud makes sense:

  • Large enterprises with regulatory requirements for geographic distribution
  • Organizations with sufficient scale to justify dedicated teams per provider
  • Applications already built with high portability (Kubernetes-native, for example)
  • Strategic acquisitions that bring different cloud footprints

When to avoid multi-cloud:

  • Small to medium teams - the operational overhead outweighs benefits
  • Applications that benefit from deep integration with managed services
  • Startups optimizing for speed over resilience

Most organizations should default to a single primary cloud provider and use others selectively for specific capabilities (e.g., CDN, video encoding). True multi-cloud with workload distribution is only practical for large, mature engineering organizations.


Shared Responsibility Model

The shared responsibility model defines which security and operational tasks are the provider's responsibility and which are yours. This model applies to all cloud services but varies by service type.

Provider Responsibilities

Cloud providers are responsible for security OF the cloud - the physical infrastructure, hardware, networking, and foundational software.

What providers manage:

  • Physical security: Data center access controls, surveillance, hardware disposal
  • Infrastructure maintenance: Hardware replacement, network capacity, power and cooling
  • Platform security: Hypervisor security, host OS patching, infrastructure-level vulnerability management
  • Compliance certifications: SOC 2, ISO 27001, PCI-DSS, HIPAA (for infrastructure)

For higher-level services (PaaS, SaaS), providers take on additional responsibilities like runtime patching, database backups (if managed), and service availability.

Customer Responsibilities

Customers are responsible for security IN the cloud - everything you deploy, configure, and manage on top of the provider's infrastructure.

What you must manage:

  • Data protection: Encrypting sensitive data at rest and in transit, managing encryption keys, implementing backup strategies (see Data Protection)
  • Access control: Managing user identities, implementing least privilege access, configuring IAM policies (see Authorization)
  • Application security: Securing application code, patching vulnerabilities in dependencies, input validation (see Input Validation)
  • Network configuration: Configuring security groups, network ACLs, VPCs, and firewalls correctly
  • Operating system (for IaaS): Patching OS vulnerabilities, configuring host firewalls, managing system users
  • Compliance: Ensuring your use of cloud services meets regulatory requirements

Shared Responsibilities

Some areas are jointly managed, where both provider and customer have responsibilities.

Examples:

  • Encryption: Provider offers encryption capabilities (KMS, encryption algorithms); you decide what to encrypt and manage keys
  • Identity management: Provider supplies IAM infrastructure; you configure policies, roles, and access controls
  • Network security: Provider offers DDoS protection at infrastructure level; you configure application-level rate limiting and WAF rules (see Rate Limiting)
  • Patch management: Provider patches infrastructure; you patch guest OS (IaaS) and application dependencies
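The identity-management split above hinges on how access policies are evaluated: the provider supplies the evaluation engine, but the policies are yours. The sketch below is a deliberately simplified model of IAM-style evaluation (explicit deny overrides any allow; everything else is implicitly denied) - real IAM adds conditions, resource wildcards, and much more:

```python
# Simplified, illustrative model of IAM-style policy evaluation:
# an explicit Deny always wins, a matching Allow grants access,
# and anything unmatched is implicitly denied.

def evaluate(policies, action, resource):
    decision = "implicit_deny"
    for p in policies:
        if action in p["actions"] and resource in p["resources"]:
            if p["effect"] == "Deny":
                return "deny"      # explicit deny overrides everything
            decision = "allow"     # remember a matching allow, keep scanning
    return decision

policies = [
    {"effect": "Allow", "actions": ["s3:GetObject"], "resources": ["reports-bucket"]},
    {"effect": "Deny",  "actions": ["s3:GetObject"], "resources": ["secrets-bucket"]},
]

print(evaluate(policies, "s3:GetObject", "reports-bucket"))  # allow
print(evaluate(policies, "s3:GetObject", "secrets-bucket"))  # deny
print(evaluate(policies, "s3:PutObject", "reports-bucket"))  # implicit_deny
```

The provider guarantees this logic runs correctly; whether the policies grant least privilege is entirely your responsibility.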

Why This Matters

Misunderstanding the shared responsibility model leads to security breaches. A common mistake is assuming the cloud provider secures your data automatically - they secure the infrastructure, but you must configure security controls correctly.

Real-world example: An S3 bucket (object storage) is physically secure and highly durable. But if you configure it as publicly accessible, anyone can download your data. The provider secured the storage infrastructure; you failed to configure access controls. This is your responsibility.

For detailed security implementation guidance, see our Security Overview.


Cloud-Native Principles

Cloud-native isn't just about running workloads in the cloud - it's about designing applications to leverage cloud capabilities like auto-scaling, distributed systems, and managed services.

The Twelve-Factor App

The Twelve-Factor App methodology defines best practices for building cloud-native applications. These principles ensure portability, scalability, and maintainability.

Key factors:

  1. Codebase: One codebase tracked in version control, many deployments (see Git Workflow)
  2. Dependencies: Explicitly declare and isolate dependencies (Maven, npm, Gradle)
  3. Config: Store configuration in environment variables, not code (see Secrets Management)
  4. Backing services: Treat databases, queues, and caches as attached resources accessible via URLs
  5. Build, release, run: Strictly separate build, release, and run stages (see Pipelines)
  6. Processes: Execute the app as stateless processes; store state in backing services
  7. Port binding: Export services via port binding (HTTP server embedded in app, not external web server)
  8. Concurrency: Scale out via the process model (horizontal scaling with multiple instances)
  9. Disposability: Fast startup and graceful shutdown for robustness and elasticity
  10. Dev/prod parity: Keep development, staging, and production as similar as possible
  11. Logs: Treat logs as event streams written to stdout (see Logging)
  12. Admin processes: Run admin/management tasks as one-off processes

Why these matter: These principles enable applications to scale horizontally, deploy rapidly, and run reliably in cloud environments. For example, storing state inside application processes (rather than in backing services) prevents horizontal scaling, while embedding configuration in code forces a redeployment for every config change.
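Factors 3 and 4 can be sketched in a few lines: configuration lives in the environment, and backing services are plain URLs the app attaches to. The variable names below are common conventions, not a standard:

```python
import os

# Twelve-factor config: everything that varies between deployments comes
# from environment variables, with safe local-development defaults.
# Backing services (factor 4) are just attached resources behind URLs.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost:5432/dev")
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

# Pointing the same build at production is an environment change,
# not a code change, e.g.:
#   DATABASE_URL=postgres://prod-db.internal:5432/app
```

Because the code never names a specific database host, the identical artifact runs unchanged in development, staging, and production.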

Stateless Applications

Cloud-native applications should be stateless - any instance can handle any request, and no data is stored locally that can't be lost.

Stateless design patterns:

  • Store session data in distributed caches (Redis, Memcached) rather than in-memory (see Caching)
  • Use database transactions for state changes rather than multi-step in-memory operations
  • Design for process crashes - every operation should be idempotent or transactional
  • Avoid local file storage; use object storage (S3) or shared file systems for persistent data (see File Storage)

Benefits:

  • Horizontal scaling: Add instances without coordination or data migration
  • Resilience: Instance failures don't lose data or break user sessions
  • Zero-downtime deployments: Rolling updates work seamlessly when instances are interchangeable
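The idempotency pattern from the list above can be sketched with an idempotency key. Here a plain dict stands in for a distributed cache like Redis; in production the store, TTLs, and atomic set-if-absent semantics would come from the backing service:

```python
# Sketch of an idempotent operation keyed by an idempotency key, so a
# retry after a crash or timeout cannot repeat the side effect.

processed = {}  # idempotency_key -> result (stand-in for Redis)

def charge_customer(idempotency_key, amount):
    # If this key was already handled, return the stored result
    # instead of performing the side effect again.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"charged": amount, "status": "ok"}  # the real side effect
    processed[idempotency_key] = result
    return result

first = charge_customer("order-42", 100)
retry = charge_customer("order-42", 100)  # client retried after a timeout
assert first is retry                     # no double charge
```

Any instance can serve the retry, because the deduplication state lives in the shared store, not in the process that handled the first attempt.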

Microservices Architecture

Cloud-native applications are often built as microservices - small, independently deployable services that communicate via APIs. This architecture enables:

  • Independent scaling: Scale services based on their specific load patterns
  • Technology diversity: Use different languages/frameworks per service based on requirements
  • Fault isolation: Failures in one service don't cascade to others
  • Team autonomy: Small teams own specific services end-to-end

However, microservices introduce complexity around distributed systems, network reliability, and operational overhead. Start with a well-structured monolith and decompose as scale or organizational needs demand. See Microservices for detailed patterns.

Containerization and Orchestration

Containers package applications with their dependencies into portable, immutable artifacts. Orchestration platforms (Kubernetes, ECS, Cloud Run) manage container lifecycle, scaling, and networking.

Why containers matter in the cloud:

  • Consistency: Same container runs identically in dev, staging, and production
  • Density: Run many containers per host, maximizing resource utilization
  • Speed: Fast startup times (seconds) enable rapid scaling and deployments
  • Portability: Containers run on any cloud provider or on-premises

See Docker for container best practices and Kubernetes for orchestration patterns.


Global Infrastructure Concepts

Cloud providers operate globally distributed infrastructure to enable low-latency access for users worldwide and high availability through redundancy.

Regions

A region is a geographic area containing multiple isolated data centers. Each region is completely independent from other regions.

Characteristics:

  • Geographic isolation: Regions are separated by hundreds of miles to protect against regional disasters
  • Independent infrastructure: Separate power grids, network connectivity, and operational teams
  • Compliance boundaries: Data in a region typically stays in that region (data sovereignty)
  • Latency zones: Choose regions close to users for low latency

When to use multiple regions:

  • Global user base: Deploy close to users in multiple geographies
  • Disaster recovery: Replicate data to another region for business continuity (see Resilience)
  • Compliance: Keep data in specific countries/regions due to regulatory requirements
  • High availability: Protect against region-wide outages (rare but possible)

Trade-offs: Multi-region architectures add complexity - data replication, cross-region networking costs, eventual consistency challenges, and operational overhead. Only adopt multi-region when benefits justify these costs.

Availability Zones (AZs)

Availability zones are isolated data centers within a region. They provide high availability without the complexity of multi-region deployments.

Characteristics:

  • Physical isolation: Separate buildings, power, cooling, and network connectivity
  • Low-latency connectivity: Sub-millisecond latency between AZs in the same region
  • Independent failure domains: Designed so that AZ failures don't cascade to other AZs
  • Synchronous replication: Fast enough for synchronous database replication (unlike regions)

Best practices:

  • Distribute resources: Run instances, containers, and databases across multiple AZs
  • Design for AZ failure: Assume any AZ can fail; ensure your application continues running
  • Load balancing: Use load balancers to distribute traffic across AZs automatically

Example architecture: A typical high-availability setup runs three application instances (one per AZ) behind a load balancer, with a database configured for multi-AZ failover. If one AZ fails, the load balancer routes traffic to remaining AZs, and the database fails over automatically.

For most applications, deploying across multiple AZs within a single region provides sufficient availability and disaster recovery without multi-region complexity.
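The failover behavior in that example architecture can be modeled in a few lines: a load balancer routes only to instances in healthy AZs, so losing a zone simply shrinks the target pool. The AZ names and addresses below are illustrative:

```python
import itertools

# Toy model of three instances (one per AZ) behind a load balancer
# that round-robins across healthy zones only.

instances = {"az-a": "10.0.1.10", "az-b": "10.0.2.10", "az-c": "10.0.3.10"}
healthy = {"az-a", "az-b", "az-c"}

def route(request_count):
    # Health checks remove failed AZs from the target pool.
    targets = [ip for az, ip in sorted(instances.items()) if az in healthy]
    rr = itertools.cycle(targets)
    return [next(rr) for _ in range(request_count)]

print(route(3))          # traffic spread across all three AZs
healthy.discard("az-b")  # simulate an AZ failure detected by health checks
print(route(2))          # remaining AZs absorb the traffic
```

Real load balancers add health-check intervals, connection draining, and weighting, but the core idea is the same: routing decisions follow zone health automatically.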

Edge Locations

Edge locations are points of presence (PoPs) for content delivery networks (CDNs). They cache static content close to users for fast access.

Use cases:

  • Static asset delivery: Images, CSS, JavaScript, videos
  • API acceleration: Route API requests through optimized network paths
  • DDoS protection: Absorb malicious traffic at the edge before it reaches your infrastructure

CDNs are essential for global web applications, reducing latency and bandwidth costs. See Performance Optimization for caching strategies.


Cloud Cost Models

Cloud pricing is fundamentally different from traditional capital expenditure. Understanding cost models is essential for financial planning and optimization.

Pay-As-You-Go

The default cloud pricing model charges for actual resource usage (compute hours, storage GB, network transfer).

Advantages:

  • No upfront costs: Start small and scale as needed
  • Elasticity: Pay more during high-traffic periods, less during quiet periods
  • Experimentation: Try new technologies without capital approval
  • Granular billing: Understand costs per project, team, or customer

Challenges:

  • Unpredictable costs: Traffic spikes can cause unexpected bills
  • Waste: Unused resources (forgotten instances, over-provisioned capacity) still incur costs
  • Complexity: Hundreds of services with different pricing dimensions

Best practices:

  • Implement cost alerts and budgets to catch unexpected increases
  • Tag resources by project, environment, and team for cost allocation
  • Review and shut down unused resources regularly
  • Use auto-scaling to match capacity to demand
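The first best practice - cost alerts - amounts to projecting month-to-date spend to a run rate and comparing it against a budget threshold. A minimal sketch, with illustrative numbers (real providers expose this through their billing and budget APIs):

```python
# Project month-to-date spend to a monthly run rate and alert when the
# projection crosses a fraction of the budget.

def projected_monthly_spend(spend_to_date, day_of_month, days_in_month=30):
    return spend_to_date / day_of_month * days_in_month

def check_budget(spend_to_date, day_of_month, budget, alert_at=0.8):
    projected = projected_monthly_spend(spend_to_date, day_of_month)
    if projected >= budget * alert_at:
        return f"ALERT: projected ${projected:.0f} vs ${budget} budget"
    return "ok"

print(check_budget(spend_to_date=600, day_of_month=10, budget=2000))
# $600 by day 10 projects to $1800/month, past the $1600 alert threshold
```

Alerting on the projection rather than the raw total catches runaway spend early in the month, before the bill arrives.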

Reserved Capacity

For predictable workloads, reserved capacity (reserved instances, committed use discounts) offers significant discounts (30-70%) in exchange for commitments of typically one to three years.

When to use reserved capacity:

  • Steady-state workloads (databases, baseline compute capacity)
  • Long-term projects with stable resource needs
  • Cost optimization after initial deployment proves resource requirements

Risks:

  • Commitment inflexibility - you pay whether you use the capacity or not
  • Technology changes may make reservations obsolete (e.g., moving to serverless)
  • Organizational changes (project cancellations, team restructures) can strand reservations

Strategy: Start with on-demand pricing to establish baseline usage, then reserve capacity for the stable baseline while using on-demand or spot for variable workloads.
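The reserve-the-baseline strategy comes down to a break-even calculation: a reservation is cheaper only above a certain utilization. The hourly rates below are illustrative, not real pricing:

```python
# Break-even sketch for a reservation decision: at a 40% discount, the
# reservation wins only when utilization exceeds 0.06/0.10 = 60%.

on_demand_hourly = 0.10
reserved_hourly = 0.06       # 40% discount, paid whether used or not
hours_per_month = 730

def monthly_cost(utilization):
    """utilization: fraction of the month's hours the capacity is actually used."""
    on_demand = on_demand_hourly * hours_per_month * utilization
    reserved = reserved_hourly * hours_per_month  # flat, regardless of use
    return on_demand, reserved

od, res = monthly_cost(1.0)  # steady-state baseline: reservation wins
print(f"100% utilization: on-demand ${od:.2f}, reserved ${res:.2f}")
od, res = monthly_cost(0.4)  # bursty workload: on-demand wins
print(f"40% utilization:  on-demand ${od:.2f}, reserved ${res:.2f}")
```

This is why the strategy above reserves only the measured stable baseline: the variable portion of the workload sits below break-even utilization and belongs on on-demand or spot pricing.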

Spot / Preemptible Instances

Spot instances offer steep discounts (up to 90%) for workloads that tolerate interruptions. Providers reclaim capacity when needed, terminating your instances with short notice (typically 30-120 seconds).

Ideal for:

  • Batch processing and data analysis
  • Fault-tolerant distributed systems (Spark, Hadoop)
  • CI/CD build workers (see Pipelines)
  • Development and test environments

Not suitable for:

  • Stateful applications without checkpointing
  • Latency-sensitive interactive workloads
  • Databases or primary application servers

Implementation patterns:

  • Checkpoint progress regularly so restarted work can resume
  • Use spot for worker nodes with on-demand for critical components
  • Implement graceful shutdown handlers to save state on termination notice
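The checkpointing pattern above can be sketched as a batch loop that records progress after every unit of work, so a replacement instance resumes rather than restarts. Here an in-memory dict stands in for durable storage (object storage or a database), and the interruption is simulated:

```python
# Checkpointed batch processing that survives spot interruption.

checkpoint_store = {}  # job_id -> next index (stand-in for durable storage)

def process(items, job_id, interrupt_after=None):
    start = checkpoint_store.get(job_id, 0)  # resume from the last checkpoint
    for i in range(start, len(items)):
        if interrupt_after is not None and i == interrupt_after:
            return "interrupted"             # termination notice received
        _result = items[i] * 2               # the actual unit of work
        checkpoint_store[job_id] = i + 1     # checkpoint after each item
    return "done"

items = list(range(10))
print(process(items, "job-1", interrupt_after=4))  # spot instance reclaimed
print(process(items, "job-1"))                     # replacement resumes at item 4
```

In a real system the shutdown handler would catch the provider's termination notice (a signal or metadata endpoint, depending on the platform) and flush the final checkpoint before exit.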

Data Transfer Costs

Data transfer between cloud regions, out to the internet, and sometimes between services within a region incurs charges. This is often overlooked but can be significant.

Cost optimization strategies:

  • Use CDNs to cache content near users, reducing origin traffic
  • Keep communicating services in the same region/AZ when possible
  • Use VPC endpoints or private networking to avoid internet transfer charges
  • Compress data before transfer
  • Batch operations to reduce API call frequency
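Compression before transfer is often the cheapest win on the list above. For text-like payloads (JSON, logs, CSV) the reduction can be dramatic; the highly repetitive payload below is a best-case illustration, and real ratios depend on the data:

```python
import zlib

# Compress a text-like payload before sending it over the network.
# Repetitive structured text (here, repeated JSON log lines) compresses
# extremely well; already-compressed data (images, video) will not.

payload = b'{"level":"INFO","msg":"request handled","status":200}\n' * 1000
compressed = zlib.compress(payload)

print(f"raw: {len(payload)} bytes, compressed: {len(compressed)} bytes")
assert len(compressed) < len(payload) // 10  # >10x smaller for this payload
```

Since data transfer is billed per byte, a 10x reduction on egress-heavy paths translates directly into a 10x reduction on that line item - at the cost of a little CPU on both ends.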

For detailed cost optimization strategies, see provider-specific cost documentation.


Cloud Migration Strategies

Moving existing applications to the cloud requires a strategy that balances speed, risk, and long-term benefits. The "6 Rs" framework helps categorize migration approaches.

Rehost (Lift and Shift)

Move applications to the cloud with minimal changes - copy virtual machine images or re-install software on cloud VMs.

Advantages:

  • Fastest migration path
  • Lowest risk - application behavior doesn't change
  • Immediate infrastructure benefits (scalability, managed hardware)
  • Can optimize later after migration

Disadvantages:

  • Doesn't leverage cloud-native capabilities (auto-scaling, managed services)
  • Carries technical debt from legacy architecture
  • May cost more than on-premises without optimization

When to use:

  • Time-constrained data center exit
  • Applications with unclear documentation or expertise
  • Stable applications unlikely to need significant changes
  • First step in a phased modernization plan

Replatform (Lift and Reshape)

Migrate applications with minor optimizations to take advantage of cloud services - replace self-managed databases with managed services, use load balancers, implement auto-scaling.

Advantages:

  • Moderate effort with significant operational benefits
  • Reduce operational overhead (managed databases, patching)
  • Improve availability and scalability
  • Keep application logic mostly unchanged

Disadvantages:

  • Requires configuration changes and testing
  • May expose architectural issues (tight coupling, stateful designs)
  • Doesn't fully leverage cloud-native patterns

When to use:

  • Applications with clear upgrade paths (self-managed PostgreSQL → managed RDS)
  • Opportunities to eliminate undifferentiated heavy lifting (database administration, load balancer management)
  • Moderate risk tolerance with medium timeline

Example: Migrating a Java application from on-premises Tomcat and PostgreSQL to cloud VMs with managed PostgreSQL database service and application load balancer. The application code is mostly unchanged, but you've eliminated database administration and gained automatic failover.

Refactor (Re-architect)

Redesign applications to be cloud-native - adopt microservices, containerization, serverless, managed services, and modern development practices.

Advantages:

  • Maximize cloud benefits (elasticity, cost optimization, resilience)
  • Improve development velocity with modern tooling and practices (see Spring Boot, React)
  • Enable continuous delivery and frequent deployments (see Pipelines)
  • Better alignment with business needs through agility

Disadvantages:

  • Highest cost and timeline
  • Significant technical risk requiring deep expertise
  • Requires rewriting code and changing architecture
  • Team must learn new patterns and technologies

When to use:

  • Legacy applications blocking business agility
  • Significant technical debt prevents further development
  • Business drivers justify investment (scaling, new features, cost optimization)
  • Opportunity to modernize technology stack

Example: Decomposing a monolithic Java EE application into Spring Boot microservices running on Kubernetes, with PostgreSQL replaced by managed databases, REST APIs for inter-service communication (see API Design), and event-driven patterns for async workflows (see Event-Driven Architecture).

Repurchase (Replace with SaaS)

Replace custom-built applications with commercial SaaS products - CRM, HR systems, authentication, payment processing.

Advantages:

  • Eliminate custom code maintenance
  • Get enterprise features (compliance, SSO, audit logs) out-of-the-box
  • Faster time to value
  • Predictable subscription costs

Disadvantages:

  • Vendor lock-in and dependency
  • Limited customization
  • Data migration from existing systems
  • Potential loss of competitive differentiation

When to use:

  • Non-differentiating capabilities (email, CRM, HR)
  • Business pressures to reduce IT headcount
  • Existing system is outdated with no upgrade path
  • Compliance or security features are needed quickly

Retain (Keep On-Premises)

Decide explicitly not to migrate certain applications - keep them on-premises or in existing hosting.

When to retain:

  • Regulatory constraints prevent cloud usage
  • Application has short remaining lifespan (< 1 year)
  • Migration risks or costs outweigh benefits
  • Application requires hardware dependencies not available in cloud (specialized equipment)

Retire (Decommission)

Shut down applications that are no longer needed.

Benefits:

  • Reduce operational costs and complexity
  • Improve security by eliminating old, unpatched systems
  • Simplify infrastructure and focus resources

Cloud migration projects often reveal unused or redundant applications. Decommissioning them before migration saves money and effort.


Multi-Cloud vs Single-Cloud Strategy

Choosing between single-cloud and multi-cloud is a strategic decision with profound implications for architecture, operations, and cost.

Single-Cloud Strategy

Commit primarily to one cloud provider, using their services deeply and taking advantage of managed offerings.

Advantages:

  • Deep integration: Use provider-specific managed services (databases, queues, AI/ML, analytics) without compatibility layers
  • Operational simplicity: One set of APIs, tools, security models, and billing
  • Lower cost: Avoid duplication and abstraction overhead; qualify for volume discounts
  • Team expertise: Build deep knowledge of one platform rather than surface knowledge of many

Disadvantages:

  • Vendor lock-in: Switching providers becomes expensive due to proprietary service dependencies
  • Price sensitivity: Limited negotiating leverage with a single vendor
  • Provider outage risk: Complete dependence on one provider's availability

When to choose single-cloud:

  • Small to medium engineering teams
  • Applications benefit from managed service integration
  • Speed and simplicity are priorities
  • Cost optimization matters more than theoretical portability

Multi-Cloud Strategy

Distribute workloads across multiple cloud providers to avoid dependency on any single vendor.

Advantages:

  • Avoid vendor lock-in: Credible exit option provides negotiating leverage
  • Best-of-breed services: Use each provider's strengths selectively
  • Geographic compliance: Meet data residency requirements through provider choice
  • Resilience: Mitigate risk of provider-wide outages

Disadvantages:

  • Operational complexity: Multiple APIs, tools, security models, billing systems
  • Higher costs: Duplication, abstraction layers, cross-cloud networking, multiple teams
  • Limited managed service usage: Portability requires using lowest-common-denominator services (VMs, containers, object storage)
  • Team expertise dilution: Engineers must know multiple platforms

When to choose multi-cloud:

  • Large enterprises with dedicated platform teams
  • Regulatory requirements mandate geographic distribution across providers
  • Existing organizational complexity (acquisitions, divisions with different providers)
  • Strategic priority to avoid vendor dependency justifies cost

Practical Middle Ground

Most organizations should start with a primary cloud provider and be strategic about multi-cloud.

Pragmatic multi-cloud patterns:

  • Primary cloud + CDN: Use one provider for compute/data, another for content delivery (CDN providers are largely interchangeable)
  • Primary cloud + specialty services: Use one provider for infrastructure, another for specific capabilities (AI/ML, video processing) not easily replicated
  • Kubernetes for portability: Use Kubernetes as an abstraction layer, but accept that storage, networking, and managed services are still provider-specific

Avoid:

  • Active-active workloads distributed across providers for "resilience" - the complexity and cost rarely justify the marginal availability improvement
  • Building custom abstraction layers to make code portable - you'll spend more on abstraction than you'd save by switching

When NOT to Use Cloud

Cloud computing is powerful but not universally appropriate. Understand when on-premises or hybrid solutions are better.

Regulatory and Data Sovereignty Constraints

Some regulations require data to remain in specific jurisdictions or prohibit third-party processing.

Examples:

  • Financial regulations requiring on-premises processing for certain transaction types
  • Healthcare data laws (HIPAA) requiring specific controls not easily demonstrated in shared infrastructure
  • Government contracts mandating on-premises or government-only cloud environments

Options:

  • Use public cloud regions within required jurisdictions
  • Implement hybrid cloud with sensitive data on-premises
  • Use provider compliance certifications (FedRAMP, HIPAA-eligible services) if acceptable

Cost at Scale

For sustained, predictable workloads at very large scale, owning hardware can be cheaper than renting compute capacity.

Break-even considerations:

  • Cloud economics favor variable workloads - idle capacity is waste in owned data centers
  • Capital costs, facilities management, and operational staff are significant
  • Cloud providers' economies of scale often outweigh your ability to buy cheaper hardware

Reality: Few organizations have sufficient scale, expertise, and consistent workloads to justify building data centers. If you're not running tens of thousands of servers with high utilization, cloud is likely cheaper.
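The utilization argument above can be made concrete with arithmetic. This sketch compares cost per *utilized* server-hour; all figures are illustrative assumptions, not real prices.

```python
def cost_per_utilized_hour(hourly_cost, utilization):
    """Idle capacity is waste: dividing by utilization spreads the
    fixed hourly cost over the hours doing useful work."""
    return hourly_cost / utilization

# Assumed owned server: capital, facilities, and staff amortized to
# $0.06/hour, but only 30% utilized on average.
owned = cost_per_utilized_hour(0.06, 0.30)

# Assumed cloud VM: $0.10/hour on-demand, but scaled to demand,
# so roughly 90% utilized.
cloud = cost_per_utilized_hour(0.10, 0.90)

print(f"owned: ${owned:.3f}/useful hour, cloud: ${cloud:.3f}/useful hour")
```

Under these assumptions the nominally cheaper owned hardware costs about $0.200 per useful hour versus about $0.111 in the cloud, which is why break-even analysis hinges on sustained high utilization rather than on sticker price per hour.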

Specialized Hardware Requirements

Applications requiring custom hardware (GPUs for specific workloads, FPGAs, exotic storage arrays) may not have cloud equivalents.

Options:

  • Check for cloud provider specialty instances (GPU instances, high-memory, high-storage)
  • Use hybrid approach - specialized hardware on-premises, standard workloads in cloud
  • Evaluate whether requirements are truly specialized or based on legacy assumptions

Latency-Sensitive Applications

If you need single-digit millisecond latency to on-premises systems or hardware, cloud networking delays may be unacceptable.

Examples:

  • High-frequency trading systems colocated with exchanges
  • Industrial control systems requiring real-time response to sensors
  • Legacy applications with tight coupling to on-premises systems during migration

Options:

  • Use cloud edge computing services to get closer to data sources
  • Implement hybrid connectivity with dedicated links (Direct Connect, ExpressRoute)
  • Re-architect to tolerate higher latency or use async communication patterns

Cloud vs On-Premises Decision Framework

Use this framework to evaluate whether cloud, on-premises, or hybrid is appropriate for specific workloads.

Decision criteria:

  1. Compliance and regulation: Can data and processing occur in public cloud?
  2. Workload characteristics: Variable traffic benefits from cloud elasticity; steady traffic may be cheaper on-premises at scale
  3. Organizational capability: Do you have expertise to build and operate data centers?
  4. Speed and agility: Cloud enables faster development and deployment
  5. Cost sensitivity: Analyze total cost of ownership (TCO) including staff, facilities, and opportunity cost

For most modern applications, cloud is the right default choice. Deviate only when specific constraints justify the complexity of on-premises infrastructure.
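One way to make the framework above operational is a simple weighted score. The criteria, weights, and threshold below are assumptions chosen to make the trade-off concrete, not a prescriptive formula.

```python
def recommend(workload, weights, threshold=0.5):
    """Each criterion scores 0.0 (favors on-premises) to 1.0 (favors
    cloud); returns 'cloud' when the weighted average clears the
    threshold."""
    total = sum(weights.values())
    score = sum(workload[k] * w for k, w in weights.items()) / total
    return ("cloud" if score >= threshold else "on-premises"), round(score, 2)

# Hypothetical weights: compliance dominates, cost matters least.
weights = {"compliance": 3, "elasticity": 2, "team_capability": 2,
           "agility": 2, "cost": 1}

# Variable-traffic web app with no regulatory blockers:
web_app = {"compliance": 1.0, "elasticity": 1.0, "team_capability": 0.8,
           "agility": 1.0, "cost": 0.7}
print(recommend(web_app, weights))  # → ('cloud', 0.93)
```

A hard regulatory constraint (compliance score of 0) should in practice veto the recommendation outright rather than merely lower the average; the sketch keeps the math simple.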


Anti-Patterns and Common Mistakes

Lift and Shift Without Optimization

Problem: Migrating applications to cloud VMs without leveraging cloud capabilities results in high costs and operational overhead without benefits.

Solution: Treat migration as an opportunity to modernize - adopt managed services, auto-scaling, and cloud-native patterns even if not a complete refactor.

Over-Engineering for Portability

Problem: Building custom abstraction layers to avoid "vendor lock-in" adds complexity and cost without commensurate benefit. You spend more on abstraction than you'd save by switching.

Solution: Use cloud-native services that provide value. Design for portability only if you have a specific, near-term plan to change providers. Use open standards (Kubernetes, OpenAPI) where practical, but don't avoid managed services entirely.

Ignoring Cost Management

Problem: Treating cloud as unlimited resources leads to cost overruns - unused resources, over-provisioned capacity, inefficient architectures.

Solution: Implement cost allocation tags, budgets, and alerts from day one. Review spending regularly, shut down unused resources, and right-size instances. Foster a cost-conscious culture.
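The tagging-and-budget discipline described above can be sketched as a periodic audit. The resource records and thresholds here are hypothetical; a real implementation would pull this data from the provider's billing API.

```python
def audit_costs(resources, monthly_budget):
    """Flag resources missing a cost allocation tag and check whether
    projected monthly spend (daily cost * 30) exceeds the budget."""
    untagged = [r["id"] for r in resources if not r.get("team")]
    projected = sum(r["daily_cost"] for r in resources) * 30
    return {"untagged": untagged,
            "projected": round(projected, 2),
            "over_budget": projected > monthly_budget}

resources = [
    {"id": "vm-1", "team": "payments", "daily_cost": 12.0},
    {"id": "vm-2", "team": None, "daily_cost": 30.0},  # missing tag
]
print(audit_costs(resources, monthly_budget=1000))
# → {'untagged': ['vm-2'], 'projected': 1260.0, 'over_budget': True}
```

Running a check like this on a schedule, and alerting on its output, turns cost management from a quarterly surprise into a routine signal.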

Multi-Cloud for the Wrong Reasons

Problem: Adopting multi-cloud for theoretical resilience or to "avoid lock-in" without considering operational complexity.

Solution: Choose multi-cloud only when benefits (geographic compliance, provider outage risk mitigation at massive scale) clearly outweigh the substantial costs and complexity.

Neglecting Security Shared Responsibility

Problem: Assuming the cloud provider secures your data and applications automatically. Misconfigured access controls, public storage buckets, and unencrypted data are common.

Solution: Understand the shared responsibility model. Implement least privilege access, encrypt sensitive data, configure network security correctly, and audit configurations regularly (see Security Overview).

Single-AZ Deployments

Problem: Running applications in a single availability zone to save costs or complexity, eliminating resilience against AZ failures.

Solution: Deploy across multiple AZs within a region for high availability. The marginal cost and complexity are small compared to downtime risk.
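The single-AZ anti-pattern is easy to detect automatically. This toy check runs over a hypothetical instance inventory and flags any service whose instances all sit in one availability zone.

```python
from collections import defaultdict

def single_az_services(instances):
    """Return services whose instances span fewer than two AZs."""
    zones = defaultdict(set)
    for inst in instances:
        zones[inst["service"]].add(inst["az"])
    return sorted(svc for svc, azs in zones.items() if len(azs) < 2)

instances = [
    {"service": "api", "az": "us-east-1a"},
    {"service": "api", "az": "us-east-1b"},
    {"service": "worker", "az": "us-east-1a"},
    {"service": "worker", "az": "us-east-1a"},  # same AZ: no resilience
]
print(single_az_services(instances))  # → ['worker']
```

In practice the inventory would come from the provider's instance-listing API, and the check would run as part of a compliance or drift-detection job.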


Further Reading

Books and Articles

  • "Cloud Native Patterns" by Cornelia Davis: Design patterns for resilient cloud applications
  • "The Practice of Cloud System Administration" by Limoncelli et al.: Operational best practices for cloud infrastructure
  • AWS Well-Architected Framework / Azure Well-Architected Framework / Google Cloud Architecture Framework: Provider-specific best practices

External Resources