Cloud Computing Fundamentals
Overview
Cloud computing represents a fundamental shift from traditional on-premises infrastructure to on-demand, scalable computing resources delivered over the internet. This guide covers platform-agnostic cloud principles that apply regardless of whether you're using AWS, Azure, Google Cloud, or other providers. Understanding these concepts is essential before diving into provider-specific implementations.
Cloud computing is not just about moving servers to someone else's data center - it's about fundamentally rethinking how we build, deploy, and scale software systems. The cloud enables new architectural patterns, development workflows, and operational models that weren't practical or possible with traditional infrastructure.
Core Principles
- On-demand self-service: Provision resources programmatically without human intervention, enabling automation and rapid iteration
- Broad network access: Access resources from anywhere using standard protocols, supporting distributed teams and global users
- Resource pooling: Share physical infrastructure efficiently through virtualization while maintaining isolation and security boundaries
- Rapid elasticity: Scale resources up or down automatically based on demand, paying only for what you use
- Measured service: Monitor and meter resource usage transparently, enabling cost optimization and chargeback models
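The measured-service principle above boils down to pay-per-use metering. A minimal sketch (the resource names and unit prices are made-up examples, not any provider's actual rates):

```python
from dataclasses import dataclass, field

@dataclass
class UsageMeter:
    """Toy metered-service model: record consumption, bill per unit."""
    rates: dict                       # unit price per resource type
    usage: dict = field(default_factory=dict)

    def record(self, resource: str, amount: float) -> None:
        self.usage[resource] = self.usage.get(resource, 0.0) + amount

    def bill(self) -> float:
        # Total cost = sum of (units consumed * unit price) per resource.
        return round(sum(self.rates[r] * amt for r, amt in self.usage.items()), 2)

meter = UsageMeter(rates={"vm_hours": 0.05, "storage_gb_month": 0.02})
meter.record("vm_hours", 100)         # 100 instance-hours
meter.record("storage_gb_month", 50)  # 50 GB stored for a month
print(meter.bill())  # 100*0.05 + 50*0.02 = 6.0
```

Real providers meter dozens of dimensions per service (compute time, requests, egress bytes), but the chargeback logic follows this shape.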
Cloud Service Models
Cloud services are commonly grouped into three classic models - IaaS, PaaS, and SaaS - plus the newer FaaS/serverless model. Each provides a different level of abstraction and control, and understanding when to use each is crucial for making effective architectural decisions.
Infrastructure as a Service (IaaS)
IaaS provides virtualized computing resources over the internet. You get raw compute, storage, and networking primitives that you configure and manage yourself.
What you get:
- Virtual machines (EC2, Compute Engine, Azure VMs)
- Block storage (EBS, Persistent Disks)
- Virtual networks (VPC, VNet)
- Load balancers
What you manage:
- Operating system installation, patching, and security
- Runtime environment (Java, Node.js, Python)
- Application deployment and configuration
- Scaling and availability architecture
- Security groups, firewalls, access controls
When to use IaaS:
- You need full control over the operating system and installed software
- You're migrating legacy applications with specific OS requirements
- You need to run software that requires specific kernel modules or system configurations
- You want maximum flexibility to optimize performance and cost at the infrastructure level
Trade-offs: IaaS gives you the most control but requires the most operational overhead. You're responsible for patching vulnerabilities, managing capacity, and handling failures. This model works well when you have specialized requirements that higher-level abstractions can't accommodate, but it comes with significant operational complexity.
The line between IaaS and PaaS is increasingly blurred - managed Kubernetes (like EKS or GKE) provides some platform capabilities while still giving you control over the underlying containers and orchestration.
Platform as a Service (PaaS)
PaaS abstracts away infrastructure concerns, letting you focus on application code. The platform handles operating systems, runtime environments, middleware, and often scaling.
What you get:
- Managed runtime environments (App Engine, Elastic Beanstalk, Cloud Run)
- Automatic scaling and load balancing
- Integrated monitoring and logging
- Built-in security patching
- Development tools and CI/CD integration
What you manage:
- Application code and dependencies
- Configuration and environment variables
- Data and database schemas
- Application-level security and authentication
When to use PaaS:
- You want to focus on business logic rather than infrastructure
- You're building modern web applications or APIs with standard technology stacks
- You need rapid deployment and iteration cycles
- Your team lacks deep infrastructure expertise
- You want built-in scalability without manual configuration
Trade-offs: PaaS reduces operational burden but limits configuration options. You're constrained by the platform's supported languages, frameworks, and deployment patterns. For many modern applications, these constraints are acceptable and even beneficial - they enforce best practices and prevent configuration drift.
PaaS is particularly powerful when combined with managed databases and message queues. For example, deploying a Spring Boot application to a PaaS with managed PostgreSQL and Redis means you're only responsible for application code - the platform handles everything else.
Software as a Service (SaaS)
SaaS delivers fully managed applications over the internet. Users access software through a web browser or API without managing any underlying infrastructure or application code.
Examples:
- Gmail, Microsoft 365, Salesforce (user-facing applications)
- Auth0, Okta (authentication services)
- SendGrid, Twilio (developer services)
- Stripe, PayPal (payment processing)
What you manage:
- User access and permissions
- Configuration and customization within the application
- Data you input into the system
- Integration with other systems
When to use SaaS:
- You need common functionality that isn't a core differentiator (email, CRM, authentication)
- You want zero operational overhead for specific capabilities
- You need enterprise features (compliance, audit logs, SSO) without building them yourself
- You're optimizing for speed to market over customization
Trade-offs: SaaS offers zero operational burden but maximum vendor lock-in. You're entirely dependent on the vendor's features, pricing, reliability, and roadmap. For non-differentiating capabilities like email or authentication, this trade-off is often worthwhile - building and maintaining these systems yourself is expensive and doesn't add business value.
When evaluating SaaS solutions, consider data portability, API access for integration, compliance certifications, and the vendor's financial stability and track record.
Function as a Service (FaaS) / Serverless
FaaS takes PaaS a step further by eliminating long-running servers entirely. You write functions that execute in response to events, and the platform handles everything else - provisioning, scaling, patching, and shutting down idle resources.
What you get:
- Event-driven function execution (Lambda, Cloud Functions, Azure Functions)
- Automatic scaling from zero to thousands of concurrent executions
- Pay-per-invocation billing (no cost when idle)
- Built-in fault tolerance and availability
- Integrated event sources (HTTP, queues, databases, streams, schedules)
What you manage:
- Function code (typically stateless and short-lived)
- Event configuration and triggers
- Memory allocation and timeout settings
- Environment variables and secrets
When to use FaaS:
- You're building event-driven architectures (see Event-Driven Architecture)
- You have variable or unpredictable traffic patterns
- You want to minimize costs for low-traffic services
- You're building API backends, data processing pipelines, or automation tasks
- You need rapid scaling without capacity planning
Trade-offs: FaaS introduces cold start latency (the delay when a function hasn't been invoked recently and must be initialized). For latency-sensitive applications, this can be problematic. You're also constrained by execution time limits (typically 15 minutes maximum) and statelessness - you can't maintain long-lived connections or in-memory state between invocations.
Despite these constraints, serverless is powerful for many use cases. For example, processing uploaded files, handling webhook callbacks, or scheduled data transformations are all excellent fits. The key is understanding the execution model: functions are short-lived, stateless, and event-driven.
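A function in this execution model looks roughly like the following sketch. The `handler(event, context)` signature mirrors the Lambda-style convention; the event shape here (an API-gateway-like dict with a JSON body) is hypothetical, since real event formats vary by trigger:

```python
import json

def handler(event, context=None):
    """FaaS-style entry point: stateless, short-lived, event in / result out."""
    body = json.loads(event.get("body", "{}"))
    name = body.get("name", "world")
    # No global mutable state and no connections held between invocations:
    # everything the function needs comes from the event or backing services.
    return {"statusCode": 200,
            "body": json.dumps({"greeting": f"hello {name}"})}

# Local invocation for testing -- in production the platform calls handler().
resp = handler({"body": json.dumps({"name": "cloud"})})
print(resp["statusCode"])  # 200
```

Because the function holds no state, the platform can freely run zero or thousands of copies in parallel, which is what enables scale-to-zero billing.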
Serverless beyond FaaS: "Serverless" has evolved beyond just functions. Managed databases (Aurora Serverless), container platforms (Fargate), and API gateways are all "serverless" in that they auto-scale and charge based on usage. The common thread is eliminating capacity planning and paying only for actual consumption.
Cloud Deployment Models
The deployment model defines where your infrastructure runs and who manages it. This decision impacts security, compliance, cost, and operational complexity.
Public Cloud
Public cloud providers operate massive shared infrastructure accessible to any customer over the internet. This is what most people mean when they say "the cloud."
Characteristics:
- Shared physical infrastructure with logical isolation
- Provider manages all hardware, networking, and data centers
- Pay-as-you-go pricing with economies of scale
- Global availability with dozens of regional data centers
- Rapid innovation with new services released constantly
Advantages:
- Cost efficiency: No capital expenditure for hardware; pay only for consumption
- Scale: Effectively unlimited capacity that scales up or down on demand
- Global reach: Deploy close to users worldwide with minimal effort
- Innovation velocity: Access cutting-edge services without building them yourself
- Reliability: Provider-managed redundancy and disaster recovery
Disadvantages:
- Shared infrastructure may not meet certain compliance requirements
- Limited control over physical security and hardware location
- Potential for vendor lock-in (mitigated by using portable abstractions)
- Internet connectivity is a single point of failure
Public cloud is the default choice for most modern applications unless specific constraints require alternatives.
Private Cloud
Private cloud runs on dedicated infrastructure, either in your own data center or in a hosted facility. You get cloud-like APIs and automation but with exclusive access to hardware.
Characteristics:
- Dedicated physical infrastructure
- You control hardware placement, network topology, and security boundaries
- Cloud-like APIs for provisioning and management (OpenStack, VMware, etc.)
- Typically more expensive than public cloud due to capital costs and operational overhead
When to use private cloud:
- Regulatory requirements mandate data cannot leave specific geographic boundaries
- You need control over physical hardware for security or compliance
- You have sustained, predictable workloads where dedicated capacity is cost-effective
- You're migrating from traditional data centers and need time to adopt cloud patterns
Reality check: Building and operating a private cloud is expensive and complex. You need expertise in networking, storage, virtualization, and automation. Many organizations overestimate the benefits and underestimate the costs. Unless you have compelling regulatory or scale reasons, public cloud is usually more practical.
Hybrid Cloud
Hybrid cloud integrates public and private infrastructure, allowing workloads to move between environments based on cost, performance, or compliance requirements.
Characteristics:
- Seamless integration between on-premises and cloud environments
- Workload portability across environments
- Consistent management and security policies
- Hybrid connectivity (VPN, Direct Connect, ExpressRoute)
When to use hybrid cloud:
- You're migrating from on-premises to cloud gradually
- Some data must remain on-premises due to regulation or latency
- You want to burst to public cloud during demand spikes
- You need disaster recovery with on-premises as primary or backup
Common patterns:
- Development in cloud, production on-premises: Test and staging in public cloud for cost efficiency, production on-premises for compliance
- Data processing in cloud, data storage on-premises: Leverage cloud compute for analytics while keeping sensitive data local
- Disaster recovery: Primary workloads on-premises, fail over to cloud in disaster scenarios
Challenges: Hybrid cloud adds significant complexity - network connectivity, identity federation, security policy synchronization, and operational overhead across environments. Ensure the benefits justify this complexity. For more on connectivity patterns, see cloud provider-specific networking documentation.
Multi-Cloud
Multi-cloud means using services from multiple public cloud providers simultaneously. This is distinct from hybrid cloud (which involves on-premises infrastructure).
Motivations:
- Avoid vendor lock-in: Don't depend on a single provider's pricing, features, or availability
- Best-of-breed services: Use each provider's strengths (e.g., AWS for breadth, GCP for data analytics, Azure for Microsoft integration)
- Regulatory compliance: Meet data residency requirements by distributing workloads geographically
- Resilience: Reduce risk of provider-wide outages
- Negotiating leverage: Credible threat of switching providers can improve pricing
Reality check: Multi-cloud sounds appealing but introduces enormous complexity. Each provider has different APIs, security models, networking paradigms, and operational tools. Your team must maintain expertise across multiple platforms. Portability comes at a cost - you can't use provider-specific managed services without forfeiting portability.
When multi-cloud makes sense:
- Large enterprises with regulatory requirements for geographic distribution
- Organizations with sufficient scale to justify dedicated teams per provider
- Applications already built with high portability (Kubernetes-native, for example)
- Strategic acquisitions that bring different cloud footprints
When to avoid multi-cloud:
- Small to medium teams - the operational overhead outweighs benefits
- Applications that benefit from deep integration with managed services
- Startups optimizing for speed over resilience
Most organizations should default to a single primary cloud provider and use others selectively for specific capabilities (e.g., CDN, video encoding). True multi-cloud with workload distribution is only practical for large, mature engineering organizations.
Shared Responsibility Model
The shared responsibility model defines which security and operational tasks are the provider's responsibility and which are yours. This model applies to all cloud services but varies by service type.
Provider Responsibilities
Cloud providers are responsible for security OF the cloud - the physical infrastructure, hardware, networking, and foundational software.
What providers manage:
- Physical security: Data center access controls, surveillance, hardware disposal
- Infrastructure maintenance: Hardware replacement, network capacity, power and cooling
- Platform security: Hypervisor security, host OS patching, infrastructure-level vulnerability management
- Compliance certifications: SOC 2, ISO 27001, PCI-DSS, HIPAA (for infrastructure)
For higher-level services (PaaS, SaaS), providers take on additional responsibilities like runtime patching, database backups (if managed), and service availability.
Customer Responsibilities
Customers are responsible for security IN the cloud - everything you deploy, configure, and manage on top of the provider's infrastructure.
What you must manage:
- Data protection: Encrypting sensitive data at rest and in transit, managing encryption keys, implementing backup strategies (see Data Protection)
- Access control: Managing user identities, implementing least privilege access, configuring IAM policies (see Authorization)
- Application security: Securing application code, patching vulnerabilities in dependencies, input validation (see Input Validation)
- Network configuration: Configuring security groups, network ACLs, VPCs, and firewalls correctly
- Operating system (for IaaS): Patching OS vulnerabilities, configuring host firewalls, managing system users
- Compliance: Ensuring your use of cloud services meets regulatory requirements
Shared Responsibilities
Some areas are jointly managed, where both provider and customer have responsibilities.
Examples:
- Encryption: Provider offers encryption capabilities (KMS, encryption algorithms); you decide what to encrypt and manage keys
- Identity management: Provider supplies IAM infrastructure; you configure policies, roles, and access controls
- Network security: Provider offers DDoS protection at infrastructure level; you configure application-level rate limiting and WAF rules (see Rate Limiting)
- Patch management: Provider patches infrastructure; you patch guest OS (IaaS) and application dependencies
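One way to internalize how responsibility shifts across service models is a simple lookup table. The matrix below is illustrative, not authoritative - exact boundaries vary by provider and service:

```python
# Who handles each layer under each service model:
# "provider", "customer", or "shared". Illustrative values only.
RESPONSIBILITY = {
    "physical_security":   {"iaas": "provider", "paas": "provider", "saas": "provider"},
    "guest_os_patching":   {"iaas": "customer", "paas": "provider", "saas": "provider"},
    "application_code":    {"iaas": "customer", "paas": "customer", "saas": "provider"},
    "data_classification": {"iaas": "customer", "paas": "customer", "saas": "customer"},
    "identity_and_access": {"iaas": "shared",   "paas": "shared",   "saas": "shared"},
}

def who_handles(layer: str, model: str) -> str:
    return RESPONSIBILITY[layer][model]

print(who_handles("guest_os_patching", "iaas"))  # customer
print(who_handles("guest_os_patching", "paas"))  # provider
```

Note that data classification stays with the customer in every model - no provider decides what your data means or how sensitive it is.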
Why This Matters
Misunderstanding the shared responsibility model leads to security breaches. A common mistake is assuming the cloud provider secures your data automatically - they secure the infrastructure, but you must configure security controls correctly.
Real-world example: An S3 bucket (object storage) is physically secure and highly durable. But if you configure it as publicly accessible, anyone can download your data. The provider secured the storage infrastructure; you failed to configure access controls. This is your responsibility.
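Misconfigurations like this are detectable before deployment. The sketch below lints an S3-style bucket policy for anonymous reads; it is deliberately simplified (real policy evaluation also involves conditions, ACLs, and account-level public-access blocks):

```python
import json

def is_publicly_readable(policy_json: str) -> bool:
    """Flag a bucket policy that allows anonymous object reads (simplified)."""
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        anonymous = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") == "*")
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if stmt.get("Effect") == "Allow" and anonymous and "s3:GetObject" in actions:
            return True
    return False

risky = json.dumps({"Statement": [{"Effect": "Allow", "Principal": "*",
                                   "Action": ["s3:GetObject"],
                                   "Resource": "arn:aws:s3:::my-bucket/*"}]})
print(is_publicly_readable(risky))  # True
```

Running checks like this in CI (or using the provider's built-in policy analyzers) is one way customers fulfill their side of the shared responsibility model.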
For detailed security implementation guidance, see our Security Overview.
Cloud-Native Principles
Cloud-native isn't just about running workloads in the cloud - it's about designing applications to leverage cloud capabilities like auto-scaling, distributed systems, and managed services.
The Twelve-Factor App
The Twelve-Factor App methodology defines best practices for building cloud-native applications. These principles ensure portability, scalability, and maintainability.
Key factors:
- Codebase: One codebase tracked in version control, many deployments (see Git Workflow)
- Dependencies: Explicitly declare and isolate dependencies (Maven, npm, Gradle)
- Config: Store configuration in environment variables, not code (see Secrets Management)
- Backing services: Treat databases, queues, and caches as attached resources accessible via URLs
- Build, release, run: Strictly separate build, release, and run stages (see Pipelines)
- Processes: Execute the app as stateless processes; store state in backing services
- Port binding: Export services via port binding (HTTP server embedded in app, not external web server)
- Concurrency: Scale out via the process model (horizontal scaling with multiple instances)
- Disposability: Fast startup and graceful shutdown for robustness and elasticity
- Dev/prod parity: Keep development, staging, and production as similar as possible
- Logs: Treat logs as event streams written to stdout (see Logging)
- Admin processes: Run admin/management tasks as one-off processes
Why these matter: These principles enable applications to scale horizontally, deploy rapidly, and run reliably in cloud environments. For example, storing session state in process memory ties users to specific instances and breaks horizontal scaling, while embedding configuration in code forces a redeployment for every config change.
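The config factor in particular is easy to demonstrate. A sketch of environment-based configuration, so the same build artifact runs unchanged in dev, staging, and production (the variable names are illustrative conventions, not a standard):

```python
import os

def load_config(env=os.environ):
    """Factor III (Config): deploy-specific values come from the environment,
    with safe local-development defaults -- never from code or committed files."""
    return {
        "database_url": env.get("DATABASE_URL", "postgres://localhost:5432/dev"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "port": int(env.get("PORT", "8080")),
    }

# Local dev: defaults apply. Production: the platform injects real values.
cfg = load_config({"DATABASE_URL": "postgres://prod-db:5432/app", "PORT": "80"})
print(cfg["port"])  # 80
```

Secrets (API keys, passwords) deserve stricter handling than plain environment variables in many setups; see the Secrets Management reference noted above.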
Stateless Applications
Cloud-native applications should be stateless - any instance can handle any request, and no data is stored locally that can't be lost.
Stateless design patterns:
- Store session data in distributed caches (Redis, Memcached) rather than in-memory (see Caching)
- Use database transactions for state changes rather than multi-step in-memory operations
- Design for process crashes - every operation should be idempotent or transactional
- Avoid local file storage; use object storage (S3) or shared file systems for persistent data (see File Storage)
Benefits:
- Horizontal scaling: Add instances without coordination or data migration
- Resilience: Instance failures don't lose data or break user sessions
- Zero-downtime deployments: Rolling updates work seamlessly when instances are interchangeable
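The first pattern above - session data in a shared store rather than instance memory - can be sketched as follows. A plain dict stands in for Redis here; in a real deployment the store lives outside the app instances, which is exactly what makes any instance able to serve any request:

```python
import uuid

class SessionStore:
    """Stand-in for an external cache like Redis (an illustration, not a client)."""
    def __init__(self):
        self._data = {}
    def put(self, sid, session):
        self._data[sid] = session
    def get(self, sid):
        return self._data.get(sid)

store = SessionStore()  # shared backing service, not per-instance memory

def handle_login(store, user):
    sid = str(uuid.uuid4())
    store.put(sid, {"user": user})
    return sid

def handle_request(store, sid):
    # Any instance can run this -- no instance-local session state exists,
    # so instances are interchangeable and can crash or scale freely.
    session = store.get(sid)
    return session["user"] if session else None

sid = handle_login(store, "alice")
print(handle_request(store, sid))  # alice
```

Swapping the dict for a real Redis client changes the implementation but not the design: state lives in a backing service, processes stay disposable.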
Microservices Architecture
Cloud-native applications are often built as microservices - small, independently deployable services that communicate via APIs. This architecture enables:
- Independent scaling: Scale services based on their specific load patterns
- Technology diversity: Use different languages/frameworks per service based on requirements
- Fault isolation: Failures in one service don't cascade to others
- Team autonomy: Small teams own specific services end-to-end
However, microservices introduce complexity around distributed systems, network reliability, and operational overhead. Start with a well-structured monolith and decompose as scale or organizational needs demand. See Microservices for detailed patterns.
Containerization and Orchestration
Containers package applications with their dependencies into portable, immutable artifacts. Orchestration platforms (Kubernetes, ECS, Cloud Run) manage container lifecycle, scaling, and networking.
Why containers matter in the cloud:
- Consistency: Same container runs identically in dev, staging, and production
- Density: Run many containers per host, maximizing resource utilization
- Speed: Fast startup times (seconds) enable rapid scaling and deployments
- Portability: Containers run on any cloud provider or on-premises
See Docker for container best practices and Kubernetes for orchestration patterns.
Global Infrastructure Concepts
Cloud providers operate globally distributed infrastructure to enable low-latency access for users worldwide and high availability through redundancy.
Regions
A region is a geographic area containing multiple isolated data centers. Each region is completely independent from other regions.
Characteristics:
- Geographic isolation: Regions are separated by hundreds of miles to protect against regional disasters
- Independent infrastructure: Separate power grids, network connectivity, and operational teams
- Compliance boundaries: Data in a region typically stays in that region (data sovereignty)
- Latency zones: Choose regions close to users for low latency
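The latency-zone idea reduces to picking the region with the lowest round-trip time for a given user. The RTT figures below are hypothetical; real deployments automate this per request with latency-based DNS or anycast routing:

```python
# Hypothetical measured round-trip times (ms) from one user to each region.
measured_rtt_ms = {"us-east-1": 95, "eu-west-1": 18, "ap-southeast-1": 210}

def closest_region(rtts: dict) -> str:
    """Pick the lowest-latency region for this user."""
    return min(rtts, key=rtts.get)

print(closest_region(measured_rtt_ms))  # eu-west-1
```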
When to use multiple regions:
- Global user base: Deploy close to users in multiple geographies
- Disaster recovery: Replicate data to another region for business continuity (see Resilience)
- Compliance: Keep data in specific countries/regions due to regulatory requirements
- High availability: Protect against region-wide outages (rare but possible)
Trade-offs: Multi-region architectures add complexity - data replication, cross-region networking costs, eventual consistency challenges, and operational overhead. Only adopt multi-region when benefits justify these costs.
Availability Zones (AZs)
Availability zones are isolated data centers within a region. They provide high availability without the complexity of multi-region deployments.
Characteristics:
- Physical isolation: Separate buildings, power, cooling, and network connectivity
- Low-latency connectivity: Single-digit-millisecond (often sub-millisecond) round trips between AZs in the same region
- Independent failure domains: Designed so that AZ failures don't cascade to other AZs
- Synchronous replication: Fast enough for synchronous database replication (unlike regions)
Best practices:
- Distribute resources: Run instances, containers, and databases across multiple AZs
- Design for AZ failure: Assume any AZ can fail; ensure your application continues running
- Load balancing: Use load balancers to distribute traffic across AZs automatically
Example architecture: A typical high-availability setup runs three application instances (one per AZ) behind a load balancer, with a database configured for multi-AZ failover. If one AZ fails, the load balancer routes traffic to remaining AZs, and the database fails over automatically.
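The failover behavior in this example can be sketched as a round-robin balancer that skips unhealthy AZs. This is a toy model of the described behavior, not any provider's load balancer API:

```python
import itertools

class MultiAZLoadBalancer:
    """Toy round-robin balancer that routes around failed AZs."""
    def __init__(self, instances_by_az):
        self.instances = instances_by_az      # e.g. {"az-a": "i-1", ...}
        self.healthy = set(instances_by_az)   # all AZs start healthy
        self._cycle = itertools.cycle(sorted(instances_by_az))

    def mark_unhealthy(self, az):
        self.healthy.discard(az)              # health checks would do this

    def route(self):
        for _ in range(len(self.instances)):
            az = next(self._cycle)
            if az in self.healthy:
                return self.instances[az]
        raise RuntimeError("no healthy AZs")

lb = MultiAZLoadBalancer({"az-a": "i-1", "az-b": "i-2", "az-c": "i-3"})
lb.mark_unhealthy("az-a")                     # simulate an AZ outage
targets = {lb.route() for _ in range(6)}
print(sorted(targets))  # ['i-2', 'i-3'] -- traffic avoids the failed AZ
```

Real load balancers add health-check probes, connection draining, and weighted routing, but the core contract is the same: callers never need to know which AZ failed.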
For most applications, deploying across multiple AZs within a single region provides sufficient availability and disaster recovery without multi-region complexity.
Edge Locations
Edge locations are points of presence (PoPs) for content delivery networks (CDNs). They cache static content close to users for fast access.
Use cases:
- Static asset delivery: Images, CSS, JavaScript, videos
- API acceleration: Route API requests through optimized network paths
- DDoS protection: Absorb malicious traffic at the edge before it reaches your infrastructure
CDNs are essential for global web applications, reducing latency and bandwidth costs. See Performance Optimization for caching strategies.
Cloud Cost Models
Cloud pricing is fundamentally different from traditional capital expenditure. Understanding cost models is essential for financial planning and optimization.
Pay-As-You-Go
The default cloud pricing model charges for actual resource usage (compute hours, storage GB, network transfer).
Advantages:
- No upfront costs: Start small and scale as needed
- Elasticity: Pay more during high-traffic periods, less during quiet periods
- Experimentation: Try new technologies without capital approval
- Granular billing: Understand costs per project, team, or customer
Challenges:
- Unpredictable costs: Traffic spikes can cause unexpected bills
- Waste: Unused resources (forgotten instances, over-provisioned capacity) still incur costs
- Complexity: Hundreds of services with different pricing dimensions
Best practices:
- Implement cost alerts and budgets to catch unexpected increases
- Tag resources by project, environment, and team for cost allocation
- Review and shut down unused resources regularly
- Use auto-scaling to match capacity to demand
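The first best practice above - budgets with alerts - is conceptually just a threshold check on month-to-date spend. Providers offer this as managed budget/alerting services; the 80% alert threshold below is an illustrative default, not a recommendation:

```python
def budget_status(month_to_date: float, monthly_budget: float,
                  alert_threshold: float = 0.8) -> str:
    """Classify current spend against a monthly budget."""
    ratio = month_to_date / monthly_budget
    if ratio >= 1.0:
        return "over-budget"
    if ratio >= alert_threshold:
        return "alert"
    return "ok"

print(budget_status(450.0, 1000.0))   # ok
print(budget_status(850.0, 1000.0))   # alert
print(budget_status(1200.0, 1000.0))  # over-budget
```

In practice you would also project end-of-month spend from the daily run rate, since hitting 80% on day 10 is very different from hitting it on day 28.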
Reserved Capacity
For predictable workloads, reserved capacity (reserved instances, committed use discounts) offers significant discounts (30-70%) in exchange for one- or three-year commitments.
When to use reserved capacity:
- Steady-state workloads (databases, baseline compute capacity)
- Long-term projects with stable resource needs
- Cost optimization after initial deployment proves resource requirements
Risks:
- Commitment inflexibility - you pay whether you use the capacity or not
- Technology changes may make reservations obsolete (e.g., moving to serverless)
- Organizational changes (project cancellations, team restructures) can strand reservations
Strategy: Start with on-demand pricing to establish baseline usage, then reserve capacity for the stable baseline while using on-demand or spot for variable workloads.
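The reserve-the-baseline strategy has a simple break-even test: a reservation pays off only if the instance actually runs more than a certain fraction of the time. The hourly prices below are made-up examples:

```python
def break_even_utilization(on_demand_hourly: float,
                           reserved_hourly_effective: float) -> float:
    """Fraction of hours an instance must run before a reservation beats
    on-demand pricing (reservations bill whether the instance runs or not)."""
    return reserved_hourly_effective / on_demand_hourly

# e.g. $0.10/h on-demand vs an effective $0.06/h reserved rate:
u = break_even_utilization(0.10, 0.06)
print(f"{u:.0%}")  # 60% -- reserve only if the instance runs >60% of the time
```

This is why the advice is to reserve the stable baseline: baseline capacity runs near 100% utilization, well past any realistic break-even point.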
Spot / Preemptible Instances
Spot instances offer steep discounts (up to 90%) for workloads that tolerate interruptions. Providers reclaim capacity when needed, terminating your instances with short notice (typically 30-120 seconds).
Ideal for:
- Batch processing and data analysis
- Fault-tolerant distributed systems (Spark, Hadoop)
- CI/CD build workers (see Pipelines)
- Development and test environments
Not suitable for:
- Stateful applications without checkpointing
- Latency-sensitive interactive workloads
- Databases or primary application servers
Implementation patterns:
- Checkpoint progress regularly so restarted work can resume
- Use spot for worker nodes with on-demand for critical components
- Implement graceful shutdown handlers to save state on termination notice
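The checkpoint-and-resume pattern above can be sketched as follows. The checkpoint file format and class names are illustrative; on a real spot instance you would register `save` as a SIGTERM handler (`signal.signal(signal.SIGTERM, worker.save)`) so it fires on the termination notice:

```python
import json
import os
import tempfile

class CheckpointedWorker:
    """Spot-friendly worker sketch: persist progress so a replacement
    instance can resume after the provider reclaims this one."""
    def __init__(self, checkpoint_path, items):
        self.checkpoint_path = checkpoint_path
        self.items = items
        self.done = self._load()                 # resume from last checkpoint

    def _load(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)["done"]
        return 0

    def save(self, *_signal_args):
        with open(self.checkpoint_path, "w") as f:
            json.dump({"done": self.done}, f)

    def run(self, max_items=None):
        end = len(self.items) if max_items is None else min(
            len(self.items), self.done + max_items)
        for i in range(self.done, end):
            self.done = i + 1                    # "process" item i
        self.save()

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
w1 = CheckpointedWorker(path, list(range(10)))
w1.run(max_items=4)                              # "reclaimed" after 4 items
w2 = CheckpointedWorker(path, list(range(10)))   # replacement instance
print(w2.done)  # 4 -- resumes where the first worker left off
w2.run()
print(w2.done)  # 10
```

The checkpoint would live in object storage or a database in production, not on local disk, since local disk disappears with the instance.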
Data Transfer Costs
Data transfer between cloud regions, out to the internet, and sometimes between services within a region incurs charges. This is often overlooked but can be significant.
Cost optimization strategies:
- Use CDNs to cache content near users, reducing origin traffic
- Keep communicating services in the same region/AZ when possible
- Use VPC endpoints or private networking to avoid internet transfer charges
- Compress data before transfer
- Batch operations to reduce API call frequency
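The compression strategy is easy to quantify: egress is billed per byte, and repetitive payloads (logs, JSON records) compress dramatically. A quick stdlib demonstration (the record shape is an arbitrary example):

```python
import gzip
import json

# 500 near-identical JSON records, like a typical log or telemetry batch.
records = [{"id": i, "status": "ok", "region": "us-east-1"} for i in range(500)]
raw = json.dumps(records).encode()
compressed = gzip.compress(raw)

# Repetitive JSON typically shrinks by well over 5x, cutting transfer costs
# proportionally (exact per-GB prices vary by provider and destination).
print(f"{len(raw)} bytes -> {len(compressed)} bytes")
```

The trade-off is CPU time for compression on both ends, which is almost always cheaper than the bandwidth it saves for text-like payloads.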
For detailed cost optimization strategies, see provider-specific cost documentation.
Cloud Migration Strategies
Moving existing applications to the cloud requires a strategy that balances speed, risk, and long-term benefits. The "6 Rs" framework helps categorize migration approaches.
Rehost (Lift and Shift)
Move applications to the cloud with minimal changes - copy virtual machine images or re-install software on cloud VMs.
Advantages:
- Fastest migration path
- Lowest risk - application behavior doesn't change
- Immediate infrastructure benefits (scalability, managed hardware)
- Can optimize later after migration
Disadvantages:
- Doesn't leverage cloud-native capabilities (auto-scaling, managed services)
- Carries technical debt from legacy architecture
- May cost more than on-premises without optimization
When to use:
- Time-constrained data center exit
- Applications with unclear documentation or expertise
- Stable applications unlikely to need significant changes
- First step in a phased modernization plan
Replatform (Lift and Reshape)
Migrate applications with minor optimizations to take advantage of cloud services - replace self-managed databases with managed services, use load balancers, implement auto-scaling.
Advantages:
- Moderate effort with significant operational benefits
- Reduce operational overhead (managed databases, patching)
- Improve availability and scalability
- Keep application logic mostly unchanged
Disadvantages:
- Requires configuration changes and testing
- May expose architectural issues (tight coupling, stateful designs)
- Doesn't fully leverage cloud-native patterns
When to use:
- Applications with clear upgrade paths (self-managed PostgreSQL → managed RDS)
- Opportunities to eliminate undifferentiated heavy lifting (database administration, load balancer management)
- Moderate risk tolerance with medium timeline
Example: Migrating a Java application from on-premises Tomcat and PostgreSQL to cloud VMs with managed PostgreSQL database service and application load balancer. The application code is mostly unchanged, but you've eliminated database administration and gained automatic failover.
Refactor (Re-architect)
Redesign applications to be cloud-native - adopt microservices, containerization, serverless, managed services, and modern development practices.
Advantages:
- Maximize cloud benefits (elasticity, cost optimization, resilience)
- Improve development velocity with modern tooling and practices (see Spring Boot, React)
- Enable continuous delivery and frequent deployments (see Pipelines)
- Better alignment with business needs through agility
Disadvantages:
- Highest cost and timeline
- Significant technical risk requiring deep expertise
- Requires rewriting code and changing architecture
- Team must learn new patterns and technologies
When to use:
- Legacy applications blocking business agility
- Significant technical debt prevents further development
- Business drivers justify investment (scaling, new features, cost optimization)
- Opportunity to modernize technology stack
Example: Decomposing a monolithic Java EE application into Spring Boot microservices running on Kubernetes, with PostgreSQL replaced by managed databases, REST APIs for inter-service communication (see API Design), and event-driven patterns for async workflows (see Event-Driven Architecture).
Repurchase (Replace with SaaS)
Replace custom-built applications with commercial SaaS products - CRM, HR systems, authentication, payment processing.
Advantages:
- Eliminate custom code maintenance
- Get enterprise features (compliance, SSO, audit logs) out-of-the-box
- Faster time to value
- Predictable subscription costs
Disadvantages:
- Vendor lock-in and dependency
- Limited customization
- Data migration from existing systems
- Potential loss of competitive differentiation
When to use:
- Non-differentiating capabilities (email, CRM, HR)
- Business pressures to reduce IT headcount
- Existing system is outdated with no upgrade path
- Compliance or security features are needed quickly
Retain (Keep On-Premises)
Decide explicitly not to migrate certain applications - keep them on-premises or in existing hosting.
When to retain:
- Regulatory constraints prevent cloud usage
- Application has short remaining lifespan (< 1 year)
- Migration risks or costs outweigh benefits
- Application requires hardware dependencies not available in cloud (specialized equipment)
Retire (Decommission)
Shut down applications that are no longer needed.
Benefits:
- Reduce operational costs and complexity
- Improve security by eliminating old, unpatched systems
- Simplify infrastructure and focus resources
Cloud migration projects often reveal unused or redundant applications. Decommissioning them before migration saves money and effort.
Multi-Cloud vs Single-Cloud Strategy
Choosing between single-cloud and multi-cloud is a strategic decision with profound implications for architecture, operations, and cost.
Single-Cloud Strategy
Commit primarily to one cloud provider, using their services deeply and taking advantage of managed offerings.
Advantages:
- Deep integration: Use provider-specific managed services (databases, queues, AI/ML, analytics) without compatibility layers
- Operational simplicity: One set of APIs, tools, security models, and billing
- Lower cost: Avoid duplication and abstraction overhead; qualify for volume discounts
- Team expertise: Build deep knowledge of one platform rather than surface knowledge of many
Disadvantages:
- Vendor lock-in: Switching providers becomes expensive due to proprietary service dependencies
- Price sensitivity: Limited negotiating leverage with a single vendor
- Provider outage risk: Complete dependence on one provider's availability
When to choose single-cloud:
- Small to medium engineering teams
- Applications benefit from managed service integration
- Speed and simplicity are priorities
- Cost optimization matters more than theoretical portability
Multi-Cloud Strategy
Distribute workloads across multiple cloud providers to avoid dependency on any single vendor.
Advantages:
- Avoid vendor lock-in: Credible exit option provides negotiating leverage
- Best-of-breed services: Use each provider's strengths selectively
- Geographic compliance: Meet data residency requirements through provider choice
- Resilience: Mitigate risk of provider-wide outages
Disadvantages:
- Operational complexity: Multiple APIs, tools, security models, billing systems
- Higher costs: Duplication, abstraction layers, cross-cloud networking, multiple teams
- Limited managed service usage: Portability requires using lowest-common-denominator services (VMs, containers, object storage)
- Team expertise dilution: Engineers must know multiple platforms
When to choose multi-cloud:
- Large enterprises with dedicated platform teams
- Regulatory requirements mandate geographic distribution across providers
- Existing organizational complexity (acquisitions, divisions with different providers)
- Strategic priority to avoid vendor dependency justifies cost
Practical Middle Ground
Most organizations should start with a primary cloud provider and be strategic about multi-cloud.
Pragmatic multi-cloud patterns:
- Primary cloud + CDN: Use one provider for compute/data, another for content delivery (CDN providers are largely interchangeable)
- Primary cloud + specialty services: Use one provider for infrastructure, another for specific capabilities (AI/ML, video processing) not easily replicated
- Kubernetes for portability: Use Kubernetes as an abstraction layer, but accept that storage, networking, and managed services are still provider-specific
Avoid:
- Active-active workloads distributed across providers for "resilience" - the complexity and cost rarely justify the marginal availability improvement
- Building custom abstraction layers to make code portable - you'll spend more on abstraction than you'd save by switching
When NOT to Use Cloud
Cloud computing is powerful but not universally appropriate. Understand when on-premises or hybrid solutions are better.
Regulatory and Data Sovereignty Constraints
Some regulations require data to remain in specific jurisdictions or prohibit third-party processing.
Examples:
- Financial regulations requiring on-premises processing for certain transaction types
- Healthcare data laws (HIPAA) requiring specific controls not easily demonstrated in shared infrastructure
- Government contracts mandating on-premises or government-only cloud environments
Options:
- Use public cloud regions within required jurisdictions
- Implement hybrid cloud with sensitive data on-premises
- Use provider compliance certifications (FedRAMP, HIPAA-eligible services) if acceptable
Cost at Scale
For sustained, predictable workloads at very large scale, owning hardware can be cheaper than renting compute capacity.
Break-even considerations:
- Cloud economics favor variable workloads - idle capacity is waste in owned data centers
- Capital costs, facilities management, and operational staff are significant
- Cloud providers' economies of scale often outweigh your ability to buy cheaper hardware
Reality: Few organizations have sufficient scale, expertise, and consistent workloads to justify building data centers. If you're not running tens of thousands of servers with high utilization, cloud is likely cheaper.
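The break-even intuition can be made concrete with rough arithmetic. Every dollar figure below is a made-up assumption for illustration, not a real price quote; the structural point is that elastic cloud capacity is billed on average usage, while owned hardware must be sized for peak and carries capital, facilities, and staffing costs even when idle.

```python
# Rough monthly TCO comparison for a workload with variable demand.
# All numbers are illustrative assumptions, not real prices.
def cloud_monthly(avg_vms: float, hourly_rate: float) -> float:
    # Elastic: pay only for the average number of VMs actually running.
    return avg_vms * hourly_rate * 730  # ~730 hours per month

def on_prem_monthly(peak_servers: int, server_capex: float,
                    amortize_years: int, facilities_per_server: float,
                    staff_cost: float) -> float:
    # Owned: must provision for peak; idle capacity still costs money.
    amortized_capex = peak_servers * server_capex / (amortize_years * 12)
    return amortized_capex + peak_servers * facilities_per_server + staff_cost

# Workload peaks at 200 servers but averages 80 in-use VMs.
cloud = cloud_monthly(avg_vms=80, hourly_rate=0.20)
onprem = on_prem_monthly(peak_servers=200, server_capex=8000,
                         amortize_years=4, facilities_per_server=60,
                         staff_cost=40000)
print(f"cloud ~ ${cloud:,.0f}/mo, on-prem ~ ${onprem:,.0f}/mo")
```

With these assumed numbers the elastic cloud bill is a fraction of the on-prem figure; the comparison only starts to flip when utilization is consistently high and scale is large enough to amortize the fixed costs, which is exactly the "tens of thousands of servers" threshold described above.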
Specialized Hardware Requirements
Applications requiring custom hardware (GPUs for specific workloads, FPGAs, exotic storage arrays) may not have cloud equivalents.
Options:
- Check for cloud provider specialty instances (GPU instances, high-memory, high-storage)
- Use hybrid approach - specialized hardware on-premises, standard workloads in cloud
- Evaluate whether requirements are truly specialized or based on legacy assumptions
Latency-Sensitive Applications
If you need sub-millisecond to low single-digit millisecond latency to on-premises systems or hardware, the network distance to a cloud region may be unacceptable.
Examples:
- High-frequency trading systems colocated with exchanges
- Industrial control systems requiring real-time response to sensors
- Legacy applications with tight coupling to on-premises systems during migration
Options:
- Use cloud edge computing services to get closer to data sources
- Implement hybrid connectivity with dedicated links (Direct Connect, ExpressRoute)
- Re-architect to tolerate higher latency or use async communication patterns
Cloud vs On-Premises Decision Framework
Use this framework to evaluate whether cloud, on-premises, or hybrid is appropriate for specific workloads.
Decision criteria:
- Compliance and regulation: Can data and processing occur in public cloud?
- Workload characteristics: Variable traffic benefits from cloud elasticity; steady traffic may be cheaper on-premises at scale
- Organizational capability: Do you have expertise to build and operate data centers?
- Speed and agility: Cloud enables faster development and deployment
- Cost sensitivity: Analyze total cost of ownership (TCO) including staff, facilities, and opportunity cost
For most modern applications, cloud is the right default choice. Deviate only when specific constraints justify the complexity of on-premises infrastructure.
Anti-Patterns and Common Mistakes
Lift and Shift Without Optimization
Problem: Migrating applications to cloud VMs without leveraging cloud capabilities results in high costs and operational overhead without benefits.
Solution: Treat migration as an opportunity to modernize - adopt managed services, auto-scaling, and cloud-native patterns even if not a complete refactor.
Over-Engineering for Portability
Problem: Building custom abstraction layers to avoid "vendor lock-in" adds complexity and cost without commensurate benefit. You spend more on abstraction than you'd save by switching.
Solution: Use cloud-native services that provide value. Design for portability only if you have a specific, near-term plan to change providers. Use open standards (Kubernetes, OpenAPI) where practical, but don't avoid managed services entirely.
Ignoring Cost Management
Problem: Treating cloud as unlimited resources leads to cost overruns - unused resources, over-provisioned capacity, inefficient architectures.
Solution: Implement cost allocation tags, budgets, and alerts from day one. Review spending regularly, shut down unused resources, and right-size instances. Foster a cost-conscious culture.
Multi-Cloud for the Wrong Reasons
Problem: Adopting multi-cloud for theoretical resilience or to "avoid lock-in" without considering operational complexity.
Solution: Choose multi-cloud only when benefits (geographic compliance, provider outage risk mitigation at massive scale) clearly outweigh the substantial costs and complexity.
Neglecting Security Shared Responsibility
Problem: Assuming the cloud provider secures your data and applications automatically. Misconfigured access controls, public storage buckets, and unencrypted data are common.
Solution: Understand the shared responsibility model. Implement least privilege access, encrypt sensitive data, configure network security correctly, and audit configurations regularly (see Security Overview).
Single-AZ Deployments
Problem: Running applications in a single availability zone to reduce cost or complexity, eliminating resilience against AZ failures.
Solution: Deploy across multiple AZs within a region for high availability. The marginal cost and complexity are small compared to downtime risk.
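On Kubernetes (covered in Kubernetes Patterns), multi-AZ spreading can be expressed declaratively with topology spread constraints. This is a minimal sketch - the names and image are placeholders, and it assumes nodes carry the standard `topology.kubernetes.io/zone` label:

```yaml
# Sketch: spread a Deployment's replicas across availability zones so a
# single-AZ failure cannot take out the whole service. Names/images are
# placeholders for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                  # zones differ by at most 1 pod
          topologyKey: topology.kubernetes.io/zone    # spread dimension: AZ
          whenUnsatisfiable: DoNotSchedule            # hard requirement, not best-effort
          labelSelector:
            matchLabels: {app: web}
      containers:
        - name: web
          image: example/web:latest
```

The same principle applies outside Kubernetes: managed instance groups, auto-scaling groups, and load balancers all support multi-AZ placement, usually with a single configuration flag.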
Further Reading
Books and Articles
- "Cloud Native Patterns" by Cornelia Davis: Design patterns for resilient cloud applications
- "The Practice of Cloud System Administration" by Limoncelli et al.: Operational best practices for cloud infrastructure
- AWS Well-Architected Framework / Azure Well-Architected Framework / Google Cloud Architecture Framework: Provider-specific best practices
Related Guidelines
- Microservices Architecture - Designing distributed systems
- Docker Best Practices - Containerization fundamentals
- Kubernetes Patterns - Container orchestration
- Terraform Infrastructure as Code - Automating infrastructure provisioning
- Security Overview - Securing cloud applications
- Observability - Monitoring distributed systems
- Event-Driven Architecture - Async communication patterns
- API Design - Designing cloud-native APIs
- Spring Boot Resilience - Building resilient services