Cell-Based Architecture on AWS
Cell-based architecture is a resilience pattern that partitions your infrastructure and application into isolated, independently operating units called "cells." Each cell is a complete, self-contained deployment of your application stack that can handle a subset of your total traffic. When one cell fails, the blast radius is limited to only that cell's traffic - other cells continue operating normally.
This architectural pattern emerged from companies running large-scale distributed systems where traditional approaches to fault tolerance proved insufficient. The fundamental insight is that reducing the scope of failure is often more effective than trying to prevent all failures. By constraining the impact of any single failure to a bounded cell, you achieve better overall availability than attempting to build a perfectly reliable monolithic system.
The key principle underlying cell-based architecture is failure domain isolation: every component, from compute to databases to network infrastructure, exists only within a single cell. Dependencies between cells are minimized or eliminated entirely. This creates a "bulkhead" effect where failure in one compartment cannot flood others.
Why Cell-Based Architecture?
Traditional high-availability patterns like redundant servers, load balancing, and database replication provide resilience against individual component failures, but they share a common weakness: correlated failures. A bad code deployment, a configuration error, a cascading overload, or a regional outage can affect all instances simultaneously.
The Blast Radius Problem
Consider a standard multi-AZ deployment in AWS:
Failure scenarios with full blast radius:
- Bad deployment: New code has a critical bug → all instances fail simultaneously
- Database corruption: Logic error corrupts the shared database → entire application affected
- Configuration error: Wrong environment variable deployed to all instances → complete outage
- Resource exhaustion: Traffic spike overwhelms the single RDS instance → all app servers blocked
- Regional event: AWS region-level issue → entire application offline
In each case, the blast radius is 100% of your users. Cell-based architecture limits this to a fraction.
How Cells Limit Blast Radius
With cell-based architecture, the same failure affects only one cell:
If Cell 1 experiences a bad deployment, database corruption, or configuration error, only 33% of users are affected. Cells 2 and 3 continue serving traffic normally, so the other 67% of users see no impact during the incident.
Mathematical advantage: With N cells, a single-cell failure affects at most 1/N of your traffic. With 10 cells, you maintain 90% availability even when one cell is completely down.
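The arithmetic is simple enough to state as code. A minimal sketch (cell and failure counts are illustrative):

```typescript
// Fraction of traffic affected when `failedCells` of `totalCells` fail,
// assuming traffic is spread evenly across cells.
function affectedFraction(totalCells: number, failedCells: number): number {
  if (totalCells <= 0 || failedCells < 0 || failedCells > totalCells) {
    throw new RangeError("invalid cell counts");
  }
  return failedCells / totalCells;
}

// Fraction of users still served during the incident.
function availabilityDuringFailure(totalCells: number, failedCells: number): number {
  return 1 - affectedFraction(totalCells, failedCells);
}
```

With 10 cells and one down, `availabilityDuringFailure(10, 1)` returns 0.9, matching the 90% figure above.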
Cell Design Patterns
Region-Based Cells
Each AWS region becomes a cell. This provides the strongest isolation (separate physical infrastructure) and is the simplest to implement, but provides the coarsest granularity.
Advantages:
- Geographic diversity: Protection against regional outages, natural disasters
- Regulatory compliance: Data sovereignty requirements (EU data stays in EU)
- Latency optimization: Route users to nearest region
- Strong isolation: Completely separate AWS infrastructure
Disadvantages:
- Higher cost: Full infrastructure in each region (compute, databases, networking)
- Data replication complexity: Cross-region data synchronization has latency and consistency challenges
- Coarse granularity: The smallest failure domain is an entire region, which may serve a large share of your users
When to use: Global applications with geographic distribution requirements, regulatory compliance needs, or budget for multi-region deployment.
See Disaster Recovery for multi-region active-active and active-passive patterns.
Availability Zone-Based Cells
Cells are defined by AWS availability zones within a single region. This provides finer granularity than region-based cells while maintaining strong physical isolation.
Advantages:
- Lower latency: Cross-AZ communication within a region is fast (<2ms typically)
- Lower cost: Data transfer within a region is cheaper than cross-region
- Balanced isolation: AZs are physically separate (different buildings, power, networking)
- Simpler data replication: Lower latency makes synchronous replication feasible
Disadvantages:
- Regional dependency: Regional AWS issues affect all cells
- Limited geographic diversity: All cells in one geographic area
- Cost of multi-AZ databases: Running separate RDS instances per AZ increases database costs
When to use: Applications that need strong isolation within a region, or when multi-region cost is prohibitive but you still want cell-based resilience.
Shard-Based Cells
Cells are logical partitions based on data sharding (e.g., customer ID ranges, tenant ID). Multiple cells can exist within the same region and even the same AZ, but each cell has dedicated infrastructure.
Advantages:
- Fine-grained control: Adjust cell size and count as traffic grows
- Efficient resource usage: Size cells based on actual load, not geography
- Easy to add cells: Add new cells without deploying to new regions
- Cost-effective: Can run multiple cells in same region/AZ with less overhead
Disadvantages:
- Complex routing: Need application-level routing logic to direct users to cells
- Shared fate: Cells in the same AZ share infrastructure failure risks
- Data partitioning challenges: Sharding strategy must be chosen carefully to balance load
- No geographic diversity: All cells vulnerable to regional issues
When to use: High-scale applications where fine-grained blast radius control is critical, or multi-tenant SaaS where tenants can be isolated into cells.
Hybrid Approaches
Real-world architectures often combine multiple cell types:
Example strategy:
- Primary cells by region (geographic resilience)
- Secondary cells by shard within each region (blast radius control)
- Result: Regional outage affects only one primary cell; bad deployment affects only one shard
This layered approach provides both geographic diversity and fine-grained failure isolation.
Routing Strategies
Routing users to cells is a critical design decision. The routing mechanism determines which users are affected by cell failures and how quickly you can shift traffic.
Route 53 Routing Policies
AWS Route 53 provides several routing policies suitable for cell-based architectures. See Route 53 DNS for detailed configuration.
Weighted Routing
Distribute traffic across cells by percentage:
Cell 1: 33% of traffic
Cell 2: 33% of traffic
Cell 3: 34% of traffic
How it works: Route 53 returns a cell's endpoint with probability proportional to its weight. Because resolvers cache the answer for the record's TTL, a user who receives Cell 1's IP address keeps routing to Cell 1 until that cache expires - stickiness is approximate, not guaranteed.
Code example:
# Terraform configuration for weighted routing
resource "aws_route53_record" "cell_1" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
weighted_routing_policy {
weight = 33
}
set_identifier = "cell-1"
alias {
name = aws_lb.cell_1_alb.dns_name
zone_id = aws_lb.cell_1_alb.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "cell_2" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
weighted_routing_policy {
weight = 33
}
set_identifier = "cell-2"
alias {
name = aws_lb.cell_2_alb.dns_name
zone_id = aws_lb.cell_2_alb.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "cell_3" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
weighted_routing_policy {
weight = 34
}
set_identifier = "cell-3"
alias {
name = aws_lb.cell_3_alb.dns_name
zone_id = aws_lb.cell_3_alb.zone_id
evaluate_target_health = true
}
}
evaluate_target_health = true is critical: if a cell's ALB health checks fail, Route 53 automatically stops routing traffic to that cell. This provides automatic failover.
Advantages:
- Simple to configure and understand
- Automatic failover via health checks
- Gradual traffic shifts (change weights to drain a cell)
Disadvantages:
- No stickiness guarantee: DNS TTL expiration can move users between cells
- Cannot control which specific users go to which cell
- DNS caching delays propagation of weight changes (typically 60-300 seconds)
When to use: Stateless applications, or stateful applications with cross-cell session storage (see Data Partitioning).
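Route 53 does not require weights to sum to 100; each record's expected share is its weight divided by the total. That normalization can be sketched directly (the weights are illustrative):

```typescript
// Expected long-run traffic share per record under weighted routing:
// each record is chosen with probability weight / (sum of all weights).
function trafficShares(weights: number[]): number[] {
  const total = weights.reduce((sum, w) => sum + w, 0);
  if (total <= 0) throw new RangeError("weights must sum to a positive number");
  return weights.map(w => w / total);
}
```

Setting a weight to 0 drains that cell: its share drops to zero while the remaining cells absorb its traffic.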
Geolocation Routing
Route users based on geographic location:
Users in North America → Cell 1 (us-east-1)
Users in Europe → Cell 2 (eu-west-1)
Users in Asia → Cell 3 (ap-southeast-1)
Default → Cell 1
Advantages:
- Latency optimization (users routed to nearest cell)
- Data sovereignty compliance (EU users stay in EU)
- Predictable cell assignment per geography
Disadvantages:
- Uneven load distribution (depends on user geography)
- Requires multi-region deployment
- Failover requires updating DNS or using health checks to redirect entire geographies
When to use: Global applications with regional compliance requirements or latency sensitivity.
See Route 53 DNS for examples.
Latency-Based Routing
Route users to the cell with the lowest network latency:
User in Boston → Cell in us-east-1 (10ms latency)
User in Boston → Cell in eu-west-1 (80ms latency)
Result: Route to us-east-1
AWS Route 53 measures latency from users' DNS resolvers to your cells and selects the lowest-latency option.
Advantages:
- Best user experience (automatic latency optimization)
- Adapts to changing network conditions
- No manual geography mapping required
Disadvantages:
- Unpredictable load distribution
- DNS resolver location may not match user location
- Harder to reason about which users are in which cell
When to use: Global applications prioritizing user experience where predictable cell assignment is less important.
Application-Layer Routing
For shard-based cells, routing happens at the application layer rather than DNS.
Consistent Hashing
Map users to cells using a hash function:
public class CellRouter {
private final List<String> cellEndpoints;
private final ConsistentHash<String> hashRing;
public CellRouter(List<String> cellEndpoints) {
this.cellEndpoints = cellEndpoints;
// Create consistent hash ring with virtual nodes for better distribution
this.hashRing = new ConsistentHash<>(
Hashing.murmur3_128(),
100, // virtual nodes per cell
cellEndpoints
);
}
/**
* Determine which cell should handle this user.
* Same user always routes to same cell (unless cells are added/removed).
*/
public String getCellForUser(String userId) {
return hashRing.get(userId);
}
/**
* Get the cell endpoint URL for API calls.
*/
public String getCellEndpoint(String userId) {
String cellId = getCellForUser(userId);
return "https://" + cellId + ".api.example.com";
}
}
Consistent hashing properties:
- Deterministic: Same user always routes to same cell
- Minimal disruption: Adding/removing cells only remaps ~1/N of users (N = number of cells)
- Even distribution: Virtual nodes ensure balanced load across cells
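These properties can be demonstrated with a minimal ring. The sketch below uses FNV-1a for brevity; a production ring would use a stronger hash such as MurmurHash:

```typescript
// 32-bit FNV-1a hash (illustrative; weak but deterministic).
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Minimal consistent-hash ring with virtual nodes.
class HashRing {
  private ring: { point: number; cell: string }[] = [];

  constructor(cells: string[], private vnodes: number = 100) {
    for (const cell of cells) this.add(cell);
  }

  // Each cell contributes `vnodes` points so load spreads evenly.
  add(cell: string): void {
    for (let v = 0; v < this.vnodes; v++) {
      this.ring.push({ point: fnv1a(cell + "#" + v), cell });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // The first virtual node clockwise from the key's hash owns the key.
  get(key: string): string {
    const h = fnv1a(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].cell;
  }
}
```

Adding a fourth cell remaps only the keys whose nearest clockwise virtual node changes - roughly a quarter of them - while every other key keeps its cell.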
Disadvantages:
- Sticky failures: If a user's cell is down, that user cannot access the system (no automatic failover)
- Complex implementation: Requires custom routing logic in the application
- Client-side routing: Mobile/web clients must know the routing algorithm, or use a routing service
Mitigation for sticky failures: Implement fallback routing:
public String getCellEndpointWithFallback(String userId) {
String primaryCell = getCellForUser(userId);
// Check if primary cell is healthy
if (healthChecker.isHealthy(primaryCell)) {
return "https://" + primaryCell + ".api.example.com";
}
// Fall back to next cell in hash ring
String fallbackCell = hashRing.getNext(userId);
logger.warn("Cell {} unhealthy for user {}, failing over to {}",
primaryCell, userId, fallbackCell);
return "https://" + fallbackCell + ".api.example.com";
}
This trades off stickiness (user might move cells during incident) for availability (user can still access system).
API Gateway with Lambda Authorizer
Use AWS API Gateway with a Lambda authorizer to route requests:
Lambda Authorizer code:
// Lambda authorizer for cell routing
import { APIGatewayTokenAuthorizerEvent, APIGatewayAuthorizerResult } from 'aws-lambda';
import * as jwt from 'jsonwebtoken';
const CELL_ENDPOINTS = [
'https://cell1.internal.example.com',
'https://cell2.internal.example.com',
'https://cell3.internal.example.com',
];
function hashUserId(userId: string): number {
// Simple hash - use MurmurHash or similar for production
let hash = 0;
for (let i = 0; i < userId.length; i++) {
hash = ((hash << 5) - hash) + userId.charCodeAt(i);
hash = hash & hash; // Convert to 32-bit integer
}
return Math.abs(hash);
}
export async function handler(event: APIGatewayTokenAuthorizerEvent): Promise<APIGatewayAuthorizerResult> {
try {
// Verify JWT token
const token = event.authorizationToken.replace('Bearer ', '');
const decoded = jwt.verify(token, process.env.JWT_SECRET!) as { userId: string };
// Route to cell based on user ID
const cellIndex = hashUserId(decoded.userId) % CELL_ENDPOINTS.length;
const cellEndpoint = CELL_ENDPOINTS[cellIndex];
return {
principalId: decoded.userId,
policyDocument: {
Version: '2012-10-17',
Statement: [{
Action: 'execute-api:Invoke',
Effect: 'Allow',
Resource: event.methodArn,
}],
},
context: {
userId: decoded.userId,
cellEndpoint: cellEndpoint, // Pass to integration
},
};
} catch (error) {
throw new Error('Unauthorized');
}
}
Advantages:
- Centralized routing logic (easier to update)
- Transparent to clients (just use one API Gateway URL)
- Can implement complex routing rules (A/B testing, canary rollouts)
Disadvantages:
- API Gateway cost (per request)
- Lambda authorizer latency (though caching helps)
- Additional moving parts
When to use: When you need centralized control over routing, or clients cannot implement routing logic.
See API Gateway for detailed integration patterns.
Data Partitioning Across Cells
Data partitioning is the most challenging aspect of cell-based architecture. The goal is to ensure each cell can operate independently without relying on data in other cells.
Cell-Local Data (Sharding)
Each cell stores a subset of the total dataset. A user's data lives entirely in one cell.
Sharding strategy: Determine the partition key (e.g., user ID, tenant ID) and distribution algorithm (e.g., hash, range).
Example: Hash-based sharding:
// Determine which cell's database contains this user's data
public class DataPartitioner {
private final int numberOfCells;
public DataPartitioner(int numberOfCells) {
this.numberOfCells = numberOfCells;
}
public int getCellNumber(String userId) {
// floorMod avoids a negative result when hashCode() is negative
return Math.floorMod(userId.hashCode(), numberOfCells);
}
public String getDatabaseEndpoint(String userId) {
int cellNumber = getCellNumber(userId);
return String.format("cell-%d-db.internal.example.com", cellNumber);
}
}
Advantages:
- True isolation: one cell's data corruption doesn't affect others
- Scalable: add cells to increase capacity
- No cross-cell dependencies during normal operation
Disadvantages:
- Cannot query across shards: Global queries (e.g., "all users") require fanning out to all cells
- Rebalancing complexity: Adding cells requires data migration
- Hotspots: Uneven distribution can overload some cells (e.g., one celebrity user overwhelms a cell)
Handling hotspots: Monitor database load per cell. If a cell is consistently hot:
- Vertical scaling: Increase that cell's database size temporarily
- Sub-sharding: Split the hot shard into multiple cells
- Consistent hashing with virtual nodes: Improves distribution uniformity
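As a monitoring sketch, a hot cell can be flagged by comparing its load to the fleet average (the cell names and 1.5x threshold are illustrative):

```typescript
// Flag cells whose reported load exceeds the fleet average by `factor`.
function hotCells(loadByCell: Record<string, number>, factor: number = 1.5): string[] {
  const loads = Object.values(loadByCell);
  if (loads.length === 0) return [];
  const mean = loads.reduce((sum, l) => sum + l, 0) / loads.length;
  return Object.entries(loadByCell)
    .filter(([, load]) => load > mean * factor)
    .map(([cell]) => cell)
    .sort();
}
```

An alert on this signal gives you time to scale the hot cell vertically or plan a sub-shard split before it saturates.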
See Database Design for sharding patterns and AWS Databases for RDS/Aurora configuration.
Replicated Global Data
Some data must be accessible from all cells (e.g., product catalog, configuration). Replicate this data to each cell.
Replication strategies:
- Aurora Global Database: Primary in one region, read replicas in other regions (cross-region cells)
- DynamoDB Global Tables: Multi-region active-active replication (eventual consistency)
- Event-driven replication: Publish changes to EventBridge/SNS, consume in each cell
Example: Event-driven replication:
// Product service publishes product updates to EventBridge
@Service
public class ProductService {
private final ProductRepository productRepository;
private final EventBridgeClient eventBridge;
private final ObjectMapper objectMapper;
public ProductService(ProductRepository productRepository,
EventBridgeClient eventBridge,
ObjectMapper objectMapper) {
this.productRepository = productRepository;
this.eventBridge = eventBridge;
this.objectMapper = objectMapper;
}
public void updateProduct(Product product) throws JsonProcessingException {
productRepository.save(product);
// Publish event for cross-cell replication
PutEventsRequestEntry event = PutEventsRequestEntry.builder()
.source("product-service")
.detailType("ProductUpdated")
.detail(objectMapper.writeValueAsString(product))
.eventBusName("global-product-events")
.build();
eventBridge.putEvents(r -> r.entries(event));
}
}
// Each cell's product service consumes events to update local replica
@Component
public class ProductReplicationConsumer {
@SqsListener(queueNames = "${product.replication.queue}")
public void handleProductUpdate(ProductUpdatedEvent event) {
// Update local read replica
productReadRepository.upsert(event.getProduct());
}
}
Consistency model: Replicated data is eventually consistent. Updates propagate within seconds to minutes. Application logic must tolerate stale reads.
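Because events can arrive late, duplicated, or out of order, each cell's consumer should apply updates idempotently. One common guard is a monotonic version number; the sketch below (the event shape is hypothetical) drops anything older than what the replica already holds:

```typescript
// Order-tolerant replica updates: apply an event only if its version is
// newer than the locally held one.
interface ProductEvent {
  id: string;
  version: number;
  name: string;
}

class ProductReplica {
  private byId = new Map<string, ProductEvent>();

  // Returns true if the event was applied, false if stale or duplicate.
  apply(event: ProductEvent): boolean {
    const current = this.byId.get(event.id);
    if (current !== undefined && current.version >= event.version) {
      return false;
    }
    this.byId.set(event.id, event);
    return true;
  }

  get(id: string): ProductEvent | undefined {
    return this.byId.get(id);
  }
}
```

With this guard, redelivering or reordering events converges to the same replica state, which is exactly the property eventual consistency requires of the consumer.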
When to use:
- Read-heavy reference data (products, configurations)
- Data that changes infrequently
- Acceptable for reads to be slightly stale
See Event-Driven Architecture for event propagation patterns and AWS Messaging for EventBridge details.
Cross-Cell Session Storage
For stateful applications, session data must be accessible across cells to handle DNS routing changes or failover.
Options:
1. Centralized Session Store (ElastiCache Global Datastore)
Spring Boot configuration:
@Configuration
public class SessionConfig {
@Bean
public LettuceConnectionFactory redisConnectionFactory() {
RedisStandaloneConfiguration config = new RedisStandaloneConfiguration(
"cell-global-sessions.cache.amazonaws.com", 6379
);
return new LettuceConnectionFactory(config);
}
@Bean
public RedisTemplate<String, Object> redisTemplate() {
RedisTemplate<String, Object> template = new RedisTemplate<>();
template.setConnectionFactory(redisConnectionFactory());
return template;
}
}
// Use Spring Session for automatic session management
@Service
public class UserSessionService {
private final RedisTemplate<String, Object> redisTemplate;
public UserSessionService(RedisTemplate<String, Object> redisTemplate) {
this.redisTemplate = redisTemplate;
}
public void storeSession(String sessionId, UserSession session) {
redisTemplate.opsForValue().set(
"session:" + sessionId,
session,
Duration.ofHours(24)
);
}
public UserSession getSession(String sessionId) {
return (UserSession) redisTemplate.opsForValue().get("session:" + sessionId);
}
}
Advantages:
- Users can seamlessly switch cells (DNS routing change, failover)
- Centralized session visibility for debugging
Disadvantages:
- Cross-cell dependency: If the global session store is down, all cells cannot authenticate users (defeats cell isolation)
- Latency: Cross-region reads for session data add 50-200ms
- Cost: Global replication infrastructure
Mitigation: Use with graceful degradation. If session store is unavailable, fall back to requiring re-authentication.
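That degradation path can be sketched as follows (the store interface is hypothetical, and a real Redis client would be asynchronous):

```typescript
interface SessionStore {
  get(id: string): string | null;
}

type SessionResult =
  | { kind: "found"; session: string }
  | { kind: "not-found" }
  | { kind: "reauth-required" };

// Treat a store outage as "session unknown" rather than a hard failure,
// so the cell keeps serving and simply forces re-authentication.
function loadSession(store: SessionStore, id: string): SessionResult {
  try {
    const session = store.get(id);
    return session !== null ? { kind: "found", session } : { kind: "not-found" };
  } catch {
    return { kind: "reauth-required" };
  }
}
```

The key design choice is that the catch branch returns a degraded-but-valid result instead of propagating the error, so the global store is no longer a single point of failure for the cell.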
2. Shared-Nothing with JWT Tokens
Eliminate server-side session storage entirely. Store session data in JWT tokens:
@Service
public class JwtSessionService {
@Value("${jwt.secret}")
private String jwtSecret;
/**
* Create a JWT containing all session data.
* No server-side storage required.
*/
public String createSessionToken(User user) {
return Jwts.builder()
.setSubject(user.getId())
.claim("email", user.getEmail())
.claim("roles", user.getRoles())
.setIssuedAt(new Date())
.setExpiration(Date.from(Instant.now().plus(24, ChronoUnit.HOURS)))
.signWith(SignatureAlgorithm.HS256, jwtSecret)
.compact();
}
/**
* Validate and extract session data from JWT.
* Works in any cell - no database lookup needed.
*/
public UserSession validateAndExtract(String token) {
Claims claims = Jwts.parser()
.setSigningKey(jwtSecret)
.parseClaimsJws(token)
.getBody();
return new UserSession(
claims.getSubject(),
claims.get("email", String.class),
claims.get("roles", List.class)
);
}
}
Advantages:
- Perfect cell isolation: No shared session infrastructure
- Scalable: No session storage to manage
- Fast: No database/cache lookup per request
Disadvantages:
- Token size: Including all session data increases token size (impacts request size)
- Cannot invalidate: Once issued, JWT is valid until expiration (cannot force logout)
- Sensitive data exposure: Session data is visible to client (though signed to prevent tampering)
When to use: Stateless applications, or applications where forced logout is rare.
See Authentication for JWT patterns.
Cell Isolation Mechanisms
True cell isolation requires separating all infrastructure layers.
Network Isolation
Each cell has its own VPC or isolated subnets:
# Terraform: Create isolated VPC per cell
resource "aws_vpc" "cell" {
count = var.number_of_cells
cidr_block = "10.${count.index}.0.0/16"
tags = {
Name = "cell-${count.index}-vpc"
Cell = count.index
}
}
# Each cell has private subnets across AZs
resource "aws_subnet" "cell_private" {
count = var.number_of_cells * 3 # 3 AZs per cell
vpc_id = aws_vpc.cell[floor(count.index / 3)].id
cidr_block = "10.${floor(count.index / 3)}.${(count.index % 3) * 64}.0/18"
availability_zone = data.aws_availability_zones.available.names[count.index % 3]
tags = {
Name = "cell-${floor(count.index / 3)}-private-${count.index % 3}"
Cell = floor(count.index / 3)
}
}
Benefits:
- Network-level blast radius containment (one VPC misconfiguration doesn't affect other cells)
- Security group isolation (cannot accidentally open cross-cell traffic)
- Independent network troubleshooting
Cross-cell communication: Minimize but not eliminate. Use VPC peering or Transit Gateway only for essential global services (e.g., centralized logging).
See AWS Networking for VPC design patterns.
Compute Isolation
Separate ECS clusters, EKS clusters, or auto-scaling groups per cell:
# Example: Separate EKS cluster per cell
apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-config
  namespace: default
data:
  CELL_ID: "cell-1"
  CELL_REGION: "us-east-1"
  DATABASE_ENDPOINT: "cell-1-db.internal.example.com"
  CACHE_ENDPOINT: "cell-1-cache.internal.example.com"
  # Only talk to resources within this cell
Why separate clusters:
- Control plane isolation (Kubernetes API server failure affects only one cell)
- Independent cluster upgrades (roll out new Kubernetes version to one cell at a time)
- Resource quotas per cell (prevent one cell from starving others)
See AWS EKS for cluster design and AWS Compute for ECS patterns.
Database Isolation
Each cell has its own database instance (RDS, Aurora, DynamoDB table):
# Terraform: Separate RDS instance per cell
resource "aws_db_instance" "cell" {
count = var.number_of_cells
identifier = "cell-${count.index}-db"
engine = "postgres"
engine_version = "16.6"
instance_class = "db.r6g.large"
# Each cell's database in its own VPC
db_subnet_group_name = aws_db_subnet_group.cell[count.index].name
vpc_security_group_ids = [aws_security_group.cell_db[count.index].id]
# Isolated from other cells
publicly_accessible = false
tags = {
Name = "cell-${count.index}-database"
Cell = count.index
}
}
Cost consideration: Running N database instances costs N times as much as a single instance. Mitigate by:
- Right-sizing: Size each cell's database for 1/N of traffic, not full traffic
- Serverless databases: Aurora Serverless v2 scales down during low traffic
- Reserved instances: Commit to cell databases for cost savings
See AWS Databases for RDS optimization.
Deployment Isolation
Deploy changes to one cell at a time (canary deployment at cell level):
# GitLab CI: Deploy to cells sequentially
stages:
  - deploy_cell_1
  - validate_cell_1
  - deploy_cell_2
  - validate_cell_2
  - deploy_cell_3

deploy_cell_1:
  stage: deploy_cell_1
  script:
    - kubectl --context=cell-1 apply -f k8s/
    - kubectl --context=cell-1 rollout status deployment/app

validate_cell_1:
  stage: validate_cell_1
  script:
    - ./scripts/health-check.sh cell-1
    - ./scripts/smoke-test.sh cell-1
  # If validation fails, pipeline stops here - cells 2 and 3 not deployed

deploy_cell_2:
  stage: deploy_cell_2
  script:
    - kubectl --context=cell-2 apply -f k8s/
    - kubectl --context=cell-2 rollout status deployment/app
  when: on_success # Only if cell-1 validation passed
Progressive rollout strategy:
- Deploy to Cell 1 (smallest blast radius: 1/N of users)
- Monitor metrics for 30-60 minutes
- If healthy, deploy to Cell 2
- Continue until all cells updated
If issues are detected in Cell 1, stop the rollout. Only 1/N of users affected.
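The gate between cells reduces to a small predicate over the freshly deployed cell's metrics (the 1% error-rate and 1000 ms latency thresholds are illustrative):

```typescript
interface CellMetrics {
  errorRate: number; // fraction of failed requests, e.g. 0.003
  latencyP99Ms: number;
}

// Proceed to the next cell only while the current cell stays under
// both thresholds.
function shouldProceed(
  m: CellMetrics,
  maxErrorRate: number = 0.01,
  maxLatencyMs: number = 1000,
): boolean {
  return m.errorRate <= maxErrorRate && m.latencyP99Ms <= maxLatencyMs;
}
```

Encoding the gate as a pure function makes it easy to unit-test the rollout policy separately from the pipeline that enforces it.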
See GitLab CI/CD Pipelines for detailed pipeline patterns.
Cell Health Monitoring and Failover
Health Checks
Implement comprehensive health checks at multiple layers:
@RestController
public class HealthCheckController {
private final DataSource dataSource;
private final RedisTemplate<String, String> redis;
public HealthCheckController(DataSource dataSource, RedisTemplate<String, String> redis) {
this.dataSource = dataSource;
this.redis = redis;
}
/**
* Liveness check: Is the application running?
* Used by Kubernetes to restart crashed pods.
*/
@GetMapping("/health/live")
public ResponseEntity<Map<String, String>> liveness() {
return ResponseEntity.ok(Map.of("status", "UP"));
}
/**
* Readiness check: Can the application serve traffic?
* Used by load balancers to route traffic.
*/
@GetMapping("/health/ready")
public ResponseEntity<Map<String, Object>> readiness() {
Map<String, Object> health = new HashMap<>();
// Check database connectivity
boolean dbHealthy = checkDatabase();
health.put("database", dbHealthy ? "UP" : "DOWN");
// Check cache connectivity
boolean cacheHealthy = checkCache();
health.put("cache", cacheHealthy ? "UP" : "DOWN");
// Overall status
boolean healthy = dbHealthy && cacheHealthy;
health.put("status", healthy ? "UP" : "DOWN");
return ResponseEntity
.status(healthy ? 200 : 503)
.body(health);
}
private boolean checkDatabase() {
try (Connection conn = dataSource.getConnection()) {
return conn.isValid(5); // 5 second timeout
} catch (Exception e) {
return false;
}
}
private boolean checkCache() {
try {
redis.opsForValue().get("health-check-key");
return true;
} catch (Exception e) {
return false;
}
}
}
Health check layers:
- Application Load Balancer health checks: Check /health/ready every 30 seconds
- Route 53 health checks: Monitor ALB endpoints, remove unhealthy cells from DNS
- CloudWatch alarms: Alert on health check failures, high error rates, latency spikes
# Terraform: Route 53 health check for cell
resource "aws_route53_health_check" "cell" {
count = var.number_of_cells
fqdn = "cell-${count.index}.api.example.com"
port = 443
type = "HTTPS"
resource_path = "/health/ready"
failure_threshold = 3
request_interval = 30
tags = {
Name = "cell-${count.index}-health"
Cell = count.index
}
}
# CloudWatch alarm on health check failure
resource "aws_cloudwatch_metric_alarm" "cell_unhealthy" {
count = var.number_of_cells
alarm_name = "cell-${count.index}-unhealthy"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "HealthCheckStatus"
namespace = "AWS/Route53"
period = 60
statistic = "Minimum"
threshold = 1
alarm_description = "Cell ${count.index} is unhealthy"
dimensions = {
HealthCheckId = aws_route53_health_check.cell[count.index].id
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
See Observability for alerting strategies.
Automatic Failover
Route 53 with health checks provides automatic DNS-based failover:
Cell 1: HEALTHY → receives traffic
Cell 2: HEALTHY → receives traffic
Cell 3: UNHEALTHY → removed from DNS rotation
Users currently routed to Cell 3:
- New requests: DNS resolves to Cell 1 or Cell 2 only
- Existing connections: Continue until DNS TTL expires (typically 60 seconds), then reconnect to healthy cell
Failover timeline:
Total failover time: ~4 minutes (failure detection + DNS propagation + client reconnection). This is acceptable for most applications, but mission-critical systems may need faster failover.
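The total can be decomposed into its components. An illustrative budget, assuming a 30-second check interval, a failure threshold of 3, and a 60-second TTL (resolver caching beyond the TTL adds slack on top):

```typescript
// Rough DNS-failover budget in seconds: time for consecutive health checks
// to mark the cell unhealthy, plus time for cached DNS answers to expire,
// plus time for clients to reconnect. Illustrative, not a guarantee.
function failoverBudgetSeconds(
  checkIntervalS: number,
  failureThreshold: number,
  dnsTtlS: number,
  clientReconnectS: number,
): number {
  const detectionS = checkIntervalS * failureThreshold;
  return detectionS + dnsTtlS + clientReconnectS;
}
```

`failoverBudgetSeconds(30, 3, 60, 30)` yields 180 seconds; misbehaving resolvers and slow clients push the real-world figure toward the ~4 minutes quoted above.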
Faster failover with client-side retries:
// TypeScript client with automatic retry to fallback cell
async function callAPI(endpoint: string, maxRetries = 2): Promise<Response> {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const response = await fetch(endpoint, { signal: AbortSignal.timeout(5000) }); // fetch has no timeout option
if (response.ok) return response;
// Server error - might be cell failure
if (response.status >= 500) {
console.warn(`Cell returned ${response.status}, retrying...`);
continue;
}
return response; // Client error - don't retry
} catch (error) {
console.error(`Request failed (attempt ${attempt + 1}):`, error);
if (attempt === maxRetries - 1) throw error;
// Exponential backoff
await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, attempt)));
}
}
throw new Error('Max retries exceeded');
}
Clients retry failed requests immediately, achieving sub-second failover from the user's perspective.
See Spring Boot Resilience for retry patterns and circuit breakers.
Disaster Recovery with Cells
Cell-based architecture provides natural disaster recovery capabilities, but requires planning for regional failures.
Multi-Region Active-Active
Run cells in multiple AWS regions, all actively serving traffic:
Failure scenario: Entire us-east-1 region goes down.
Impact: Cells 1-3 fail (33% of cells). Route 53 health checks detect failure and remove those cells from DNS. Remaining cells (4-9) absorb the traffic.
Data considerations:
- Shard-based cells: Each user's data exists in only one region. Users whose data is in us-east-1 cannot access the system until the region recovers, OR you replicate data cross-region (complex).
- Replicated data: Global reference data remains available (replicated to all regions).
Trade-off: True active-active multi-region requires cross-region data replication with complex consistency management. See Disaster Recovery for multi-region strategies.
Multi-Region Active-Passive
Primary cells in one region (us-east-1), standby cells in another (us-west-2):
Normal operation: All traffic goes to us-east-1 cells. us-west-2 cells are running but idle (or scaled to zero).
Disaster scenario: us-east-1 region fails.
Recovery steps:
- Route 53 health checks detect us-east-1 cells unhealthy
us-west-2 - Standby cells scale up to handle full traffic load
- Database replicas in
us-west-2promoted to primary
RTO: 5-15 minutes (DNS propagation + standby scale-up + database promotion)
RPO: Depends on replication lag. Aurora Global Database provides <1 second RPO.
Cost optimization: Run standby cells at minimal capacity (single small instance) to reduce costs. Scale up only during failover.
See Disaster Recovery for RTO/RPO planning.
Deployment Strategies for Cells
Phased Rollouts
Deploy changes to cells sequentially, monitoring each before proceeding:
#!/bin/bash
# deploy-cells.sh - Progressive cell deployment with validation
CELLS=("cell-1" "cell-2" "cell-3" "cell-4" "cell-5")
for CELL in "${CELLS[@]}"; do
echo "Deploying to $CELL..."
# Deploy to cell
kubectl --context=$CELL apply -f k8s/
kubectl --context=$CELL rollout status deployment/app --timeout=5m
# Wait for deployment to stabilize
echo "Monitoring $CELL for 15 minutes..."
sleep 900
# Validate health metrics
ERROR_RATE=$(curl -s "https://$CELL.api.example.com/metrics" | jq '.error_rate')
LATENCY_P99=$(curl -s "https://$CELL.api.example.com/metrics" | jq '.latency_p99')
# Check if metrics are healthy
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "ERROR: High error rate in $CELL ($ERROR_RATE). Stopping rollout."
exit 1
fi
if (( $(echo "$LATENCY_P99 > 1000" | bc -l) )); then
echo "ERROR: High latency in $CELL (${LATENCY_P99}ms). Stopping rollout."
exit 1
fi
echo "$CELL is healthy. Proceeding to next cell."
done
echo "All cells deployed successfully!"
Benefits:
- Early detection of issues (first cell acts as canary)
- Limited blast radius (stop before deploying to all cells)
- Time to observe and react (15-minute soak time per cell)
Canary Cells
Designate one cell as a canary for new deployments:
Cell 1: Canary (5% of traffic) → Deploy here first
Cell 2: Production (31.67% of traffic)
Cell 3: Production (31.67% of traffic)
Cell 4: Production (31.67% of traffic)
Process:
- Deploy new version to Cell 1 only
- Monitor for 24-48 hours
- If healthy, deploy to remaining cells
Advantages:
- Minimal blast radius for risky changes (only 5% of users)
- Real production traffic validation (not just synthetic tests)
- Fast rollback (revert Cell 1 only)
Route 53 weighted routing for canary:
resource "aws_route53_record" "cell_1_canary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  weighted_routing_policy {
    weight = 5 # 5% of traffic
  }

  set_identifier = "cell-1-canary"

  alias {
    name                   = aws_lb.cell_1_alb.dns_name
    zone_id                = aws_lb.cell_1_alb.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "cell_2_production" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  weighted_routing_policy {
    weight = 32 # ~32% of traffic
  }

  set_identifier = "cell-2"

  alias {
    name                   = aws_lb.cell_2_alb.dns_name
    zone_id                = aws_lb.cell_2_alb.zone_id
    evaluate_target_health = true
  }
}
# Repeat for cells 3 and 4...
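One subtlety worth keeping in mind: Route 53 weights are relative, not percentages. Each record receives weight / sum(all weights) of the traffic, so with weights 5/32/32/32 (total 101) the canary actually gets about 4.95%. A minimal sketch of that arithmetic (the `WeightMath` class name is illustrative, not part of any AWS SDK):

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WeightMath {
    // Convert Route 53 relative weights into actual traffic fractions.
    // Each record gets weight / sum(weights), not weight / 100.
    public static Map<String, Double> trafficShares(Map<String, Integer> weights) {
        double total = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> shares = new LinkedHashMap<>();
        weights.forEach((id, w) -> shares.put(id, w / total));
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("cell-1-canary", 5);
        weights.put("cell-2", 32);
        weights.put("cell-3", 32);
        weights.put("cell-4", 32);
        // Total is 101, so the canary receives 5/101, slightly under 5%.
        trafficShares(weights).forEach((id, share) ->
            System.out.printf(Locale.ROOT, "%s: %.2f%%%n", id, share * 100));
    }
}
```

If you want the weights to read as exact percentages, choose values that sum to 100 (e.g. 5/32/32/31 or 5/95 split across records).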
Blue-Green at Cell Level
Maintain two sets of cells (blue and green), switch all traffic atomically:
Deployment process:
- Green cells run new version but receive no traffic
- Test green cells with synthetic traffic
- Update Route 53 to shift traffic to green cells (change DNS weights)
- Monitor green cells under production load
- If issues arise, shift traffic back to blue cells (fast rollback)
- Once stable, decommission blue cells or repurpose as new standby
Advantages:
- Instant rollback (just change DNS)
- Full production validation before cutover
- Zero-downtime deployments
Disadvantages:
- Cost: Running double infrastructure during transition
- Data migration: Database schema changes require careful planning (must be compatible with both versions)
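The atomic cutover in step 3 boils down to recomputing DNS weights: blue cells go to zero, green cells take all traffic, and the rollback is the same operation with the arguments reversed. A minimal sketch of that weight computation (the `BlueGreenCutover` class is hypothetical; the computed weights would be applied through a single Route 53 ChangeResourceRecordSets change batch so they take effect together):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BlueGreenCutover {
    // New Route 53 weights for a blue -> green cutover: drain blue to 0,
    // send all traffic to green. Reversing the arguments gives the rollback.
    public static Map<String, Integer> cutoverWeights(
            Iterable<String> blueCells, Iterable<String> greenCells) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String cell : blueCells) {
            weights.put(cell, 0);   // blue receives no traffic after cutover
        }
        for (String cell : greenCells) {
            weights.put(cell, 100); // green takes the full load
        }
        return weights;
    }
}
```

The same idea supports a gradual shift (90/10, then 50/50, then 0/100) by using intermediate weights instead of a hard flip, at the cost of running mixed versions for longer.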
See GitLab CI/CD Pipelines for blue-green automation.
Common Anti-Patterns
Cells Too Small
Problem: Creating too many cells increases operational overhead without meaningful blast radius reduction.
Example: 100 cells in a system serving 10,000 users → 100 users per cell.
Why it's bad:
- 100 separate deployments to manage
- 100 separate database instances to maintain
- Diminishing returns: difference between 99% and 99.9% availability is small for most applications
- Increased costs (infrastructure overhead per cell)
Rule of thumb: Aim for 3-10 cells initially. Add more as scale demands.
Cells Too Large
Problem: Cells so large that a single cell failure is catastrophic.
Example: 2 cells in a mission-critical system → 50% of users affected by one cell failure.
Why it's bad:
- Blast radius too large (half your users impacted)
- Defeats the purpose of cell-based architecture
Rule of thumb: Ensure a single cell failure affects <20% of users (5+ cells minimum).
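Both rules of thumb reduce to simple arithmetic: with n equal-sized cells, one failure affects 1/n of users, so a maximum acceptable blast radius b requires at least ceil(1/b) cells. A small sketch of that sizing math (the `CellSizing` class is illustrative):

```java
public class CellSizing {
    // With n equal cells, a single cell failure affects 1/n of users.
    public static double blastRadius(int numberOfCells) {
        return 1.0 / numberOfCells;
    }

    // Minimum cell count so that one cell failure affects at most
    // maxBlastRadius of users (e.g. 0.20 -> 5 cells).
    public static int minCells(double maxBlastRadius) {
        return (int) Math.ceil(1.0 / maxBlastRadius);
    }
}
```

For example, `minCells(0.20)` gives the 5-cell minimum quoted above, while 2 cells means `blastRadius(2)` = 0.5, i.e. half your users are impacted by one failure.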
Tight Coupling Between Cells
Problem: Cells depend on each other for normal operation.
Example:
// BAD: Cell 1 calls Cell 2's database directly
public class PaymentService {
    public PaymentResult processPayment(PaymentRequest request) {
        // Cross-cell database call - creates dependency!
        Account account = cell2DatabaseClient.getAccount(request.accountId);
        if (account.balance < request.amount) {
            return PaymentResult.rejected("Insufficient funds");
        }
        // ... process payment
    }
}
Why it's bad:
- Cell 1 cannot operate if Cell 2 is down
- Cascading failures (Cell 2 overload affects Cell 1)
- Defeats isolation benefits
Fix: Each cell owns its data. If you need cross-cell data, replicate it or use event-driven patterns:
// GOOD: Cell has local replica of account data
public class PaymentService {
    private final AccountReadRepository accountReadRepository; // Local replica

    public PaymentResult processPayment(PaymentRequest request) {
        // Read from local replica - no cross-cell dependency
        Account account = accountReadRepository.findById(request.accountId);
        if (account.balance < request.amount) {
            return PaymentResult.rejected("Insufficient funds");
        }
        // ... process payment
    }
}
Ignoring Cell Affinity
Problem: Users randomly moved between cells, breaking session state or data locality.
Example: User routed to Cell 1 at 10:00 AM, then to Cell 2 at 10:05 AM after the DNS TTL expires and resolution returns a different cell.
Why it's bad:
- Session loss (user logged out unexpectedly)
- Inefficient database queries (user's data is in Cell 1, but Cell 2 must fetch it cross-cell or fail)
Fix: Use consistent routing (hash-based, or sticky DNS with long TTL):
// Consistent routing ensures same user always goes to same cell
public class CellRouter {
    public String getCellForUser(String userId) {
        // Same user ID always hashes to same cell
        // (floorMod avoids a negative index when hashCode() is negative)
        int cellIndex = Math.floorMod(userId.hashCode(), numberOfCells);
        return "cell-" + cellIndex;
    }
}
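One caveat with plain modulo routing: changing numberOfCells remaps almost every user to a different cell, breaking affinity in bulk during a scale-out. A consistent-hash ring limits that: adding or removing a cell only remaps the users whose positions land on it. A minimal stdlib-only sketch (class and constants are illustrative, not the document's routing layer; MD5 is used only as a stable, well-distributed hash, not for security):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRouter {
    private static final int VIRTUAL_NODES = 100; // replicas per cell for even spread
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRouter(Iterable<String> cells) {
        for (String cell : cells) addCell(cell);
    }

    public void addCell(String cell) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(cell + "#" + i), cell);
        }
    }

    public void removeCell(String cell) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(cell + "#" + i));
        }
    }

    // Same user ID always maps to the same cell; only users whose ring
    // positions fall on an added/removed cell are remapped.
    public String getCellForUser(String userId) {
        long h = hash(userId);
        SortedMap<Long, String> tail = ring.tailMap(h); // first cell at or after h
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (digest[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is complexity: modulo routing is fine if the cell count is fixed, while the ring pays off when cells are added or drained regularly.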
Insufficient Health Checks
Problem: Health checks only verify application is running, not that it can serve traffic correctly.
Example:
// BAD: Only checks if process is alive
@GetMapping("/health")
public String health() {
    return "OK"; // Always returns OK even if database is down!
}
Why it's bad:
- Cell marked healthy even though database is down
- Users routed to broken cell
- Delayed incident detection
Fix: Check all critical dependencies:
// GOOD: Comprehensive health check
@GetMapping("/health/ready")
public ResponseEntity<String> readiness() {
    boolean dbHealthy = checkDatabase();
    boolean cacheHealthy = checkCache();
    boolean downstreamHealthy = checkDownstreamServices();

    if (dbHealthy && cacheHealthy && downstreamHealthy) {
        return ResponseEntity.ok("READY");
    } else {
        return ResponseEntity.status(503).body("NOT READY");
    }
}
No Graceful Degradation
Problem: Cell failure results in hard errors rather than degraded functionality.
Example: User cannot view account balance because the account service cell is down, even though the data is cached.
Fix: Implement fallbacks:
@Service
public class AccountService {
    private static final Logger logger = LoggerFactory.getLogger(AccountService.class);

    private final AccountServiceClient accountClient;
    private final RedisTemplate<String, Account> cache;

    /**
     * Get account with fallback to cache if primary cell is down.
     */
    public Optional<Account> getAccount(String accountId) {
        try {
            // Try primary cell
            return Optional.of(accountClient.getAccount(accountId));
        } catch (Exception e) {
            // Primary cell down - fall back to cached data
            logger.warn("Account service unavailable, using cached data for {}", accountId);
            Account cached = cache.opsForValue().get("account:" + accountId);
            return Optional.ofNullable(cached);
        }
    }
}
Users get slightly stale data instead of complete failure.
Cost Implications
Cell-based architecture increases infrastructure costs. Optimize by:
Right-Sizing Cells
Size each cell for its share of traffic, not full capacity:
- Traditional HA: 2 large instances (active-passive) sized for 100% traffic
- Cell-based: 3 cells with smaller instances sized for 33% traffic each
Example cost comparison:
| Architecture | Instances | Instance Size | Monthly Cost |
|---|---|---|---|
| Active-Passive HA | 2 × db.r6g.2xlarge | 8 vCPU, 64 GB | $1,560 |
| 3 Cells | 3 × db.r6g.large | 2 vCPU, 16 GB | $1,170 |
Even though you run more instances, total cost is lower: an active-passive standby must be sized for the full load while sitting idle, whereas each cell is sized only for its share of traffic.
Aurora Serverless v2 for Variable Load
Use Aurora Serverless v2 for cells with variable traffic:
resource "aws_rds_cluster" "cell_db" {
  count              = var.number_of_cells
  cluster_identifier = "cell-${count.index}-db"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # Scale down to 0.5 ACU during low traffic
    max_capacity = 4.0 # Scale up to 4 ACU during peak
  }
}

resource "aws_rds_cluster_instance" "cell_db_instance" {
  count              = var.number_of_cells
  cluster_identifier = aws_rds_cluster.cell_db[count.index].id
  instance_class     = "db.serverless" # Required for Serverless v2 scaling
  engine             = aws_rds_cluster.cell_db[count.index].engine
}
Cost savings: Pay for actual usage rather than peak capacity. Cells scale down automatically during low traffic periods.
Spot Instances for Non-Critical Cells
Use EC2 Spot Instances for canary cells or lower-priority traffic:
resource "aws_eks_node_group" "cell_canary_spot" {
  cluster_name    = aws_eks_cluster.cell_canary.name
  node_group_name = "canary-spot-nodes"

  scaling_config {
    desired_size = 2
    max_size     = 5
    min_size     = 1
  }

  capacity_type  = "SPOT" # Use Spot instances (60-90% cheaper)
  instance_types = ["t3.large", "t3a.large", "t2.large"] # Multiple types for availability
}
Savings: 60-90% cost reduction. Acceptable for canary cells (small blast radius if Spot instances reclaimed).
Reserved Instances for Production Cells
Commit to production cells with Reserved Instances or Savings Plans:
- 1-year commitment: ~30% savings
- 3-year commitment: ~50% savings
Reserve capacity for your baseline cell infrastructure. Use On-Demand for burst capacity.
See AWS Cost Optimization for detailed cost strategies.
Summary
Cell-based architecture is a powerful resilience pattern that limits blast radius by partitioning infrastructure into isolated, independently operating cells. Key principles:
- Failure domain isolation: Each cell is a complete stack (compute, database, network). Failures are contained to one cell.
- Data partitioning: Users' data lives in one cell (sharding) or is replicated to all cells (global data). Minimize cross-cell dependencies.
- Routing: Use DNS-based routing (Route 53) for simple multi-region cells, or application-layer routing (consistent hashing) for fine-grained shard-based cells.
- Health checks and failover: Comprehensive health checks at all layers. Route 53 automatically removes unhealthy cells from DNS.
- Deployment strategies: Deploy to cells sequentially (phased rollout) or use canary cells to minimize risk.
- Cost optimization: Right-size cells, use serverless databases, leverage Spot instances for non-critical cells.
When to use cell-based architecture:
- Large-scale systems where availability is critical
- Applications that can tolerate eventual consistency
- Systems with natural data partitioning (multi-tenant, geographic)
When NOT to use:
- Small-scale applications (operational overhead outweighs benefits)
- Applications requiring strong cross-shard consistency
- Tightly coupled monoliths (refactor to microservices first)
Further Reading
- AWS Well-Architected Framework - Reliability Pillar
- Disaster Recovery and Business Continuity - RTO/RPO planning, multi-region strategies
- Microservices Architecture - Service decomposition and isolation patterns
- Spring Boot Resilience - Circuit breakers, retries, timeouts
- AWS Route 53 DNS - DNS routing policies and health checks
- AWS Databases - RDS, Aurora, DynamoDB patterns
- Event-Driven Architecture - Cross-cell event propagation
- Chaos Engineering - Testing cell failure scenarios