Cell-Based Architecture on AWS
Cell-based architecture is a resilience pattern that partitions your infrastructure and application into isolated, independently operating units called "cells." Each cell is a complete, self-contained deployment of your application stack that can handle a subset of your total traffic. When one cell fails, the blast radius is limited to only that cell's traffic - other cells continue operating normally.
This architectural pattern emerged from companies running large-scale distributed systems where traditional approaches to fault tolerance proved insufficient. The fundamental insight is that reducing the scope of failure is often more effective than trying to prevent all failures. By constraining the impact of any single failure to a bounded cell, you achieve better overall availability than attempting to build a perfectly reliable monolithic system.
The key principle underlying cell-based architecture is failure domain isolation: every component, from compute to databases to network infrastructure, exists only within a single cell. Dependencies between cells are minimized or eliminated entirely. This creates a "bulkhead" effect where failure in one compartment cannot flood others.
Why Cell-Based Architecture?
Traditional high-availability patterns like redundant servers, load balancing, and database replication provide resilience against individual component failures, but they share a common weakness: correlated failures. A bad code deployment, a configuration error, a cascading overload, or a regional outage can affect all instances simultaneously.
The Blast Radius Problem
Consider a standard multi-AZ deployment in AWS:
Failure scenarios with full blast radius:
- Bad deployment: New code has a critical bug → all instances fail simultaneously
- Database corruption: Logic error corrupts the shared database → entire application affected
- Configuration error: Wrong environment variable deployed to all instances → complete outage
- Resource exhaustion: Traffic spike overwhelms the single RDS instance → all app servers blocked
- Regional event: AWS region-level issue → entire application offline
In each case, the blast radius is 100% of your users. Cell-based architecture limits this to a fraction.
How Cells Limit Blast Radius
With cell-based architecture, the same failure affects only one cell:
If Cell 1 experiences a bad deployment, database corruption, or configuration error, only 33% of users are affected. Cells 2 and 3 continue serving traffic normally, so the other 67% of users see no impact during the incident.
Mathematical advantage: With N cells, a single-cell failure affects at most 1/N of your traffic. With 10 cells, you maintain 90% availability even when one cell is completely down.
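The arithmetic is simple enough to state as code. A minimal sketch (cell and failure counts are illustrative):

```typescript
// Fraction of traffic affected when `failedCells` of `totalCells` fail,
// assuming traffic is spread evenly across cells.
function affectedFraction(totalCells: number, failedCells: number): number {
  if (totalCells <= 0 || failedCells < 0 || failedCells > totalCells) {
    throw new RangeError("invalid cell counts");
  }
  return failedCells / totalCells;
}

// Fraction of users still served during the incident.
function availabilityDuringFailure(totalCells: number, failedCells: number): number {
  return 1 - affectedFraction(totalCells, failedCells);
}
```

With 10 cells and one down, `availabilityDuringFailure(10, 1)` returns 0.9, matching the 90% figure above.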
Cell Design Patterns
Region-Based Cells
Each AWS region becomes a cell. This provides the strongest isolation (separate physical infrastructure) and is the simplest to implement, but provides the coarsest granularity.
Advantages:
- Geographic diversity: Protection against regional outages, natural disasters
- Regulatory compliance: Data sovereignty requirements (EU data stays in EU)
- Latency optimization: Route users to nearest region
- Strong isolation: Completely separate AWS infrastructure
Disadvantages:
- Higher cost: Full infrastructure in each region (compute, databases, networking)
- Data replication complexity: Cross-region data synchronization has latency and consistency challenges
- Coarse granularity: The smallest failure domain is an entire region, which may serve a large share of your users
When to use: Global applications with geographic distribution requirements, regulatory compliance needs, or budget for multi-region deployment.
See Disaster Recovery for multi-region active-active and active-passive patterns.
Availability Zone-Based Cells
Cells are defined by AWS availability zones within a single region. This provides finer granularity than region-based cells while maintaining strong physical isolation.
Advantages:
- Lower latency: Cross-AZ communication within a region is fast (<2ms typically)
- Lower cost: Data transfer within a region is cheaper than cross-region
- Balanced isolation: AZs are physically separate (different buildings, power, networking)
- Simpler data replication: Lower latency makes synchronous replication feasible
Disadvantages:
- Regional dependency: Regional AWS issues affect all cells
- Limited geographic diversity: All cells in one geographic area
- Cost of multi-AZ databases: Running separate RDS instances per AZ increases database costs
When to use: Applications that need strong isolation within a region, or when multi-region cost is prohibitive but you still want cell-based resilience.
Shard-Based Cells
Cells are logical partitions based on data sharding (e.g., customer ID ranges, tenant ID). Multiple cells can exist within the same region and even the same AZ, but each cell has dedicated infrastructure.
Advantages:
- Fine-grained control: Adjust cell size and count as traffic grows
- Efficient resource usage: Size cells based on actual load, not geography
- Easy to add cells: Add new cells without deploying to new regions
- Cost-effective: Can run multiple cells in same region/AZ with less overhead
Disadvantages:
- Complex routing: Need application-level routing logic to direct users to cells
- Shared fate: Cells in the same AZ share infrastructure failure risks
- Data partitioning challenges: Sharding strategy must be chosen carefully to balance load
- No geographic diversity: All cells vulnerable to regional issues
When to use: High-scale applications where fine-grained blast radius control is critical, or multi-tenant SaaS where tenants can be isolated into cells.
Hybrid Approaches
Real-world architectures often combine multiple cell types:
Example strategy:
- Primary cells by region (geographic resilience)
- Secondary cells by shard within each region (blast radius control)
- Result: Regional outage affects only one primary cell; bad deployment affects only one shard
This layered approach provides both geographic diversity and fine-grained failure isolation.
Routing Strategies
Routing users to cells is a critical design decision. The routing mechanism determines which users are affected by cell failures and how quickly you can shift traffic.
Route 53 Routing Policies
AWS Route 53 provides several routing policies suitable for cell-based architectures. See Route 53 DNS for detailed configuration.
Weighted Routing
Distribute traffic across cells by percentage:
Cell 1: 33% of traffic
Cell 2: 33% of traffic
Cell 3: 34% of traffic
How it works: Route 53 returns a cell's endpoint with probability proportional to its weight. Because resolvers cache the answer for the record's TTL, a user who receives Cell 1's IP address keeps routing to Cell 1 until that cache expires - stickiness is approximate, not guaranteed.
Code example:
# Terraform configuration for weighted routing
resource "aws_route53_record" "cell_1" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
weighted_routing_policy {
weight = 33
}
set_identifier = "cell-1"
alias {
name = aws_lb.cell_1_alb.dns_name
zone_id = aws_lb.cell_1_alb.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "cell_2" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
weighted_routing_policy {
weight = 33
}
set_identifier = "cell-2"
alias {
name = aws_lb.cell_2_alb.dns_name
zone_id = aws_lb.cell_2_alb.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "cell_3" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
weighted_routing_policy {
weight = 34
}
set_identifier = "cell-3"
alias {
name = aws_lb.cell_3_alb.dns_name
zone_id = aws_lb.cell_3_alb.zone_id
evaluate_target_health = true
}
}
evaluate_target_health = true is critical: if a cell's ALB health checks fail, Route 53 automatically stops routing traffic to that cell. This provides automatic failover.
Advantages:
- Simple to configure and understand
- Automatic failover via health checks
- Gradual traffic shifts (change weights to drain a cell)
Disadvantages:
- No stickiness guarantee: DNS TTL expiration can move users between cells
- Cannot control which specific users go to which cell
- DNS caching delays propagation of weight changes (typically 60-300 seconds)
When to use: Stateless applications, or stateful applications with cross-cell session storage (see Data Partitioning).
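Route 53 does not require weights to sum to 100; each record's expected share is its weight divided by the total. That normalization can be sketched directly (the weights are illustrative):

```typescript
// Expected long-run traffic share per record under weighted routing:
// each record is chosen with probability weight / (sum of all weights).
function trafficShares(weights: number[]): number[] {
  const total = weights.reduce((sum, w) => sum + w, 0);
  if (total <= 0) throw new RangeError("weights must sum to a positive number");
  return weights.map(w => w / total);
}
```

Setting a weight to 0 drains that cell: its share drops to zero while the remaining cells absorb its traffic.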
Geolocation Routing
Route users based on geographic location:
Users in North America → Cell 1 (us-east-1)
Users in Europe → Cell 2 (eu-west-1)
Users in Asia → Cell 3 (ap-southeast-1)
Default → Cell 1
Advantages:
- Latency optimization (users routed to nearest cell)
- Data sovereignty compliance (EU users stay in EU)
- Predictable cell assignment per geography
Disadvantages:
- Uneven load distribution (depends on user geography)
- Requires multi-region deployment
- Failover requires updating DNS or using health checks to redirect entire geographies
When to use: Global applications with regional compliance requirements or latency sensitivity.
See Route 53 DNS for examples.
Latency-Based Routing
Route users to the cell with the lowest network latency:
User in Boston → Cell in us-east-1 (10ms latency)
User in Boston → Cell in eu-west-1 (80ms latency)
Result: Route to us-east-1
AWS Route 53 measures latency from users' DNS resolvers to your cells and selects the lowest-latency option.
Advantages:
- Best user experience (automatic latency optimization)
- Adapts to changing network conditions
- No manual geography mapping required
Disadvantages:
- Unpredictable load distribution
- DNS resolver location may not match user location
- Harder to reason about which users are in which cell
When to use: Global applications prioritizing user experience where predictable cell assignment is less important.
Application-Layer Routing
For shard-based cells, routing happens at the application layer rather than DNS.
Consistent Hashing
Map users to cells using a hash function:
public class CellRouter {
private final List<String> cellEndpoints;
private final ConsistentHash<String> hashRing;
public CellRouter(List<String> cellEndpoints) {
this.cellEndpoints = cellEndpoints;
// Create consistent hash ring with virtual nodes for better distribution
this.hashRing = new ConsistentHash<>(
Hashing.murmur3_128(),
100, // virtual nodes per cell
cellEndpoints
);
}
/**
* Determine which cell should handle this user.
* Same user always routes to same cell (unless cells are added/removed).
*/
public String getCellForUser(String userId) {
return hashRing.get(userId);
}
/**
* Get the cell endpoint URL for API calls.
*/
public String getCellEndpoint(String userId) {
String cellId = getCellForUser(userId);
return "https://" + cellId + ".api.example.com";
}
}
Consistent hashing properties:
- Deterministic: Same user always routes to same cell
- Minimal disruption: Adding/removing cells only remaps ~1/N of users (N = number of cells)
- Even distribution: Virtual nodes ensure balanced load across cells
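These properties can be demonstrated with a minimal ring. The sketch below uses FNV-1a for brevity; a production ring would use a stronger hash such as MurmurHash:

```typescript
// 32-bit FNV-1a hash (illustrative; weak but deterministic).
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Minimal consistent-hash ring with virtual nodes.
class HashRing {
  private ring: { point: number; cell: string }[] = [];

  constructor(cells: string[], private vnodes: number = 100) {
    for (const cell of cells) this.add(cell);
  }

  // Each cell contributes `vnodes` points so load spreads evenly.
  add(cell: string): void {
    for (let v = 0; v < this.vnodes; v++) {
      this.ring.push({ point: fnv1a(cell + "#" + v), cell });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // The first virtual node clockwise from the key's hash owns the key.
  get(key: string): string {
    const h = fnv1a(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].cell;
  }
}
```

Adding a fourth cell remaps only the keys whose nearest clockwise virtual node changes - roughly a quarter of them - while every other key keeps its cell.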
Disadvantages:
- Sticky failures: If a user's cell is down, that user cannot access the system (no automatic failover)
- Complex implementation: Requires custom routing logic in the application
- Client-side routing: Mobile/web clients must know the routing algorithm, or use a routing service
Mitigation for sticky failures: Implement fallback routing:
public String getCellEndpointWithFallback(String userId) {
String primaryCell = getCellForUser(userId);
// Check if primary cell is healthy
if (healthChecker.isHealthy(primaryCell)) {
return "https://" + primaryCell + ".api.example.com";
}
// Fall back to next cell in hash ring
String fallbackCell = hashRing.getNext(userId);
logger.warn("Cell {} unhealthy for user {}, failing over to {}",
primaryCell, userId, fallbackCell);
return "https://" + fallbackCell + ".api.example.com";
}
This trades off stickiness (user might move cells during incident) for availability (user can still access system).
API Gateway with Lambda Authorizer
Use AWS API Gateway with a Lambda authorizer to route requests:
Lambda Authorizer code:
// Lambda authorizer for cell routing
import { APIGatewayTokenAuthorizerEvent, APIGatewayAuthorizerResult } from 'aws-lambda';
import * as jwt from 'jsonwebtoken';
const CELL_ENDPOINTS = [
'https://cell1.internal.example.com',
'https://cell2.internal.example.com',
'https://cell3.internal.example.com',
];
function hashUserId(userId: string): number {
// Simple hash - use MurmurHash or similar for production
let hash = 0;
for (let i = 0; i < userId.length; i++) {
hash = ((hash << 5) - hash) + userId.charCodeAt(i);
hash = hash & hash; // Convert to 32-bit integer
}
return Math.abs(hash);
}
export async function handler(event: APIGatewayTokenAuthorizerEvent): Promise<APIGatewayAuthorizerResult> {
try {
// Verify JWT token
const token = event.authorizationToken.replace('Bearer ', '');
const decoded = jwt.verify(token, process.env.JWT_SECRET!) as { userId: string };
// Route to cell based on user ID
const cellIndex = hashUserId(decoded.userId) % CELL_ENDPOINTS.length;
const cellEndpoint = CELL_ENDPOINTS[cellIndex];
return {
principalId: decoded.userId,
policyDocument: {
Version: '2012-10-17',
Statement: [{
Action: 'execute-api:Invoke',
Effect: 'Allow',
Resource: event.methodArn,
}],
},
context: {
userId: decoded.userId,
cellEndpoint: cellEndpoint, // Pass to integration
},
};
} catch (error) {
throw new Error('Unauthorized');
}
}
Advantages:
- Centralized routing logic (easier to update)
- Transparent to clients (just use one API Gateway URL)
- Can implement complex routing rules (A/B testing, canary rollouts)
Disadvantages:
- API Gateway cost (per request)
- Lambda authorizer latency (though caching helps)
- Additional moving parts
When to use: When you need centralized control over routing, or clients cannot implement routing logic.
See API Gateway for detailed integration patterns.
Data Partitioning Across Cells
Data partitioning is the most challenging aspect of cell-based architecture. The goal is to ensure each cell can operate independently without relying on data in other cells.
Cell-Local Data (Sharding)
Each cell stores a subset of the total dataset. A user's data lives entirely in one cell.
Sharding strategy: Determine the partition key (e.g., user ID, tenant ID) and distribution algorithm (e.g., hash, range).
Example: Hash-based sharding:
// Determine which cell's database contains this user's data
public class DataPartitioner {
private final int numberOfCells;
public DataPartitioner(int numberOfCells) {
this.numberOfCells = numberOfCells;
}
public int getCellNumber(String userId) {
// floorMod avoids a negative result when hashCode() is negative
return Math.floorMod(userId.hashCode(), numberOfCells);
}
public String getDatabaseEndpoint(String userId) {
int cellNumber = getCellNumber(userId);
return String.format("cell-%d-db.internal.example.com", cellNumber);
}
}
Advantages:
- True isolation: one cell's data corruption doesn't affect others
- Scalable: add cells to increase capacity
- No cross-cell dependencies during normal operation
Disadvantages:
- Cannot query across shards: Global queries (e.g., "all users") require fanning out to all cells
- Rebalancing complexity: Adding cells requires data migration
- Hotspots: Uneven distribution can overload some cells (e.g., one celebrity user overwhelms a cell)
Handling hotspots: Monitor database load per cell. If a cell is consistently hot:
- Vertical scaling: Increase that cell's database size temporarily
- Sub-sharding: Split the hot shard into multiple cells
- Consistent hashing with virtual nodes: Improves distribution uniformity
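As a monitoring sketch, a hot cell can be flagged by comparing its load to the fleet average (the cell names and 1.5x threshold are illustrative):

```typescript
// Flag cells whose reported load exceeds the fleet average by `factor`.
function hotCells(loadByCell: Record<string, number>, factor: number = 1.5): string[] {
  const loads = Object.values(loadByCell);
  if (loads.length === 0) return [];
  const mean = loads.reduce((sum, l) => sum + l, 0) / loads.length;
  return Object.entries(loadByCell)
    .filter(([, load]) => load > mean * factor)
    .map(([cell]) => cell)
    .sort();
}
```

An alert on this signal gives you time to scale the hot cell vertically or plan a sub-shard split before it saturates.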
See Database Design for sharding patterns and AWS Databases for RDS/Aurora configuration.
Replicated Global Data
Some data must be accessible from all cells (e.g., product catalog, configuration). Replicate this data to each cell.
Replication strategies:
- Aurora Global Database: Primary in one region, read replicas in other regions (cross-region cells)
- DynamoDB Global Tables: Multi-region active-active replication (eventual consistency)
- Event-driven replication: Publish changes to EventBridge/SNS, consume in each cell
Example: Event-driven replication:
// Product service publishes product updates to EventBridge
@Service
public class ProductService {
private final ProductRepository productRepository;
private final EventBridgeClient eventBridge;
private final ObjectMapper objectMapper;
public ProductService(ProductRepository productRepository,
EventBridgeClient eventBridge,
ObjectMapper objectMapper) {
this.productRepository = productRepository;
this.eventBridge = eventBridge;
this.objectMapper = objectMapper;
}
public void updateProduct(Product product) throws JsonProcessingException {
productRepository.save(product);
// Publish event for cross-cell replication
PutEventsRequestEntry event = PutEventsRequestEntry.builder()
.source("product-service")
.detailType("ProductUpdated")
.detail(objectMapper.writeValueAsString(product))
.eventBusName("global-product-events")
.build();
eventBridge.putEvents(r -> r.entries(event));
}
}
// Each cell's product service consumes events to update local replica
@Component
public class ProductReplicationConsumer {
@SqsListener(queueNames = "${product.replication.queue}")
public void handleProductUpdate(ProductUpdatedEvent event) {
// Update local read replica
productReadRepository.upsert(event.getProduct());
}
}
Consistency model: Replicated data is eventually consistent. Updates propagate within seconds to minutes. Application logic must tolerate stale reads.
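Because events can arrive late, duplicated, or out of order, each cell's consumer should apply updates idempotently. One common guard is a monotonic version number; the sketch below (the event shape is hypothetical) drops anything older than what the replica already holds:

```typescript
// Order-tolerant replica updates: apply an event only if its version is
// newer than the locally held one.
interface ProductEvent {
  id: string;
  version: number;
  name: string;
}

class ProductReplica {
  private byId = new Map<string, ProductEvent>();

  // Returns true if the event was applied, false if stale or duplicate.
  apply(event: ProductEvent): boolean {
    const current = this.byId.get(event.id);
    if (current !== undefined && current.version >= event.version) {
      return false;
    }
    this.byId.set(event.id, event);
    return true;
  }

  get(id: string): ProductEvent | undefined {
    return this.byId.get(id);
  }
}
```

With this guard, redelivering or reordering events converges to the same replica state, which is exactly the property eventual consistency requires of the consumer.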
When to use:
- Read-heavy reference data (products, configurations)
- Data that changes infrequently
- Acceptable for reads to be slightly stale
See Event-Driven Architecture for event propagation patterns and AWS Messaging for EventBridge details.
Cross-Cell Session Storage
For stateful applications, session data must be accessible across cells to handle DNS routing changes or failover.
Options:
1. Centralized Session Store (ElastiCache Global Datastore)
Spring Boot configuration:
@Configuration
public class SessionConfig {
@Bean
public LettuceConnectionFactory redisConnectionFactory() {
RedisStandaloneConfiguration config = new RedisStandaloneConfiguration(
"cell-global-sessions.cache.amazonaws.com", 6379
);
return new LettuceConnectionFactory(config);
}
@Bean
public RedisTemplate<String, Object> redisTemplate() {
RedisTemplate<String, Object> template = new RedisTemplate<>();
template.setConnectionFactory(redisConnectionFactory());
return template;
}
}
// Use Spring Session for automatic session management
@Service
public class UserSessionService {
private final RedisTemplate<String, Object> redisTemplate;
public UserSessionService(RedisTemplate<String, Object> redisTemplate) {
this.redisTemplate = redisTemplate;
}
public void storeSession(String sessionId, UserSession session) {
redisTemplate.opsForValue().set(
"session:" + sessionId,
session,
Duration.ofHours(24)
);
}
public UserSession getSession(String sessionId) {
return (UserSession) redisTemplate.opsForValue().get("session:" + sessionId);
}
}
Advantages:
- Users can seamlessly switch cells (DNS routing change, failover)
- Centralized session visibility for debugging
Disadvantages:
- Cross-cell dependency: If the global session store is down, all cells cannot authenticate users (defeats cell isolation)
- Latency: Cross-region reads for session data add 50-200ms
- Cost: Global replication infrastructure
Mitigation: Use with graceful degradation. If session store is unavailable, fall back to requiring re-authentication.
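That degradation path can be sketched as follows (the store interface is hypothetical, and a real Redis client would be asynchronous):

```typescript
interface SessionStore {
  get(id: string): string | null;
}

type SessionResult =
  | { kind: "found"; session: string }
  | { kind: "not-found" }
  | { kind: "reauth-required" };

// Treat a store outage as "session unknown" rather than a hard failure,
// so the cell keeps serving and simply forces re-authentication.
function loadSession(store: SessionStore, id: string): SessionResult {
  try {
    const session = store.get(id);
    return session !== null ? { kind: "found", session } : { kind: "not-found" };
  } catch {
    return { kind: "reauth-required" };
  }
}
```

The key design choice is that the catch branch returns a degraded-but-valid result instead of propagating the error, so the global store is no longer a single point of failure for the cell.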
2. Shared-Nothing with JWT Tokens
Eliminate server-side session storage entirely. Store session data in JWT tokens:
@Service
public class JwtSessionService {
@Value("${jwt.secret}")
private String jwtSecret;
/**
* Create a JWT containing all session data.
* No server-side storage required.
*/
public String createSessionToken(User user) {
return Jwts.builder()
.setSubject(user.getId())
.claim("email", user.getEmail())
.claim("roles", user.getRoles())
.setIssuedAt(new Date())
.setExpiration(Date.from(Instant.now().plus(24, ChronoUnit.HOURS)))
.signWith(SignatureAlgorithm.HS256, jwtSecret)
.compact();
}
/**
* Validate and extract session data from JWT.
* Works in any cell - no database lookup needed.
*/
public UserSession validateAndExtract(String token) {
Claims claims = Jwts.parser()
.setSigningKey(jwtSecret)
.parseClaimsJws(token)
.getBody();
return new UserSession(
claims.getSubject(),
claims.get("email", String.class),
claims.get("roles", List.class)
);
}
}
Advantages:
- Perfect cell isolation: No shared session infrastructure
- Scalable: No session storage to manage
- Fast: No database/cache lookup per request
Disadvantages:
- Token size: Including all session data increases token size (impacts request size)
- Cannot invalidate: Once issued, JWT is valid until expiration (cannot force logout)
- Sensitive data exposure: Session data is visible to client (though signed to prevent tampering)
When to use: Stateless applications, or applications where forced logout is rare.
See Authentication for JWT patterns.
Cell Isolation Mechanisms
True cell isolation requires separating all infrastructure layers.
Network Isolation
Each cell has its own VPC or isolated subnets:
# Terraform: Create isolated VPC per cell
resource "aws_vpc" "cell" {
count = var.number_of_cells
cidr_block = "10.${count.index}.0.0/16"
tags = {
Name = "cell-${count.index}-vpc"
Cell = count.index
}
}
# Each cell has private subnets across AZs
resource "aws_subnet" "cell_private" {
count = var.number_of_cells * 3 # 3 AZs per cell
vpc_id = aws_vpc.cell[floor(count.index / 3)].id
cidr_block = "10.${floor(count.index / 3)}.${(count.index % 3) * 64}.0/18"
availability_zone = data.aws_availability_zones.available.names[count.index % 3]
tags = {
Name = "cell-${floor(count.index / 3)}-private-${count.index % 3}"
Cell = floor(count.index / 3)
}
}
Benefits:
- Network-level blast radius containment (one VPC misconfiguration doesn't affect other cells)
- Security group isolation (cannot accidentally open cross-cell traffic)
- Independent network troubleshooting
Cross-cell communication: Minimize but not eliminate. Use VPC peering or Transit Gateway only for essential global services (e.g., centralized logging).
See AWS Networking for VPC design patterns.
Compute Isolation
Separate ECS clusters, EKS clusters, or auto-scaling groups per cell:
# Example: Separate EKS cluster per cell
apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-config
  namespace: default
data:
  CELL_ID: "cell-1"
  CELL_REGION: "us-east-1"
  DATABASE_ENDPOINT: "cell-1-db.internal.example.com"
  CACHE_ENDPOINT: "cell-1-cache.internal.example.com"
  # Only talk to resources within this cell
Why separate clusters:
- Control plane isolation (Kubernetes API server failure affects only one cell)
- Independent cluster upgrades (roll out new Kubernetes version to one cell at a time)
- Resource quotas per cell (prevent one cell from starving others)
See AWS EKS for cluster design and AWS Compute for ECS patterns.
Database Isolation
Each cell has its own database instance (RDS, Aurora, DynamoDB table):
# Terraform: Separate RDS instance per cell
resource "aws_db_instance" "cell" {
count = var.number_of_cells
identifier = "cell-${count.index}-db"
engine = "postgres"
engine_version = "16.6"
instance_class = "db.r6g.large"
# Each cell's database in its own VPC
db_subnet_group_name = aws_db_subnet_group.cell[count.index].name
vpc_security_group_ids = [aws_security_group.cell_db[count.index].id]
# Isolated from other cells
publicly_accessible = false
tags = {
Name = "cell-${count.index}-database"
Cell = count.index
}
}
Cost consideration: Running N database instances costs N times as much as a single instance. Mitigate by:
- Right-sizing: Size each cell's database for 1/N of traffic, not full traffic
- Serverless databases: Aurora Serverless v2 scales down during low traffic
- Reserved instances: Commit to cell databases for cost savings
See AWS Databases for RDS optimization.
Deployment Isolation
Deploy changes to one cell at a time (canary deployment at cell level):
# GitLab CI: Deploy to cells sequentially
stages:
  - deploy_cell_1
  - validate_cell_1
  - deploy_cell_2
  - validate_cell_2
  - deploy_cell_3

deploy_cell_1:
  stage: deploy_cell_1
  script:
    - kubectl --context=cell-1 apply -f k8s/
    - kubectl --context=cell-1 rollout status deployment/app

validate_cell_1:
  stage: validate_cell_1
  script:
    - ./scripts/health-check.sh cell-1
    - ./scripts/smoke-test.sh cell-1
  # If validation fails, pipeline stops here - cells 2 and 3 not deployed

deploy_cell_2:
  stage: deploy_cell_2
  script:
    - kubectl --context=cell-2 apply -f k8s/
    - kubectl --context=cell-2 rollout status deployment/app
  when: on_success # Only if cell-1 validation passed
Progressive rollout strategy:
- Deploy to Cell 1 (smallest blast radius: 1/N of users)
- Monitor metrics for 30-60 minutes
- If healthy, deploy to Cell 2
- Continue until all cells updated
If issues are detected in Cell 1, stop the rollout. Only 1/N of users affected.
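The gate between cells reduces to a small predicate over the freshly deployed cell's metrics (the 1% error-rate and 1000 ms latency thresholds are illustrative):

```typescript
interface CellMetrics {
  errorRate: number; // fraction of failed requests, e.g. 0.003
  latencyP99Ms: number;
}

// Proceed to the next cell only while the current cell stays under
// both thresholds.
function shouldProceed(
  m: CellMetrics,
  maxErrorRate: number = 0.01,
  maxLatencyMs: number = 1000,
): boolean {
  return m.errorRate <= maxErrorRate && m.latencyP99Ms <= maxLatencyMs;
}
```

Encoding the gate as a pure function makes it easy to unit-test the rollout policy separately from the pipeline that enforces it.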
See GitLab CI/CD Pipelines for detailed pipeline patterns.
Cell Health Monitoring and Failover
Health Checks
Implement comprehensive health checks at multiple layers:
@RestController
public class HealthCheckController {
private final DataSource dataSource;
private final RedisTemplate<String, String> redis;
public HealthCheckController(DataSource dataSource, RedisTemplate<String, String> redis) {
this.dataSource = dataSource;
this.redis = redis;
}
/**
* Liveness check: Is the application running?
* Used by Kubernetes to restart crashed pods.
*/
@GetMapping("/health/live")
public ResponseEntity<Map<String, String>> liveness() {
return ResponseEntity.ok(Map.of("status", "UP"));
}
/**
* Readiness check: Can the application serve traffic?
* Used by load balancers to route traffic.
*/
@GetMapping("/health/ready")
public ResponseEntity<Map<String, Object>> readiness() {
Map<String, Object> health = new HashMap<>();
// Check database connectivity
boolean dbHealthy = checkDatabase();
health.put("database", dbHealthy ? "UP" : "DOWN");
// Check cache connectivity
boolean cacheHealthy = checkCache();
health.put("cache", cacheHealthy ? "UP" : "DOWN");
// Overall status
boolean healthy = dbHealthy && cacheHealthy;
health.put("status", healthy ? "UP" : "DOWN");
return ResponseEntity
.status(healthy ? 200 : 503)
.body(health);
}
private boolean checkDatabase() {
try (Connection conn = dataSource.getConnection()) {
return conn.isValid(5); // 5 second timeout
} catch (Exception e) {
return false;
}
}
private boolean checkCache() {
try {
redis.opsForValue().get("health-check-key");
return true;
} catch (Exception e) {
return false;
}
}
}
Health check layers:
- Application Load Balancer health checks: Check /health/ready every 30 seconds
- Route 53 health checks: Monitor ALB endpoints, remove unhealthy cells from DNS
- CloudWatch alarms: Alert on health check failures, high error rates, latency spikes
# Terraform: Route 53 health check for cell
resource "aws_route53_health_check" "cell" {
count = var.number_of_cells
fqdn = "cell-${count.index}.api.example.com"
port = 443
type = "HTTPS"
resource_path = "/health/ready"
failure_threshold = 3
request_interval = 30
tags = {
Name = "cell-${count.index}-health"
Cell = count.index
}
}
# CloudWatch alarm on health check failure
resource "aws_cloudwatch_metric_alarm" "cell_unhealthy" {
count = var.number_of_cells
alarm_name = "cell-${count.index}-unhealthy"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "HealthCheckStatus"
namespace = "AWS/Route53"
period = 60
statistic = "Minimum"
threshold = 1
alarm_description = "Cell ${count.index} is unhealthy"
dimensions = {
HealthCheckId = aws_route53_health_check.cell[count.index].id
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
See Observability for alerting strategies.
Automatic Failover
Route 53 with health checks provides automatic DNS-based failover:
Cell 1: HEALTHY → receives traffic
Cell 2: HEALTHY → receives traffic
Cell 3: UNHEALTHY → removed from DNS rotation
Users currently routed to Cell 3:
- New requests: DNS resolves to Cell 1 or Cell 2 only
- Existing connections: Continue until DNS TTL expires (typically 60 seconds), then reconnect to healthy cell
Failover timeline:
Total failover time: ~4 minutes (failure detection + DNS propagation + client reconnection). This is acceptable for most applications, but mission-critical systems may need faster failover.
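The total can be decomposed into its components. An illustrative budget, assuming a 30-second check interval, a failure threshold of 3, and a 60-second TTL (resolver caching beyond the TTL adds slack on top):

```typescript
// Rough DNS-failover budget in seconds: time for consecutive health checks
// to mark the cell unhealthy, plus time for cached DNS answers to expire,
// plus time for clients to reconnect. Illustrative, not a guarantee.
function failoverBudgetSeconds(
  checkIntervalS: number,
  failureThreshold: number,
  dnsTtlS: number,
  clientReconnectS: number,
): number {
  const detectionS = checkIntervalS * failureThreshold;
  return detectionS + dnsTtlS + clientReconnectS;
}
```

`failoverBudgetSeconds(30, 3, 60, 30)` yields 180 seconds; misbehaving resolvers and slow clients push the real-world figure toward the ~4 minutes quoted above.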
Faster failover with client-side retries:
// TypeScript client with automatic retry to fallback cell
async function callAPI(endpoint: string, maxRetries = 2): Promise<Response> {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const response = await fetch(endpoint, { signal: AbortSignal.timeout(5000) }); // fetch has no timeout option
if (response.ok) return response;
// Server error - might be cell failure
if (response.status >= 500) {
console.warn(`Cell returned ${response.status}, retrying...`);
continue;
}
return response; // Client error - don't retry
} catch (error) {
console.error(`Request failed (attempt ${attempt + 1}):`, error);
if (attempt === maxRetries - 1) throw error;
// Exponential backoff
await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, attempt)));
}
}
throw new Error('Max retries exceeded');
}
Clients retry failed requests immediately, achieving sub-second failover from the user's perspective.
See Spring Boot Resilience for retry patterns and circuit breakers.
Disaster Recovery with Cells
Cell-based architecture provides natural disaster recovery capabilities, but requires planning for regional failures.
Multi-Region Active-Active
Run cells in multiple AWS regions, all actively serving traffic:
Failure scenario: Entire us-east-1 region goes down.
Impact: Cells 1-3 fail (33% of cells). Route 53 health checks detect failure and remove those cells from DNS. Remaining cells (4-9) absorb the traffic.
Data considerations:
- Shard-based cells: Each user's data exists in only one region. Users whose data is in us-east-1 cannot access the system until the region recovers, OR you replicate data cross-region (complex).
- Replicated data: Global reference data remains available (replicated to all regions).
Trade-off: True active-active multi-region requires cross-region data replication with complex consistency management. See Disaster Recovery for multi-region strategies.
Multi-Region Active-Passive
Primary cells in one region (us-east-1), standby cells in another (us-west-2):
Normal operation: All traffic goes to us-east-1 cells. us-west-2 cells are running but idle (or scaled to zero).
Disaster scenario: us-east-1 region fails.
Recovery steps:
- Route 53 health checks detect us-east-1 cells unhealthy
us-west-2 - Standby cells scale up to handle full traffic load
- Database replicas in
us-west-2promoted to primary
RTO: 5-15 minutes (DNS propagation + standby scale-up + database promotion)
RPO: Depends on replication lag. Aurora Global Database provides <1 second RPO.
Cost optimization: Run standby cells at minimal capacity (single small instance) to reduce costs. Scale up only during failover.
See Disaster Recovery for RTO/RPO planning.
Deployment Strategies for Cells
Phased Rollouts
Deploy changes to cells sequentially, monitoring each before proceeding:
#!/bin/bash
# deploy-cells.sh - Progressive cell deployment with validation
CELLS=("cell-1" "cell-2" "cell-3" "cell-4" "cell-5")
for CELL in "${CELLS[@]}"; do
echo "Deploying to $CELL..."
# Deploy to cell
kubectl --context=$CELL apply -f k8s/
kubectl --context=$CELL rollout status deployment/app --timeout=5m
# Wait for deployment to stabilize
echo "Monitoring $CELL for 15 minutes..."
sleep 900
# Validate health metrics
ERROR_RATE=$(curl -s "https://$CELL.api.example.com/metrics" | jq '.error_rate')
LATENCY_P99=$(curl -s "https://$CELL.api.example.com/metrics" | jq '.latency_p99')
# Check if metrics are healthy
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "ERROR: High error rate in $CELL ($ERROR_RATE). Stopping rollout."
exit 1
fi
if (( $(echo "$LATENCY_P99 > 1000" | bc -l) )); then
echo "ERROR: High latency in $CELL (${LATENCY_P99}ms). Stopping rollout."
exit 1
fi
echo "$CELL is healthy. Proceeding to next cell."
done
echo "All cells deployed successfully!"
Benefits:
- Early detection of issues (first cell acts as canary)
- Limited blast radius (stop before deploying to all cells)
- Time to observe and react (15-minute soak time per cell)
Canary Cells
Designate one cell as a canary for new deployments:
Cell 1: Canary (5% of traffic) → Deploy here first
Cell 2: Production (31.67% of traffic)
Cell 3: Production (31.67% of traffic)
Cell 4: Production (31.67% of traffic)
Process:
- Deploy new version to Cell 1 only
- Monitor for 24-48 hours
- If healthy, deploy to remaining cells
Advantages:
- Minimal blast radius for risky changes (only 5% of users)
- Real production traffic validation (not just synthetic tests)
- Fast rollback (revert Cell 1 only)
Route 53 weighted routing for canary:
resource "aws_route53_record" "cell_1_canary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  weighted_routing_policy {
    weight = 5 # 5% of traffic
  }

  set_identifier = "cell-1-canary"

  alias {
    name                   = aws_lb.cell_1_alb.dns_name
    zone_id                = aws_lb.cell_1_alb.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "cell_2_production" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  weighted_routing_policy {
    weight = 32 # ~32% of traffic
  }

  set_identifier = "cell-2"

  alias {
    name                   = aws_lb.cell_2_alb.dns_name
    zone_id                = aws_lb.cell_2_alb.zone_id
    evaluate_target_health = true
  }
}
# Repeat for cells 3 and 4...
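One subtlety worth keeping in mind: Route 53 weights are relative, not percentages. Each record receives weight / sum(all weights) of the traffic, so with weights 5/32/32/32 (total 101) the canary actually gets about 4.95%. A minimal sketch of that arithmetic (the `WeightMath` class name is illustrative, not part of any AWS SDK):

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WeightMath {
    // Convert Route 53 relative weights into actual traffic fractions.
    // Each record gets weight / sum(weights), not weight / 100.
    public static Map<String, Double> trafficShares(Map<String, Integer> weights) {
        double total = weights.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> shares = new LinkedHashMap<>();
        weights.forEach((id, w) -> shares.put(id, w / total));
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("cell-1-canary", 5);
        weights.put("cell-2", 32);
        weights.put("cell-3", 32);
        weights.put("cell-4", 32);
        // Total is 101, so the canary receives 5/101, slightly under 5%.
        trafficShares(weights).forEach((id, share) ->
            System.out.printf(Locale.ROOT, "%s: %.2f%%%n", id, share * 100));
    }
}
```

If you want the weights to read as exact percentages, choose values that sum to 100 (e.g. 5/32/32/31 or 5/95 split across records).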
Blue-Green at Cell Level
Maintain two sets of cells (blue and green), switch all traffic atomically:
Deployment process:
- Green cells run new version but receive no traffic
- Test green cells with synthetic traffic
- Update Route 53 to shift traffic to green cells (change DNS weights)
- Monitor green cells under production load
- If issues arise, shift traffic back to blue cells (fast rollback)
- Once stable, decommission blue cells or repurpose as new standby
Advantages:
- Instant rollback (just change DNS)
- Full production validation before cutover
- Zero-downtime deployments
Disadvantages:
- Cost: Running double infrastructure during transition
- Data migration: Database schema changes require careful planning (must be compatible with both versions)
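The atomic cutover in step 3 boils down to recomputing DNS weights: blue cells go to zero, green cells take all traffic, and the rollback is the same operation with the arguments reversed. A minimal sketch of that weight computation (the `BlueGreenCutover` class is hypothetical; the computed weights would be applied through a single Route 53 ChangeResourceRecordSets change batch so they take effect together):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BlueGreenCutover {
    // New Route 53 weights for a blue -> green cutover: drain blue to 0,
    // send all traffic to green. Reversing the arguments gives the rollback.
    public static Map<String, Integer> cutoverWeights(
            Iterable<String> blueCells, Iterable<String> greenCells) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String cell : blueCells) {
            weights.put(cell, 0);   // blue receives no traffic after cutover
        }
        for (String cell : greenCells) {
            weights.put(cell, 100); // green takes the full load
        }
        return weights;
    }
}
```

The same idea supports a gradual shift (90/10, then 50/50, then 0/100) by using intermediate weights instead of a hard flip, at the cost of running mixed versions for longer.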
See GitLab CI/CD Pipelines for blue-green automation.
Common Anti-Patterns
Cells Too Small
Problem: Creating too many cells increases operational overhead without meaningful blast radius reduction.
Example: 100 cells in a system serving 10,000 users → 100 users per cell.
Why it's bad:
- 100 separate deployments to manage
- 100 separate database instances to maintain
- Diminishing returns: difference between 99% and 99.9% availability is small for most applications
- Increased costs (infrastructure overhead per cell)
Rule of thumb: Aim for 3-10 cells initially. Add more as scale demands.
Cells Too Large
Problem: Cells so large that a single cell failure is catastrophic.
Example: 2 cells in a mission-critical system → 50% of users affected by one cell failure.
Why it's bad:
- Blast radius too large (half your users impacted)
- Defeats the purpose of cell-based architecture
Rule of thumb: Ensure a single cell failure affects <20% of users (5+ cells minimum).
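Both rules of thumb reduce to simple arithmetic: with n equal-sized cells, one failure affects 1/n of users, so a maximum acceptable blast radius b requires at least ceil(1/b) cells. A small sketch of that sizing math (the `CellSizing` class is illustrative):

```java
public class CellSizing {
    // With n equal cells, a single cell failure affects 1/n of users.
    public static double blastRadius(int numberOfCells) {
        return 1.0 / numberOfCells;
    }

    // Minimum cell count so that one cell failure affects at most
    // maxBlastRadius of users (e.g. 0.20 -> 5 cells).
    public static int minCells(double maxBlastRadius) {
        return (int) Math.ceil(1.0 / maxBlastRadius);
    }
}
```

For example, `minCells(0.20)` gives the 5-cell minimum quoted above, while 2 cells means `blastRadius(2)` = 0.5, i.e. half your users are impacted by one failure.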
Tight Coupling Between Cells
Problem: Cells depend on each other for normal operation.
Example:
// BAD: Cell 1 calls Cell 2's database directly
public class PaymentService {
    public PaymentResult processPayment(PaymentRequest request) {
        // Cross-cell database call - creates dependency!
        Account account = cell2DatabaseClient.getAccount(request.accountId);
        if (account.balance < request.amount) {
            return PaymentResult.rejected("Insufficient funds");
        }
        // ... process payment
    }
}
Why it's bad:
- Cell 1 cannot operate if Cell 2 is down
- Cascading failures (Cell 2 overload affects Cell 1)
- Defeats isolation benefits
Fix: Each cell owns its data. If you need cross-cell data, replicate it or use event-driven patterns:
// GOOD: Cell has local replica of account data
public class PaymentService {
    private final AccountReadRepository accountReadRepository; // Local replica

    public PaymentResult processPayment(PaymentRequest request) {
        // Read from local replica - no cross-cell dependency
        Account account = accountReadRepository.findById(request.accountId);
        if (account.balance < request.amount) {
            return PaymentResult.rejected("Insufficient funds");
        }
        // ... process payment
    }
}
Ignoring Cell Affinity
Problem: Users randomly moved between cells, breaking session state or data locality.
Example: User routed to Cell 1 at 10:00 AM, then to Cell 2 at 10:05 AM after the DNS TTL expires and resolution returns a different cell.
Why it's bad:
- Session loss (user logged out unexpectedly)
- Inefficient database queries (user's data is in Cell 1, but Cell 2 must fetch it cross-cell or fail)
Fix: Use consistent routing (hash-based, or sticky DNS with long TTL):
// Consistent routing ensures same user always goes to same cell
public class CellRouter {
    public String getCellForUser(String userId) {
        // Same user ID always hashes to same cell
        // (floorMod avoids a negative index when hashCode() is negative)
        int cellIndex = Math.floorMod(userId.hashCode(), numberOfCells);
        return "cell-" + cellIndex;
    }
}
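One caveat with plain modulo routing: changing numberOfCells remaps almost every user to a different cell, breaking affinity in bulk during a scale-out. A consistent-hash ring limits that: adding or removing a cell only remaps the users whose positions land on it. A minimal stdlib-only sketch (class and constants are illustrative, not the document's routing layer; MD5 is used only as a stable, well-distributed hash, not for security):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRouter {
    private static final int VIRTUAL_NODES = 100; // replicas per cell for even spread
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRouter(Iterable<String> cells) {
        for (String cell : cells) addCell(cell);
    }

    public void addCell(String cell) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(cell + "#" + i), cell);
        }
    }

    public void removeCell(String cell) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(cell + "#" + i));
        }
    }

    // Same user ID always maps to the same cell; only users whose ring
    // positions fall on an added/removed cell are remapped.
    public String getCellForUser(String userId) {
        long h = hash(userId);
        SortedMap<Long, String> tail = ring.tailMap(h); // first cell at or after h
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (digest[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is complexity: modulo routing is fine if the cell count is fixed, while the ring pays off when cells are added or drained regularly.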
Insufficient Health Checks
Problem: Health checks only verify application is running, not that it can serve traffic correctly.
Example:
// BAD: Only checks if process is alive
@GetMapping("/health")
public String health() {
    return "OK"; // Always returns OK even if database is down!
}
Why it's bad:
- Cell marked healthy even though database is down
- Users routed to broken cell
- Delayed incident detection
Fix: Check all critical dependencies:
// GOOD: Comprehensive health check
@GetMapping("/health/ready")
public ResponseEntity<String> readiness() {
    boolean dbHealthy = checkDatabase();
    boolean cacheHealthy = checkCache();
    boolean downstreamHealthy = checkDownstreamServices();

    if (dbHealthy && cacheHealthy && downstreamHealthy) {
        return ResponseEntity.ok("READY");
    } else {
        return ResponseEntity.status(503).body("NOT READY");
    }
}
No Graceful Degradation
Problem: Cell failure results in hard errors rather than degraded functionality.
Example: User cannot view account balance because the account service cell is down, even though the data is cached.
Fix: Implement fallbacks:
@Service
public class AccountService {
    private static final Logger logger = LoggerFactory.getLogger(AccountService.class);

    private final AccountServiceClient accountClient;
    private final RedisTemplate<String, Account> cache;

    /**
     * Get account with fallback to cache if primary cell is down.
     */
    public Optional<Account> getAccount(String accountId) {
        try {
            // Try primary cell
            return Optional.of(accountClient.getAccount(accountId));
        } catch (Exception e) {
            // Primary cell down - fall back to cached data
            logger.warn("Account service unavailable, using cached data for {}", accountId);
            Account cached = cache.opsForValue().get("account:" + accountId);
            return Optional.ofNullable(cached);
        }
    }
}
Users get slightly stale data instead of complete failure.
Cost Implications
Cell-based architecture increases infrastructure costs. Optimize by:
Right-Sizing Cells
Size each cell for its share of traffic, not full capacity:
- Traditional HA: 2 large instances (active-passive) sized for 100% traffic
- Cell-based: 3 cells with smaller instances sized for 33% traffic each
Example cost comparison:
| Architecture | Instances | Instance Size | Monthly Cost |
|---|---|---|---|
| Active-Passive HA | 2 × db.r6g.2xlarge | 8 vCPU, 64 GB | $1,560 |
| 3 Cells | 3 × db.r6g.large | 2 vCPU, 16 GB | $1,170 |
Even though you run more instances, total cost is lower: an active-passive standby must be sized for the full load while sitting idle, whereas each cell is sized only for its share of traffic.
Aurora Serverless v2 for Variable Load
Use Aurora Serverless v2 for cells with variable traffic:
resource "aws_rds_cluster" "cell_db" {
  count              = var.number_of_cells
  cluster_identifier = "cell-${count.index}-db"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # Scale down to 0.5 ACU during low traffic
    max_capacity = 4.0 # Scale up to 4 ACU during peak
  }
}

resource "aws_rds_cluster_instance" "cell_db_instance" {
  count              = var.number_of_cells
  cluster_identifier = aws_rds_cluster.cell_db[count.index].id
  instance_class     = "db.serverless" # Required for Serverless v2 scaling
  engine             = aws_rds_cluster.cell_db[count.index].engine
}
Cost savings: Pay for actual usage rather than peak capacity. Cells scale down automatically during low traffic periods.
Spot Instances for Non-Critical Cells
Use EC2 Spot Instances for canary cells or lower-priority traffic:
resource "aws_eks_node_group" "cell_canary_spot" {
  cluster_name    = aws_eks_cluster.cell_canary.name
  node_group_name = "canary-spot-nodes"

  scaling_config {
    desired_size = 2
    max_size     = 5
    min_size     = 1
  }

  capacity_type  = "SPOT" # Use Spot instances (60-90% cheaper)
  instance_types = ["t3.large", "t3a.large", "t2.large"] # Multiple types for availability
}
Savings: 60-90% cost reduction. Acceptable for canary cells (small blast radius if Spot instances reclaimed).
Reserved Instances for Production Cells
Commit to production cells with Reserved Instances or Savings Plans:
- 1-year commitment: ~30% savings
- 3-year commitment: ~50% savings
Reserve capacity for your baseline cell infrastructure. Use On-Demand for burst capacity.
See AWS Cost Optimization for detailed cost strategies.
Summary
Cell-based architecture is a powerful resilience pattern that limits blast radius by partitioning infrastructure into isolated, independently operating cells. Key principles:
- Failure domain isolation: Each cell is a complete stack (compute, database, network). Failures are contained to one cell.
- Data partitioning: Users' data lives in one cell (sharding) or is replicated to all cells (global data). Minimize cross-cell dependencies.
- Routing: Use DNS-based routing (Route 53) for simple multi-region cells, or application-layer routing (consistent hashing) for fine-grained shard-based cells.
- Health checks and failover: Comprehensive health checks at all layers. Route 53 automatically removes unhealthy cells from DNS.
- Deployment strategies: Deploy to cells sequentially (phased rollout) or use canary cells to minimize risk.
- Cost optimization: Right-size cells, use serverless databases, leverage Spot instances for non-critical cells.
When to use cell-based architecture:
- Large-scale systems where availability is critical
- Applications that can tolerate eventual consistency
- Systems with natural data partitioning (multi-tenant, geographic)
When NOT to use:
- Small-scale applications (operational overhead outweighs benefits)
- Applications requiring strong cross-shard consistency
- Tightly coupled monoliths (refactor to microservices first)
Further Reading
- AWS Well-Architected Framework - Reliability Pillar
- Disaster Recovery and Business Continuity - RTO/RPO planning, multi-region strategies
- Microservices Architecture - Service decomposition and isolation patterns
- Spring Boot Resilience - Circuit breakers, retries, timeouts
- AWS Route 53 DNS - DNS routing policies and health checks
- AWS Databases - RDS, Aurora, DynamoDB patterns
- Event-Driven Architecture - Cross-cell event propagation
- Chaos Engineering - Testing cell failure scenarios