Disaster Recovery and Business Continuity
What is Disaster Recovery?
Disaster Recovery (DR) is the process of restoring systems and data after a catastrophic failure. Business Continuity (BC) is the broader strategy ensuring business operations continue during and after a disaster.
Disaster scenarios:
- Infrastructure failures: Data center outage, network partition, cloud region failure
- Data loss: Database corruption, accidental deletion, ransomware encryption
- Application failures: Critical bug causing data corruption, cascading service failures
- Human error: Accidental configuration changes, deployment of broken code
- Security incidents: Ransomware attack, data breach, DDoS attack
- Natural disasters: Fire, flood, earthquake affecting physical infrastructure
The goal of DR is to minimize two critical metrics: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
RTO and RPO: Defining Recovery Targets
RTO and RPO define how much downtime and data loss are acceptable for your business. They drive all DR decisions and investments.
Recovery Time Objective (RTO)
Definition: Maximum acceptable time a system can be down after a disaster.
Question answered: "How long can we be offline before business impact is unacceptable?"
RTO Examples by Service Tier:
| Service Tier | RTO Target | Example Services | Implications |
|---|---|---|---|
| Mission Critical | < 1 hour | Payment processing, authentication | Multi-region active-active, automated failover, 24/7 on-call |
| Business Critical | 1-4 hours | Customer accounts, transaction history | Multi-region active-passive, documented runbooks, business hours support |
| Important | 4-24 hours | Reporting, analytics dashboards | Single region with backups, restore from backup acceptable |
| Non-Critical | 1-7 days | Internal tools, staging environments | Backup-only, manual rebuild acceptable |
Cost increases exponentially as RTO decreases: moving from restore-from-backup (hours) to warm standby (minutes) to active-active multi-region (seconds) roughly multiplies infrastructure and operational cost at each step.
Setting RTO:
Ask stakeholders:
- "How much revenue do we lose per hour of downtime?"
- "What regulatory requirements exist for uptime?"
- "What's the competitive impact of extended outages?"
- "What can we realistically afford to implement?"
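These questions can be turned into a rough cost comparison. A minimal sketch, with entirely hypothetical figures, weighing expected annual downtime loss against the cost of a tighter RTO tier:

```python
# Rough annual-cost comparison for choosing an RTO tier.
# All dollar figures and outage rates below are hypothetical placeholders.

def expected_downtime_loss(revenue_per_hour, outages_per_year, hours_per_outage):
    """Expected annual revenue lost to downtime."""
    return revenue_per_hour * outages_per_year * hours_per_outage

# Hypothetical: $50k/hour revenue, 2 outages per year
loss_with_4h_rto = expected_downtime_loss(50_000, 2, 4)   # $400k/year
loss_with_1h_rto = expected_downtime_loss(50_000, 2, 1)   # $100k/year

# If the tighter-RTO architecture costs less than this difference
# per year, it pays for itself in this scenario.
savings = loss_with_4h_rto - loss_with_1h_rto             # $300k/year
```

The same arithmetic works for regulatory fines or churn estimates: plug in whatever per-hour cost your stakeholders agree on.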
Recovery Point Objective (RPO)
Definition: Maximum acceptable age of data after recovery (how much data loss is tolerable).
Question answered: "How much data can we afford to lose?"
Example: Your database backs up hourly. Disaster strikes at 3:45 PM. Last backup was 3:00 PM. You lose 45 minutes of data (all transactions between 3:00 and 3:45 PM).
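The arithmetic in that example can be expressed directly; a minimal sketch (timestamps hypothetical):

```python
from datetime import datetime

def data_loss_window(last_backup, disaster_time):
    """Realized RPO: age of the newest recoverable data at disaster time."""
    return disaster_time - last_backup

loss = data_loss_window(
    datetime(2024, 1, 15, 15, 0),   # last hourly backup at 3:00 PM
    datetime(2024, 1, 15, 15, 45),  # disaster strikes at 3:45 PM
)
print(loss)  # 0:45:00 -- 45 minutes of transactions lost
```

If the realized window here exceeds your RPO target, the backup interval is too long for that data.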
RPO Examples by Data Type:
| Data Type | RPO Target | Backup Strategy | Example |
|---|---|---|---|
| Financial transactions | Near-zero (seconds) | Continuous replication, transaction logs | Payment records, account balances |
| User-generated content | < 5 minutes | Synchronous multi-region writes | User posts, messages, uploads |
| Operational data | < 1 hour | Asynchronous replication, hourly snapshots | Session data, cache, analytics |
| Historical/archival data | < 24 hours | Daily backups, archive storage | Old reports, audit logs |
| Derived/rebuildable data | No requirement | No backups (rebuild from source) | Caches, search indexes, aggregates |
Setting RPO:
Ask:
- "How much transaction data can we afford to lose?"
- "Can we reconstruct lost data from other sources?"
- "What are legal/regulatory requirements for data retention?"
- "What's the cost of data loss (customer trust, revenue, compliance)?"
RTO vs RPO Trade-offs
RTO and RPO are independent but related:
Quadrant 1 (Low RTO, Low RPO): Most expensive - requires active-active multi-region, continuous replication, automated failover
Quadrant 2 (High RTO, Low RPO): Data precious but downtime acceptable - focus on backups, can restore slowly
Quadrant 3 (High RTO, High RPO): Standard backup strategy - daily backups, manual restore
Quadrant 4 (Low RTO, High RPO): Fast recovery but data loss acceptable - stateless services, quick redeploy
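The four quadrants can be captured as a small classifier. A sketch, where the 4-hour low/high cutoff is an illustrative assumption, not a standard:

```python
def dr_quadrant(rto_hours, rpo_hours, threshold_hours=4):
    """Map RTO/RPO targets to the quadrant strategies above.
    The 4-hour threshold is an illustrative cutoff, not an industry standard."""
    low_rto = rto_hours <= threshold_hours
    low_rpo = rpo_hours <= threshold_hours
    if low_rto and low_rpo:
        return "active-active multi-region, continuous replication, automated failover"
    if not low_rto and low_rpo:
        return "strong backups, slow restore acceptable"
    if not low_rto and not low_rpo:
        return "standard daily backups, manual restore"
    return "stateless services, quick redeploy, data loss acceptable"
```

In practice each service lands in its own quadrant, so run the classification per service tier rather than once for the whole system.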
Backup Strategies
Backups are the foundation of disaster recovery. Without backups, you have no recovery capability.
Backup Types
1. Full Backup
Copies all data every time. Simple but slow and storage-intensive.
Pros:
- Simple recovery (single backup contains everything)
- No dependency on previous backups
- Fastest restore (single source)
Cons:
- Slowest backup (copies everything every time)
- Highest storage cost (duplicates unchanged data)
- Resource-intensive (impacts production during backup)
Use case: Small databases (< 100 GB), weekly full backups
2. Incremental Backup
Backs up only data changed since last backup (full or incremental).
Recovery: Restore full backup, then apply each incremental in sequence.
Pros:
- Fastest backups (only changed data)
- Lowest storage usage
- Minimal production impact
Cons:
- Slowest recovery (need full + all incrementals)
- Complex restore process
- Dependency chain (if one incremental is corrupted, later ones are unusable)
Use case: Large databases with frequent backups
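The restore sequence (full backup first, then every incremental in order) and the chain-dependency risk can be sketched as follows. Backup contents and the `apply` merge function are hypothetical stand-ins for a real restore tool:

```python
def restore_incremental_chain(full_backup, incrementals, apply):
    """Restore a full backup, then replay incrementals in order.
    `apply` is a caller-supplied merge function. If any incremental is
    corrupt, every later one in the chain becomes unusable."""
    state = apply({}, full_backup)
    for i, inc in enumerate(incrementals):
        if inc.get("corrupt"):
            raise RuntimeError(
                f"incremental {i} corrupt; {len(incrementals) - i} backups unusable")
        state = apply(state, inc)
    return state

# Usage sketch: each backup is a dict of keys changed since the last backup
merge = lambda state, backup: {**state, **backup.get("data", {})}
restored = restore_incremental_chain(
    {"data": {"a": 1, "b": 1}},                  # full backup
    [{"data": {"b": 2}}, {"data": {"c": 3}}],    # two incrementals
    merge,
)
# restored == {"a": 1, "b": 2, "c": 3}
```

Note how a corrupt incremental aborts the whole replay; this is why incremental chains need integrity checks on every link, not just the full backup.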
3. Differential Backup
Backs up data changed since last full backup.
Recovery: Restore full backup + most recent differential (only 2 files).
Pros:
- Faster recovery than incremental (only 2 restore operations)
- Simpler than incremental (no chain dependency)
Cons:
- Slower backups than incremental (each differential grows over week)
- More storage than incremental
Use case: Balance between backup speed and recovery speed
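The trade-off shows up clearly in a toy storage model: each incremental holds only one day's changes, while each differential re-copies everything since the last full backup, so it grows through the week. Figures are hypothetical:

```python
def weekly_storage(daily_change_gb, strategy):
    """Toy model: backup storage used by 6 daily backups after a Sunday full.
    Assumes a constant daily change rate, which real workloads rarely have."""
    if strategy == "incremental":
        return daily_change_gb * 6                            # one day's delta each
    if strategy == "differential":
        return sum(daily_change_gb * d for d in range(1, 7))  # grows each day
    raise ValueError(strategy)

print(weekly_storage(10, "incremental"))   # 60 GB
print(weekly_storage(10, "differential"))  # 210 GB
```

The differential's extra storage buys a two-step restore (full + latest differential) instead of a seven-step chain.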
4. Continuous Data Protection (CDP)
Captures every change as it happens (transaction log streaming, database replication).
Pros:
- Near-zero RPO (only lose in-flight transactions)
- Point-in-time recovery to any second
- Fastest recovery (replica is always current)
Cons:
- Most expensive (continuous streaming, storage, replica infrastructure)
- Complex to manage
- Requires specific database features (logical replication, WAL archiving)
Use case: Mission-critical systems with strict RPO requirements
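In spirit, CDP is just a consumer that applies every change event to a replica as it arrives, so the replica is never more than one in-flight event behind. A toy sketch with a dict standing in for the replica (no real replication protocol involved):

```python
def apply_change_stream(replica, changes):
    """Toy CDP consumer: apply each change event to the replica in order,
    keeping it continuously current (near-zero RPO)."""
    for change in changes:
        if change["op"] == "upsert":
            replica[change["key"]] = change["value"]
        elif change["op"] == "delete":
            replica.pop(change["key"], None)
    return replica

replica = apply_change_stream({}, [
    {"op": "upsert", "key": "order:1", "value": 99.99},
    {"op": "upsert", "key": "order:2", "value": 149.50},
    {"op": "delete", "key": "order:1"},
])
# replica == {"order:2": 149.5}
```

Real implementations (WAL streaming, logical replication) add ordering guarantees, checkpoints, and failure recovery on top of this basic apply loop.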
Backup Strategy Selection Matrix
| Data Size | Change Rate | RPO Target | Recommended Strategy |
|---|---|---|---|
| < 100 GB | Low | 24 hours | Daily full backups |
| < 100 GB | High | < 1 hour | Hourly full backups + transaction logs |
| 100 GB - 1 TB | Low | 24 hours | Weekly full + daily differential |
| 100 GB - 1 TB | High | < 1 hour | Daily full + hourly incremental + transaction logs |
| > 1 TB | Low | 24 hours | Weekly full + daily incremental |
| > 1 TB | High | < 5 min | Continuous replication + transaction log streaming |
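The matrix above can be encoded directly so tooling can suggest a strategy. A sketch using the same thresholds and recommendation strings as the table (the RPO column correlates with the change rate, so it is folded into `change_rate` here):

```python
def backup_strategy(size_gb, change_rate):
    """Encode the selection matrix above. change_rate is 'low' or 'high';
    the table's RPO column tracks change rate, so it is implied here."""
    high = change_rate == "high"
    if size_gb < 100:
        return ("hourly full backups + transaction logs" if high
                else "daily full backups")
    if size_gb <= 1024:  # 100 GB - 1 TB
        return ("daily full + hourly incremental + transaction logs" if high
                else "weekly full + daily differential")
    return ("continuous replication + transaction log streaming" if high
            else "weekly full + daily incremental")
```

Treat the output as a starting point: a 90 GB database growing fast will cross the 100 GB boundary soon, so pick the strategy for where the data will be, not where it is.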
Backup Implementation Example (PostgreSQL)
#!/bin/bash
# Daily full backup with transaction log archiving
# Configuration
DB_NAME="payments_db"
BACKUP_DIR="/backups/postgres"
S3_BUCKET="s3://company-backups/postgres"
RETENTION_DAYS=30
# Timestamp for this backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_full_${TIMESTAMP}.dump"
# 1. Full backup using pg_dump
echo "Starting full backup..."
pg_dump \
--format=custom \
--compress=9 \
--file="${BACKUP_FILE}" \
--verbose \
"${DB_NAME}"
# 2. Verify backup integrity
echo "Verifying backup..."
pg_restore --list "${BACKUP_FILE}" > /dev/null
if [ $? -eq 0 ]; then
echo "✓ Backup verified successfully"
else
echo "✗ Backup verification failed!"
exit 1
fi
# 3. Upload to S3
echo "Uploading to S3..."
aws s3 cp "${BACKUP_FILE}" "${S3_BUCKET}/full/" \
--storage-class STANDARD_IA \
--metadata "retention-days=${RETENTION_DAYS},timestamp=${TIMESTAMP}"
# 4. Cleanup old local backups (keep last 7 days locally)
find "${BACKUP_DIR}" -name "*.dump" -mtime +7 -delete
# 5. Archive transaction logs (WAL) for point-in-time recovery
# Configured in postgresql.conf:
# wal_level = replica
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://company-backups/postgres/wal/%f'
echo "Backup completed: ${BACKUP_FILE}"
Transaction log archiving (postgresql.conf):
# Enable WAL archiving for point-in-time recovery
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://company-backups/postgres/wal/%f'
archive_timeout = 300 # Archive every 5 minutes even if WAL not full
# Replication for near-zero RPO
max_wal_senders = 5
wal_keep_size = 1GB
Backup Testing and Validation
Critical principle: Untested backups are not backups. You must regularly verify you can restore.
Backup testing schedule:
backup_testing:
  daily:
    - name: "Automated integrity check"
      action: "Verify backup file is not corrupted"
      tool: "pg_restore --list"
  weekly:
    - name: "Sample restore to staging"
      action: "Restore random backup to staging environment"
      validation: "Run smoke tests, query key tables"
  monthly:
    - name: "Full restore drill"
      action: "Complete restore from backup to isolated environment"
      validation: "Application teams verify data integrity"
  quarterly:
    - name: "Disaster recovery simulation"
      action: "Simulate complete data center failure"
      validation: "Restore and failover to DR region, end-to-end testing"
Automated restore verification:
import boto3
import subprocess

def verify_backup(backup_file):
    """Automated backup verification process.
    alert(), run_query(), and validate_result() are assumed helpers
    (alerting and database utilities) provided elsewhere."""
    # 1. Download backup from S3
    s3 = boto3.client('s3')
    s3.download_file('company-backups', f'postgres/full/{backup_file}', f'/tmp/{backup_file}')
    # 2. Verify integrity
    result = subprocess.run(
        ['pg_restore', '--list', f'/tmp/{backup_file}'],
        capture_output=True
    )
    if result.returncode != 0:
        alert(f"Backup verification failed: {backup_file}")
        return False
    # 3. Restore to test database
    subprocess.run([
        'pg_restore',
        '--dbname=test_restore_db',
        '--clean',
        '--if-exists',
        f'/tmp/{backup_file}'
    ])
    # 4. Run data integrity checks
    checks = [
        "SELECT COUNT(*) FROM payments",                  # Ensure tables exist
        "SELECT MAX(created_at) FROM payments",           # Check data freshness
        "SELECT COUNT(*) FROM users WHERE email IS NULL"  # Constraint validation
    ]
    for check in checks:
        result = run_query('test_restore_db', check)
        if not validate_result(result):
            alert(f"Data integrity check failed: {check}")
            return False
    return True
Database Backup and Restore Procedures
Databases require special consideration for backups due to transaction consistency, referential integrity, and large data volumes.
Point-in-Time Recovery (PITR)
PITR enables restoring to any specific moment, not just backup snapshots. This is critical when:
- Recovering from data corruption (restore to moment before corruption)
- Investigating security incidents (restore to specific time to examine data)
- Recovering from accidental deletion (restore to 5 minutes before DROP TABLE)
How PITR works: restore the most recent physical base backup, then replay the archived WAL (transaction log) forward, stopping when the requested target timestamp is reached.
PostgreSQL PITR restore:
#!/bin/bash
# Point-in-time recovery to specific timestamp.
# PITR requires a physical base backup (pg_basebackup) plus archived WAL;
# a logical pg_dump backup cannot be combined with WAL replay.
TARGET_TIME="2024-01-15 14:29:00"  # 1 minute before disaster
BASE_BACKUP="/backups/base_backup_20240114_230000.tar.gz"
WAL_ARCHIVE="s3://company-backups/postgres/wal/"
# 1. Stop PostgreSQL
systemctl stop postgresql
# 2. Remove corrupted data directory
rm -rf /var/lib/postgresql/14/main/*
# 3. Restore physical base backup into the data directory
tar -xzf "${BASE_BACKUP}" -C /var/lib/postgresql/14/main/
# 4. Configure recovery (PostgreSQL 12+ removed recovery.conf; set the
#    recovery parameters in postgresql.conf and create recovery.signal)
cat >> /var/lib/postgresql/14/main/postgresql.conf << EOF
restore_command = 'aws s3 cp ${WAL_ARCHIVE}%f %p'
recovery_target_time = '${TARGET_TIME}'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/14/main/recovery.signal
# 5. Start PostgreSQL - it will replay WAL to target time
systemctl start postgresql
# 6. Verify recovery
psql -c "SELECT NOW(), MAX(created_at) FROM payments;"
Logical vs Physical Backups
Logical Backup (pg_dump, mysqldump):
Exports data as SQL statements or custom format.
-- Example logical backup output
CREATE TABLE payments (
    id SERIAL PRIMARY KEY,
    amount DECIMAL(10,2),
    created_at TIMESTAMP
);
INSERT INTO payments VALUES (1, 99.99, '2024-01-15 10:00:00');
INSERT INTO payments VALUES (2, 149.50, '2024-01-15 10:05:00');
Pros:
- Portable across PostgreSQL versions
- Human-readable (SQL format)
- Can restore individual tables
- Platform-independent
Cons:
- Slower for large databases
- Larger backup files
- Cannot do PITR
- Must replay all INSERT statements during restore
Physical Backup (pg_basebackup, filesystem snapshots):
Copies actual database files (data pages, indexes, transaction logs).
Pros:
- Faster backup and restore (file copy, not SQL replay)
- Supports PITR (when combined with WAL archiving)
- Exact replica of database state
Cons:
- Version-specific (PostgreSQL 14 backup won't restore to PostgreSQL 15)
- All-or-nothing (can't restore single table)
- Requires identical architecture (can't restore from Linux to Windows)
Recommendation: Use both:
- Physical backups for disaster recovery (fast restore, PITR)
- Logical backups for migrations, development, selective restore
Backup Encryption and Security
Backups often contain sensitive data and must be secured.
Encryption at rest:
# Encrypt backup before uploading to S3
pg_dump payments_db | \
gzip | \
openssl enc -aes-256-cbc -salt -pbkdf2 -pass file:/etc/backup-encryption-key | \
aws s3 cp - s3://company-backups/postgres/encrypted/backup_$(date +%Y%m%d).dump.gz.enc
S3 server-side encryption:
# Let AWS handle encryption (simpler, managed keys)
aws s3 cp backup.dump s3://company-backups/postgres/ \
--server-side-encryption AES256 \
--storage-class STANDARD_IA
Access controls:
# S3 bucket policy - only the backup service can write; only the DR restore
# role can read, and only from the corporate network (10.0.0.0/8, i.e. VPN)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ACCOUNT:role/backup-service"},
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::company-backups/postgres/*"
    },
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ACCOUNT:role/dr-restore-service"},
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-backups/postgres/*",
      "Condition": {
        "IpAddress": {"aws:SourceIp": "10.0.0.0/8"}
      }
    }
  ]
}
Backup retention and lifecycle:
# S3 lifecycle policy - automated tiering and deletion
aws s3api put-bucket-lifecycle-configuration \
--bucket company-backups \
--lifecycle-configuration file://lifecycle.json
lifecycle.json:
{
  "Rules": [
    {
      "Id": "postgres-backup-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "postgres/full/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 365}
    }
  ]
}
Multi-Region Failover
Multi-region architecture provides resilience against entire region failures (data center outages, natural disasters, network partitions).
Active-Passive Failover
Architecture: Primary region handles all traffic. Secondary region is on standby, receives replicated data but serves no traffic until failover.
Pros:
- Lower cost (DR resources can be scaled down)
- Simpler management (single active region)
- No split-brain risk (only one region writes)
Cons:
- Slower failover (RTO: 5-30 minutes to promote standby)
- Data loss possible (RPO: seconds to minutes depending on replication lag)
- DR resources idle (not serving production traffic)
Implementation (PostgreSQL streaming replication):
# Primary database (us-east-1) - postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
synchronous_commit = on  # Near-zero RPO requires synchronous_standby_names to be set as well
# Replica database (us-west-2) - postgresql.conf (PostgreSQL 12+ has no
# recovery.conf; create an empty standby.signal file in the data directory)
primary_conninfo = 'host=db-primary.us-east-1.internal port=5432 user=replicator'
primary_slot_name = 'standby_slot'
hot_standby = on  # Allow read queries on replica
Failover process:
#!/bin/bash
# Automated failover to DR region
set -e
# 1. Detect primary failure (in production, require several consecutive
#    failed health checks before failing over, to avoid flapping on transient errors)
if ! pg_isready -h db-primary.us-east-1.internal; then
echo "Primary database unhealthy - initiating failover"
# 2. Promote replica to primary
ssh dr-db.us-west-2.internal "pg_ctl promote -D /var/lib/postgresql/14/main"
# 3. Wait for promotion
sleep 10
# 4. Verify replica is now accepting writes
psql -h dr-db.us-west-2.internal -c "SELECT pg_is_in_recovery();" # Should return false
# 5. Update DNS to point to DR region (Route 53 example)
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch file://dns-failover.json
# 6. Scale up application servers in DR region
aws autoscaling set-desired-capacity \
--auto-scaling-group-name payment-service-dr \
--desired-capacity 10 # Match primary region capacity
# 7. Verify traffic is flowing to DR region
curl https://api.company.com/health
echo "Failover complete - now serving from us-west-2"
else
echo "Primary healthy - no action needed"
fi
DNS failover (Route 53):
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.company.com",
      "Type": "A",
      "SetIdentifier": "Primary",
      "Failover": "PRIMARY",
      "TTL": 60,
      "ResourceRecords": [{"Value": "1.2.3.4"}],
      "HealthCheckId": "abc123"
    }
  }, {
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.company.com",
      "Type": "A",
      "SetIdentifier": "Secondary",
      "Failover": "SECONDARY",
      "TTL": 60,
      "ResourceRecords": [{"Value": "5.6.7.8"}]
    }
  }]
}
Active-Active Failover
Architecture: Both regions handle production traffic simultaneously. Users are routed to nearest region.
Pros:
- Near-zero RTO (traffic automatically reroutes to healthy region)
- Better user experience (users route to nearest region - lower latency)
- Higher utilization (all resources serve production traffic)
Cons:
- Most expensive (full capacity in both regions)
- Complex data consistency (must handle write conflicts)
- Risk of split-brain (network partition causes divergence)
Use case: Mission-critical services where downtime is not acceptable and latency matters (payment processing, authentication).
Write conflict resolution:
When both regions accept writes simultaneously, conflicts can occur:
-- Region 1: User updates email at 10:00:01
UPDATE users SET email = '[email protected]' WHERE id = 123;
-- Region 2: Same user updates email at 10:00:02 (before replication)
UPDATE users SET email = '[email protected]' WHERE id = 123;
-- Conflict! Which email is correct?
Conflict resolution strategies:
- Last-write-wins (LWW): Use timestamp to determine winner
-- Schema includes timestamp for conflict resolution
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    updated_at TIMESTAMP DEFAULT NOW(),
    updated_in_region VARCHAR(50)
);
-- Conflict resolution: keep update with latest timestamp
-- If Region 2 update is 10:00:02 and Region 1 is 10:00:01, Region 2 wins
- Application-specific logic: Domain knowledge determines winner
// Example: For user profiles, prefer update from user's home region
public class ConflictResolver {
    public User resolveConflict(User region1Version, User region2Version, User originalVersion) {
        // If user's home region is us-east-1, prefer changes from that region
        if (originalVersion.getHomeRegion().equals("us-east-1")) {
            return region1Version;
        } else {
            return region2Version;
        }
    }
}
- CRDTs (Conflict-free Replicated Data Types): Mathematically guarantee eventual consistency
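As a taste of the CRDT approach, a grow-only counter (G-counter) gives each region its own slot and merges replicas with an element-wise maximum, so concurrent increments in two regions never conflict and merge order never matters:

```python
class GCounter:
    """Grow-only counter CRDT: each region increments only its own slot;
    merge takes the per-region maximum, so replicas always converge."""
    def __init__(self):
        self.counts = {}  # region -> count

    def increment(self, region, n=1):
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())

# Both regions accept writes concurrently, then replicate to each other
east, west = GCounter(), GCounter()
east.increment("us-east-1", 3)
west.increment("us-west-2", 2)
east.merge(west)
west.merge(east)
# Both replicas converge: east.value() == west.value() == 5
```

Counters, sets, and registers all have CRDT formulations, but they only fit data whose operations commute; arbitrary relational updates (like the email example above) do not.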
Recommendation: Avoid multi-master writes when possible. Use regional affinity (users always write to same region) to minimize conflicts.
Load Balancer Failover
Route traffic away from failed region using health checks:
Global load balancer (AWS Global Accelerator, Cloudflare Load Balancing):
global_load_balancer:
  endpoints:
    - region: us-east-1
      ip: 1.2.3.4
      weight: 50
      health_check:
        path: /health
        interval: 10s
        timeout: 5s
        unhealthy_threshold: 3
    - region: us-west-2
      ip: 5.6.7.8
      weight: 50
      health_check:
        path: /health
        interval: 10s
        timeout: 5s
        unhealthy_threshold: 3
  routing_policy: latency  # Route to region with lowest latency
  failover: automatic      # If health check fails, remove from pool
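The routing policy above boils down to: drop endpoints whose health checks have failed too many times in a row, then route to the lowest-latency survivor. A sketch with hypothetical endpoint data:

```python
def pick_endpoint(endpoints):
    """Mimic the load balancer policy above: exclude endpoints whose
    consecutive failures reached the unhealthy threshold, then route
    to the lowest-latency healthy region."""
    healthy = [e for e in endpoints if e["failures"] < e["unhealthy_threshold"]]
    if not healthy:
        raise RuntimeError("no healthy regions available")
    return min(healthy, key=lambda e: e["latency_ms"])

endpoints = [
    {"region": "us-east-1", "latency_ms": 20, "failures": 3, "unhealthy_threshold": 3},
    {"region": "us-west-2", "latency_ms": 70, "failures": 0, "unhealthy_threshold": 3},
]
print(pick_endpoint(endpoints)["region"])  # us-west-2 (us-east-1 marked unhealthy)
```

Note the nearest region loses to the farther healthy one as soon as it crosses the failure threshold; that automatic demotion is the whole point of health-checked routing.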
Health check endpoint:
@RestController
public class HealthController {

    private DataSource dataSource;  // injected by Spring

    @GetMapping("/health")
    public ResponseEntity<HealthStatus> health() {
        // Comprehensive health check
        boolean databaseHealthy = checkDatabase();
        boolean dependenciesHealthy = checkExternalDependencies();
        boolean resourcesHealthy = checkResourceAvailability();
        if (databaseHealthy && dependenciesHealthy && resourcesHealthy) {
            return ResponseEntity.ok(new HealthStatus("healthy"));
        } else {
            // Fail health check - load balancer will remove this region from pool
            return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
                    .body(new HealthStatus("unhealthy"));
        }
    }

    private boolean checkDatabase() {
        try {
            dataSource.getConnection().close();
            return true;
        } catch (SQLException e) {
            return false;
        }
    }
}
Disaster Recovery Testing
DR plans that aren't tested regularly will fail when needed. Testing verifies your procedures work and trains your team.
Annual DR Test
Simulate complete disaster scenario annually (minimum):
annual_dr_test:
  objective: "Verify complete recovery from primary region failure"
  preparation:
    - schedule: "Schedule test 4 weeks in advance"
    - communication: "Notify all stakeholders (engineering, product, leadership)"
    - maintenance_window: "Schedule during low-traffic period (e.g., Sunday 2-6 AM)"
    - rollback_plan: "Document rollback procedure if test fails"
  test_scenario:
    - simulate: "Primary region (us-east-1) becomes unavailable"
    - method: "Disable primary region in load balancer (do NOT destroy infrastructure)"
  execution:
    - step1: "Trigger failover procedure at 2:00 AM"
    - step2: "Promote DR database to primary (us-west-2)"
    - step3: "Update DNS to point to DR region"
    - step4: "Scale up DR application servers to full capacity"
    - step5: "Verify traffic flowing to DR region"
  validation:
    - functional_tests: "Run automated smoke tests against DR region"
    - performance_tests: "Verify latency and throughput meet SLOs"
    - data_integrity: "Verify recent transactions present in DR database"
    - end_to_end: "Complete manual user journey (login, transaction, logout)"
  metrics:
    - rto_actual: "Measure time from failure to full recovery"
    - rpo_actual: "Measure data loss (transactions missing)"
    - issues_found: "Document any failures or gaps"
  post_test:
    - failback: "Return to primary region (tests failback procedure)"
    - retrospective: "Team meeting to review results"
    - action_items: "Update runbooks based on learnings"
    - next_test: "Schedule next test"
Quarterly Component Tests
Test individual components quarterly without full failover:
quarterly_component_tests:
  database_restore:
    frequency: "Quarterly"
    action: "Restore yesterday's backup to isolated test environment"
    validation: "Query key tables, verify row counts"
    duration: "2 hours"
  dns_failover:
    frequency: "Quarterly"
    action: "Update test subdomain (test.company.com) to point to DR region"
    validation: "Verify DNS propagation, test application access"
    duration: "30 minutes"
  runbook_walkthrough:
    frequency: "Quarterly"
    action: "On-call engineer executes runbook step-by-step (dry run)"
    validation: "Verify all commands work, all access credentials valid"
    duration: "1 hour"
Chaos Engineering for DR
Proactively inject failures to test resilience:
chaos_experiments:
  database_failover:
    description: "Simulate primary database failure during business hours"
    action: "Terminate primary database instance (in test environment)"
    expected_outcome: "Application automatically fails over to replica within 2 minutes"
    frequency: "Monthly"
  region_partition:
    description: "Simulate network partition between regions"
    action: "Block network traffic between us-east-1 and us-west-2"
    expected_outcome: "Each region continues serving traffic independently"
    frequency: "Quarterly"
  slow_replication:
    description: "Simulate replication lag"
    action: "Throttle network bandwidth to replication endpoint"
    expected_outcome: "Monitoring alerts on high replication lag, no user impact"
    frequency: "Quarterly"
Tools: AWS Fault Injection Simulator, Chaos Monkey, Gremlin
Runbooks for Common Disaster Scenarios
Runbooks provide step-by-step procedures for responding to specific disasters.
Runbook: Data Center Failure
# Runbook: Complete Data Center / Region Failure
## Symptoms
- All services in primary region (us-east-1) unreachable
- Health checks failing across all endpoints
- AWS dashboard shows region-wide issues
## Impact
- All users unable to access application
- RTO: 30 minutes
- RPO: 2 minutes (replication lag)
## Prerequisites
- Access to AWS console with cross-region permissions
- PagerDuty access
- Slack access (#incidents channel)
## Procedure
### Phase 1: Confirmation (5 minutes)
**1.1 Verify region is actually down (not network issue)**
```bash
# Test from multiple locations
curl -I https://api.company.com/health # Your application
curl -I https://console.aws.amazon.com # AWS itself
```
**1.2 Check AWS Service Health Dashboard**
- https://health.aws.amazon.com/health/status
- Verify us-east-1 has reported issues
**1.3 Declare incident**
```bash
# Create incident in PagerDuty
pd incident create --title "Region failure: us-east-1" --service payment-api
# Post to Slack
# #incidents: "@here P1 incident: us-east-1 region failure. Initiating DR failover."
```
### Phase 2: Failover to DR Region (20 minutes)
**2.1 Promote DR database (us-west-2)**
```bash
# SSH to DR database server
ssh dr-db.us-west-2.internal
# Promote from replica to primary
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/14/main
# Verify promotion
psql -c "SELECT pg_is_in_recovery();" # Should return 'f' (false)
```
**2.2 Update DNS to DR region**
```bash
# Update Route 53 to point to DR region
aws route53 change-resource-record-sets \
--hosted-zone-id ZONEID \
--change-batch file://failover-dns.json
# Verify DNS update
dig api.company.com # Should show DR region IP
```
**2.3 Scale up DR application servers**
```bash
# Increase capacity to match primary region
aws autoscaling set-desired-capacity \
--auto-scaling-group-name payment-api-dr \
--desired-capacity 20
# Monitor scaling
watch aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names payment-api-dr
```
**2.4 Update configuration**
```bash
# Update feature flags to reflect DR mode
curl -X POST https://featureflags.company.com/api/flags \
-d '{"flag": "use_dr_region", "value": true}'
```
### Phase 3: Validation (5 minutes)
**3.1 Smoke tests**
```bash
# Run automated smoke tests
npm run test:smoke:production
# Expected: All tests pass
```
**3.2 Manual verification**
```bash
# Create test payment
curl -X POST https://api.company.com/payments \
-H "Authorization: Bearer $TEST_TOKEN" \
-d '{"amount": 1.00, "currency": "USD"}'
# Verify in database
psql -h dr-db.us-west-2.internal -c \
"SELECT * FROM payments ORDER BY created_at DESC LIMIT 5;"
```
**3.3 Monitor metrics**
- Check Grafana dashboard: https://grafana.company.com/d/dr-status
- Verify traffic flowing to us-west-2
- Verify error rate < 1%
### Phase 4: Communication
**4.1 Internal notification**
```
#incidents: "Failover complete. Now serving from us-west-2. Monitoring for issues."
```
**4.2 External notification (if downtime exceeded 15 min)**
- Update status page: https://status.company.com
- Email customers (use template: dr-failover-notification.html)
### Phase 5: Monitor and Prepare for Failback
**5.1 Continuous monitoring**
- Watch for elevated errors, latency spikes
- Monitor replication lag (if primary region comes back online)
**5.2 When primary region recovers**
- Do NOT immediately fail back
- Wait 24 hours to ensure primary region is stable
- Plan failback during maintenance window
- Execute failback runbook (separate document)
## Rollback
If DR region has issues:
```bash
# Attempt to bring primary region back online
# OR failover to tertiary region (eu-west-1) if configured
```
## Post-Incident
- [ ] Schedule post-mortem within 48 hours
- [ ] Update RTO/RPO actuals
- [ ] Document any gaps in runbook
- [ ] Verify backups from failed region are intact
## Contacts
- Incident Manager: [On-call via PagerDuty]
- AWS Support: 1-800-xxx-xxxx (Premium Support)
- Database DBA: @database-team (Slack)
Runbook: Ransomware Attack
# Runbook: Ransomware / Data Corruption Event
## Symptoms
- Unexpected file encryption
- Database tables dropped or corrupted
- Ransom note found in file system or database
## Impact
- Potential data loss
- Possible service disruption
- Security incident requiring legal/regulatory notification
## STOP - DO NOT
- [BAD] Pay ransom (this decision requires legal counsel and law enforcement/FBI involvement)
- [BAD] Delete anything (preserve evidence)
- [BAD] Restore from backup immediately (backups may also be infected)
## Procedure
### Phase 1: Contain (Immediate)
**1.1 Isolate affected systems**
```bash
# Quarantine the instance: swap its security groups for an empty
# "isolation" group with no inbound or outbound rules, cutting it off
# from the network without powering it down (preserves memory/disk state).
# sg-isolation-quarantine is a placeholder for your pre-created group.
aws ec2 modify-instance-attribute \
  --instance-id i-affected123 \
  --groups sg-isolation-quarantine
```
**1.2 Preserve evidence**
```bash
# Create snapshots before ANY changes
aws ec2 create-snapshot --volume-id vol-affected123 \
--description "Evidence - ransomware incident $(date)"
```
**1.3 Escalate immediately**
- Notify security team: @security-team
- Notify legal: [email protected]
- Create P1 incident in PagerDuty
### Phase 2: Assess (30 minutes)
**2.1 Determine scope**
- Which systems affected?
- When did encryption/corruption start?
- What data is impacted?
**2.2 Find clean backup**
```bash
# List backups chronologically
aws s3 ls s3://company-backups/postgres/full/ | sort -r
# Verify backup is clean (before infection)
# Restore to isolated environment and verify
```
### Phase 3: Recovery
**3.1 Restore from clean backup**
```bash
# Use backup from BEFORE ransomware infection
# Identified clean backup: backup_20240114_110000.dump (day before infection)
pg_restore --dbname=postgres --clean /backups/backup_20240114_110000.dump
```
**3.2 Replay clean transactions**
- If backup is 24 hours old, identify legitimate transactions from corrupted database
- Manually re-enter or use WAL replay (if WAL is clean)
### Phase 4: Security Remediation
**4.1 Rotate all credentials**
```bash
# Change all passwords, API keys, database credentials
# Assumption: attacker may have exfiltrated credentials
```
**4.2 Patch vulnerabilities**
- Identify how ransomware gained access
- Apply security patches
- Review firewall rules, access controls
### Phase 5: Post-Incident
**5.1 Regulatory notification**
- GDPR: 72-hour notification if personal data affected
- PCI-DSS: Immediate notification if payment data affected
**5.2 Customer communication**
- Determine which users' data was accessed/encrypted
- Send notification (legal review required)
**5.3 Post-mortem and prevention**
- How did ransomware gain access?
- Why weren't backups isolated?
- Implement immutable backups (S3 Object Lock)
Communication Plans
Clear communication during disasters minimizes panic and coordinates response.
Internal Communication
Stakeholders to notify:
| Stakeholder | When to Notify | Channel | Information Needed |
|---|---|---|---|
| On-call engineers | Immediately (auto) | PagerDuty | Alert details, runbook link |
| Engineering team | Within 5 min (P1/P2) | Slack #incidents | What's wrong, who's responding, ETA |
| Engineering leadership | Within 15 min (P1) | Slack + Email | Business impact, customer effect, recovery plan |
| Product team | Within 30 min | Slack #product | User-facing features affected, customer impact |
| Customer support | Within 30 min (if users affected) | Slack #support | What to tell customers, expected resolution |
| Executive team | Within 1 hour (P1) or next business day (P2/P3) | Email | Business impact, revenue effect, recovery status |
| Legal/compliance | Immediately (data breach) or next business day (compliance-related) | Email + Phone | Regulatory implications, notification requirements |
Communication template:
## Incident Update: [Title]
**Status**: [Investigating / Mitigating / Resolved]
**Severity**: [P1 / P2 / P3]
**Started**: [Timestamp]
**Last Updated**: [Timestamp]
**Impact**:
- [User-facing description of what's broken]
- [Estimated number of users affected]
- [Business impact, e.g., "Payment processing unavailable"]
**Actions Taken**:
- [Step 1 completed]
- [Step 2 in progress]
**Next Steps**:
- [What we're doing next]
- [ETA for resolution or next update]
**Who's Responding**:
- Incident Commander: @alice
- On-call: @bob
- Database: @carol
External Communication
Customer-facing status page:
## Payment Processing - Degraded Performance
**Current Status**: Investigating
**Started**: Jan 15, 2024 14:30 UTC
**Last Updated**: Jan 15, 2024 14:45 UTC
We are currently investigating elevated latency for payment processing. Some payment requests may take longer than usual or fail. Our team is actively working on a resolution.
**Next Update**: 15:00 UTC or when resolved
When to post external updates:
- P1 incidents affecting >10% users: Update within 15 minutes, every 30 minutes until resolved
- P2 incidents affecting >5% users: Update within 30 minutes, hourly until resolved
- P3 incidents: No external update unless duration exceeds 4 hours
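This policy can be encoded so incident tooling reminds responders when the next external update is due. Thresholds come from the bullets above; the P3 repeat interval is an assumption, since the policy only sets a 4-hour bar:

```python
def external_update_policy(severity, pct_users_affected, duration_hours=0):
    """Encode the external-update rules above.
    Returns (first_update_minutes, repeat_minutes), or None when no
    external update is required."""
    if severity == "P1" and pct_users_affected > 10:
        return (15, 30)
    if severity == "P2" and pct_users_affected > 5:
        return (30, 60)
    if severity == "P3" and duration_hours > 4:
        return (60, 60)  # assumed cadence: hourly once the 4-hour bar is crossed
    return None
```

Wiring this into the incident bot keeps update timing consistent regardless of who is incident commander.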
Post-resolution notification:
## [Resolved] Payment Processing - Degraded Performance
**Status**: Resolved
**Duration**: 47 minutes (14:30 - 15:17 UTC)
Payment processing has been fully restored. All payments are now processing normally.
**What Happened**:
Database connection pool exhaustion caused payment requests to queue. We increased connection pool size and restarted affected services.
**Impact**:
Approximately 15% of payment requests experienced elevated latency (5-10 seconds) or timeouts. No data was lost and all failed requests can be retried.
**Prevention**:
We are implementing automatic connection pool scaling and improved monitoring to detect this scenario earlier.
We apologize for the disruption.
Further Reading
Internal Documentation
- Monitoring and Alerting Strategy - SLI/SLO, error budgets, on-call practices
- Database Design - Replication, backup strategies, ACID guarantees
- Incident Post-Mortems - Blameless retrospectives, learning from incidents
- Spring Boot Resilience - Circuit breakers, retries, timeouts
External Resources
- AWS Disaster Recovery Whitepaper - Comprehensive DR strategies for AWS
- PostgreSQL High Availability - Replication, failover, backup strategies
- Google SRE Book - Disaster Recovery - Data integrity and disaster recovery best practices
- AWS Well-Architected Framework - Reliability Pillar - Backup, recovery, testing
- NIST Contingency Planning Guide - Comprehensive contingency planning framework
Summary
Disaster recovery is insurance - you pay the cost hoping you never need it, but when disaster strikes, it's invaluable.
Key principles:
- Define targets first: RTO and RPO drive all DR decisions and investments
- Test regularly: Untested DR plans will fail when needed
- Automate recovery: Manual procedures are slow and error-prone under pressure
- Document thoroughly: Runbooks must be step-by-step and tested
- Communicate clearly: Internal and external stakeholders need different information
- Learn from incidents: Every disaster improves your DR capability
Implementation roadmap:
- Define RTO/RPO for each service tier (critical, important, non-critical)
- Implement backups matching RPO requirements (daily, hourly, continuous)
- Test backup restores monthly (automated) and quarterly (full drill)
- Build runbooks for top 5 most likely disaster scenarios
- Simulate disasters annually with full team participation
- Iterate and improve based on tests and real incidents
Remember: The best disaster recovery plan is one that's never needed, but when it is, executes flawlessly because it's been tested and refined over time.