Disaster Recovery and Business Continuity
What is Disaster Recovery?
Disaster Recovery (DR) is the process of restoring systems and data after a catastrophic failure. Business Continuity (BC) is the broader strategy ensuring business operations continue during and after a disaster.
Disaster scenarios:
- Infrastructure failures: Data center outage, network partition, cloud region failure
- Data loss: Database corruption, accidental deletion, ransomware encryption
- Application failures: Critical bug causing data corruption, cascading service failures
- Human error: Accidental configuration changes, deployment of broken code
- Security incidents: Ransomware attack, data breach, DDoS attack
- Natural disasters: Fire, flood, earthquake affecting physical infrastructure
The goal of DR is to minimize two critical metrics: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
RTO and RPO: Defining Recovery Targets
RTO and RPO define how much downtime and data loss are acceptable for your business. They drive all DR decisions and investments.
Recovery Time Objective (RTO)
Definition: Maximum acceptable time a system can be down after a disaster.
Question answered: "How long can we be offline before business impact is unacceptable?"
RTO Examples by Service Tier:
| Service Tier | RTO Target | Example Services | Implications |
|---|---|---|---|
| Mission Critical | < 1 hour | Payment processing, authentication | Multi-region active-active, automated failover, 24/7 on-call |
| Business Critical | 1-4 hours | Customer accounts, transaction history | Multi-region active-passive, documented runbooks, business hours support |
| Important | 4-24 hours | Reporting, analytics dashboards | Single region with backups, restore from backup acceptable |
| Non-Critical | 1-7 days | Internal tools, staging environments | Backup-only, manual rebuild acceptable |
Cost increases exponentially as RTO decreases: moving from restore-from-backup (hours) to warm standby (minutes) to active-active multi-region (seconds) roughly multiplies infrastructure and operational cost at each step.
Setting RTO:
Ask stakeholders:
- "How much revenue do we lose per hour of downtime?"
- "What regulatory requirements exist for uptime?"
- "What's the competitive impact of extended outages?"
- "What can we realistically afford to implement?"
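These questions can be turned into a rough cost comparison. A minimal sketch, with entirely hypothetical figures, weighing expected annual downtime loss against the cost of a tighter RTO tier:

```python
# Rough annual-cost comparison for choosing an RTO tier.
# All dollar figures and outage rates below are hypothetical placeholders.

def expected_downtime_loss(revenue_per_hour, outages_per_year, hours_per_outage):
    """Expected annual revenue lost to downtime."""
    return revenue_per_hour * outages_per_year * hours_per_outage

# Hypothetical: $50k/hour revenue, 2 outages per year
loss_with_4h_rto = expected_downtime_loss(50_000, 2, 4)   # $400k/year
loss_with_1h_rto = expected_downtime_loss(50_000, 2, 1)   # $100k/year

# If the tighter-RTO architecture costs less than this difference
# per year, it pays for itself in this scenario.
savings = loss_with_4h_rto - loss_with_1h_rto             # $300k/year
```

The same arithmetic works for regulatory fines or churn estimates: plug in whatever per-hour cost your stakeholders agree on.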
Recovery Point Objective (RPO)
Definition: Maximum acceptable age of data after recovery (how much data loss is tolerable).
Question answered: "How much data can we afford to lose?"
Example: Your database backs up hourly. Disaster strikes at 3:45 PM. Last backup was 3:00 PM. You lose 45 minutes of data (all transactions between 3:00 and 3:45 PM).
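The arithmetic in that example can be expressed directly; a minimal sketch (timestamps hypothetical):

```python
from datetime import datetime

def data_loss_window(last_backup, disaster_time):
    """Realized RPO: age of the newest recoverable data at disaster time."""
    return disaster_time - last_backup

loss = data_loss_window(
    datetime(2024, 1, 15, 15, 0),   # last hourly backup at 3:00 PM
    datetime(2024, 1, 15, 15, 45),  # disaster strikes at 3:45 PM
)
print(loss)  # 0:45:00 -- 45 minutes of transactions lost
```

If the realized window here exceeds your RPO target, the backup interval is too long for that data.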
RPO Examples by Data Type:
| Data Type | RPO Target | Backup Strategy | Example |
|---|---|---|---|
| Financial transactions | Near-zero (seconds) | Continuous replication, transaction logs | Payment records, account balances |
| User-generated content | < 5 minutes | Synchronous multi-region writes | User posts, messages, uploads |
| Operational data | < 1 hour | Asynchronous replication, hourly snapshots | Session data, cache, analytics |
| Historical/archival data | < 24 hours | Daily backups, archive storage | Old reports, audit logs |
| Derived/rebuildable data | No requirement | No backups (rebuild from source) | Caches, search indexes, aggregates |
Setting RPO:
Ask:
- "How much transaction data can we afford to lose?"
- "Can we reconstruct lost data from other sources?"
- "What are legal/regulatory requirements for data retention?"
- "What's the cost of data loss (customer trust, revenue, compliance)?"
RTO vs RPO Trade-offs
RTO and RPO are independent but related:
Quadrant 1 (Low RTO, Low RPO): Most expensive - requires active-active multi-region, continuous replication, automated failover
Quadrant 2 (High RTO, Low RPO): Data precious but downtime acceptable - focus on backups, can restore slowly
Quadrant 3 (High RTO, High RPO): Standard backup strategy - daily backups, manual restore
Quadrant 4 (Low RTO, High RPO): Fast recovery but data loss acceptable - stateless services, quick redeploy
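The four quadrants can be captured as a small classifier. A sketch, where the 4-hour low/high cutoff is an illustrative assumption, not a standard:

```python
def dr_quadrant(rto_hours, rpo_hours, threshold_hours=4):
    """Map RTO/RPO targets to the quadrant strategies above.
    The 4-hour threshold is an illustrative cutoff, not an industry standard."""
    low_rto = rto_hours <= threshold_hours
    low_rpo = rpo_hours <= threshold_hours
    if low_rto and low_rpo:
        return "active-active multi-region, continuous replication, automated failover"
    if not low_rto and low_rpo:
        return "strong backups, slow restore acceptable"
    if not low_rto and not low_rpo:
        return "standard daily backups, manual restore"
    return "stateless services, quick redeploy, data loss acceptable"
```

In practice each service lands in its own quadrant, so run the classification per service tier rather than once for the whole system.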
Backup Strategies
Backups are the foundation of disaster recovery. Without backups, you have no recovery capability.
Backup Types
1. Full Backup
Copies all data every time. Simple but slow and storage-intensive.
Pros:
- Simple recovery (single backup contains everything)
- No dependency on previous backups
- Fastest restore (single source)
Cons:
- Slowest backup (copies everything every time)
- Highest storage cost (duplicates unchanged data)
- Resource-intensive (impacts production during backup)
Use case: Small databases (< 100 GB), weekly full backups
2. Incremental Backup
Backs up only data changed since last backup (full or incremental).
Recovery: Restore full backup, then apply each incremental in sequence.
Pros:
- Fastest backups (only changed data)
- Lowest storage usage
- Minimal production impact
Cons:
- Slowest recovery (need full + all incrementals)
- Complex restore process
- Dependency chain (if one incremental is corrupted, later ones are unusable)
Use case: Large databases with frequent backups
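The restore sequence (full backup first, then every incremental in order) and the chain-dependency risk can be sketched as follows. Backup contents and the `apply` merge function are hypothetical stand-ins for a real restore tool:

```python
def restore_incremental_chain(full_backup, incrementals, apply):
    """Restore a full backup, then replay incrementals in order.
    `apply` is a caller-supplied merge function. If any incremental is
    corrupt, every later one in the chain becomes unusable."""
    state = apply({}, full_backup)
    for i, inc in enumerate(incrementals):
        if inc.get("corrupt"):
            raise RuntimeError(
                f"incremental {i} corrupt; {len(incrementals) - i} backups unusable")
        state = apply(state, inc)
    return state

# Usage sketch: each backup is a dict of keys changed since the last backup
merge = lambda state, backup: {**state, **backup.get("data", {})}
restored = restore_incremental_chain(
    {"data": {"a": 1, "b": 1}},                  # full backup
    [{"data": {"b": 2}}, {"data": {"c": 3}}],    # two incrementals
    merge,
)
# restored == {"a": 1, "b": 2, "c": 3}
```

Note how a corrupt incremental aborts the whole replay; this is why incremental chains need integrity checks on every link, not just the full backup.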
3. Differential Backup
Backs up data changed since last full backup.
Recovery: Restore full backup + most recent differential (only 2 files).
Pros:
- Faster recovery than incremental (only 2 restore operations)
- Simpler than incremental (no chain dependency)
Cons:
- Slower backups than incremental (each differential grows over week)
- More storage than incremental
Use case: Balance between backup speed and recovery speed
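The trade-off shows up clearly in a toy storage model: each incremental holds only one day's changes, while each differential re-copies everything since the last full backup, so it grows through the week. Figures are hypothetical:

```python
def weekly_storage(daily_change_gb, strategy):
    """Toy model: backup storage used by 6 daily backups after a Sunday full.
    Assumes a constant daily change rate, which real workloads rarely have."""
    if strategy == "incremental":
        return daily_change_gb * 6                            # one day's delta each
    if strategy == "differential":
        return sum(daily_change_gb * d for d in range(1, 7))  # grows each day
    raise ValueError(strategy)

print(weekly_storage(10, "incremental"))   # 60 GB
print(weekly_storage(10, "differential"))  # 210 GB
```

The differential's extra storage buys a two-step restore (full + latest differential) instead of a seven-step chain.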
4. Continuous Data Protection (CDP)
Captures every change as it happens (transaction log streaming, database replication).
Pros:
- Near-zero RPO (only lose in-flight transactions)
- Point-in-time recovery to any second
- Fastest recovery (replica is always current)
Cons:
- Most expensive (continuous streaming, storage, replica infrastructure)
- Complex to manage
- Requires specific database features (logical replication, WAL archiving)
Use case: Mission-critical systems with strict RPO requirements
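In spirit, CDP is just a consumer that applies every change event to a replica as it arrives, so the replica is never more than one in-flight event behind. A toy sketch with a dict standing in for the replica (no real replication protocol involved):

```python
def apply_change_stream(replica, changes):
    """Toy CDP consumer: apply each change event to the replica in order,
    keeping it continuously current (near-zero RPO)."""
    for change in changes:
        if change["op"] == "upsert":
            replica[change["key"]] = change["value"]
        elif change["op"] == "delete":
            replica.pop(change["key"], None)
    return replica

replica = apply_change_stream({}, [
    {"op": "upsert", "key": "order:1", "value": 99.99},
    {"op": "upsert", "key": "order:2", "value": 149.50},
    {"op": "delete", "key": "order:1"},
])
# replica == {"order:2": 149.5}
```

Real implementations (WAL streaming, logical replication) add ordering guarantees, checkpoints, and failure recovery on top of this basic apply loop.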
Backup Strategy Selection Matrix
| Data Size | Change Rate | RPO Target | Recommended Strategy |
|---|---|---|---|
| < 100 GB | Low | 24 hours | Daily full backups |
| < 100 GB | High | < 1 hour | Hourly full backups + transaction logs |
| 100 GB - 1 TB | Low | 24 hours | Weekly full + daily differential |
| 100 GB - 1 TB | High | < 1 hour | Daily full + hourly incremental + transaction logs |
| > 1 TB | Low | 24 hours | Weekly full + daily incremental |
| > 1 TB | High | < 5 min | Continuous replication + transaction log streaming |
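The matrix above can be encoded directly so tooling can suggest a strategy. A sketch using the same thresholds and recommendation strings as the table (the RPO column correlates with the change rate, so it is folded into `change_rate` here):

```python
def backup_strategy(size_gb, change_rate):
    """Encode the selection matrix above. change_rate is 'low' or 'high';
    the table's RPO column tracks change rate, so it is implied here."""
    high = change_rate == "high"
    if size_gb < 100:
        return ("hourly full backups + transaction logs" if high
                else "daily full backups")
    if size_gb <= 1024:  # 100 GB - 1 TB
        return ("daily full + hourly incremental + transaction logs" if high
                else "weekly full + daily differential")
    return ("continuous replication + transaction log streaming" if high
            else "weekly full + daily incremental")
```

Treat the output as a starting point: a 90 GB database growing fast will cross the 100 GB boundary soon, so pick the strategy for where the data will be, not where it is.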
Backup Implementation Example (PostgreSQL)
#!/bin/bash
# Daily full backup with transaction log archiving
# Configuration
DB_NAME="payments_db"
BACKUP_DIR="/backups/postgres"
S3_BUCKET="s3://company-backups/postgres"
RETENTION_DAYS=30
# Timestamp for this backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_full_${TIMESTAMP}.dump"
# 1. Full backup using pg_dump
echo "Starting full backup..."
pg_dump \
--format=custom \
--compress=9 \
--file="${BACKUP_FILE}" \
--verbose \
"${DB_NAME}"
# 2. Verify backup integrity
echo "Verifying backup..."
pg_restore --list "${BACKUP_FILE}" > /dev/null
if [ $? -eq 0 ]; then
echo "✓ Backup verified successfully"
else
echo "✗ Backup verification failed!"
exit 1
fi
# 3. Upload to S3
echo "Uploading to S3..."
aws s3 cp "${BACKUP_FILE}" "${S3_BUCKET}/full/" \
--storage-class STANDARD_IA \
--metadata "retention-days=${RETENTION_DAYS},timestamp=${TIMESTAMP}"
# 4. Cleanup old local backups (keep last 7 days locally)
find "${BACKUP_DIR}" -name "*.dump" -mtime +7 -delete
# 5. Archive transaction logs (WAL) for point-in-time recovery
# Configured in postgresql.conf:
# wal_level = replica
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://company-backups/postgres/wal/%f'
echo "Backup completed: ${BACKUP_FILE}"
Transaction log archiving (postgresql.conf):
# Enable WAL archiving for point-in-time recovery
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://company-backups/postgres/wal/%f'
archive_timeout = 300 # Archive every 5 minutes even if WAL not full
# Replication for near-zero RPO
max_wal_senders = 5
wal_keep_size = 1GB
Backup Testing and Validation
Critical principle: Untested backups are not backups. You must regularly verify you can restore.
Backup testing schedule:
backup_testing:
  daily:
    - name: "Automated integrity check"
      action: "Verify backup file is not corrupted"
      tool: "pg_restore --list"
  weekly:
    - name: "Sample restore to staging"
      action: "Restore random backup to staging environment"
      validation: "Run smoke tests, query key tables"
  monthly:
    - name: "Full restore drill"
      action: "Complete restore from backup to isolated environment"
      validation: "Application teams verify data integrity"
  quarterly:
    - name: "Disaster recovery simulation"
      action: "Simulate complete data center failure"
      validation: "Restore and failover to DR region, end-to-end testing"
Automated restore verification:
import boto3
import subprocess

def verify_backup(backup_file):
    """Automated backup verification process.
    alert(), run_query(), and validate_result() are assumed helpers
    (alerting and database utilities) provided elsewhere."""
    # 1. Download backup from S3
    s3 = boto3.client('s3')
    s3.download_file('company-backups', f'postgres/full/{backup_file}', f'/tmp/{backup_file}')
    # 2. Verify integrity
    result = subprocess.run(
        ['pg_restore', '--list', f'/tmp/{backup_file}'],
        capture_output=True
    )
    if result.returncode != 0:
        alert(f"Backup verification failed: {backup_file}")
        return False
    # 3. Restore to test database
    subprocess.run([
        'pg_restore',
        '--dbname=test_restore_db',
        '--clean',
        '--if-exists',
        f'/tmp/{backup_file}'
    ])
    # 4. Run data integrity checks
    checks = [
        "SELECT COUNT(*) FROM payments",                  # Ensure tables exist
        "SELECT MAX(created_at) FROM payments",           # Check data freshness
        "SELECT COUNT(*) FROM users WHERE email IS NULL"  # Constraint validation
    ]
    for check in checks:
        result = run_query('test_restore_db', check)
        if not validate_result(result):
            alert(f"Data integrity check failed: {check}")
            return False
    return True
Database Backup and Restore Procedures
Databases require special consideration for backups due to transaction consistency, referential integrity, and large data volumes.
Point-in-Time Recovery (PITR)
PITR enables restoring to any specific moment, not just backup snapshots. This is critical when:
- Recovering from data corruption (restore to moment before corruption)
- Investigating security incidents (restore to specific time to examine data)
- Recovering from accidental deletion (restore to 5 minutes before DROP TABLE)
How PITR works: restore the most recent physical base backup, then replay the archived WAL (transaction log) forward, stopping when the requested target timestamp is reached.
PostgreSQL PITR restore:
#!/bin/bash
# Point-in-time recovery to specific timestamp.
# PITR requires a physical base backup (pg_basebackup) plus archived WAL;
# a logical pg_dump backup cannot be combined with WAL replay.
TARGET_TIME="2024-01-15 14:29:00"  # 1 minute before disaster
BASE_BACKUP="/backups/base_backup_20240114_230000.tar.gz"
WAL_ARCHIVE="s3://company-backups/postgres/wal/"
# 1. Stop PostgreSQL
systemctl stop postgresql
# 2. Remove corrupted data directory
rm -rf /var/lib/postgresql/14/main/*
# 3. Restore physical base backup into the data directory
tar -xzf "${BASE_BACKUP}" -C /var/lib/postgresql/14/main/
# 4. Configure recovery (PostgreSQL 12+ removed recovery.conf; set the
#    recovery parameters in postgresql.conf and create recovery.signal)
cat >> /var/lib/postgresql/14/main/postgresql.conf << EOF
restore_command = 'aws s3 cp ${WAL_ARCHIVE}%f %p'
recovery_target_time = '${TARGET_TIME}'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/14/main/recovery.signal
# 5. Start PostgreSQL - it will replay WAL to target time
systemctl start postgresql
# 6. Verify recovery
psql -c "SELECT NOW(), MAX(created_at) FROM payments;"
Logical vs Physical Backups
Logical Backup (pg_dump, mysqldump):
Exports data as SQL statements or custom format.
-- Example logical backup output
CREATE TABLE payments (
    id SERIAL PRIMARY KEY,
    amount DECIMAL(10,2),
    created_at TIMESTAMP
);
INSERT INTO payments VALUES (1, 99.99, '2024-01-15 10:00:00');
INSERT INTO payments VALUES (2, 149.50, '2024-01-15 10:05:00');
Pros:
- Portable across PostgreSQL versions
- Human-readable (SQL format)
- Can restore individual tables
- Platform-independent
Cons:
- Slower for large databases
- Larger backup files
- Cannot do PITR
- Must replay all INSERT statements during restore
Physical Backup (pg_basebackup, filesystem snapshots):
Copies actual database files (data pages, indexes, transaction logs).
Pros:
- Faster backup and restore (file copy, not SQL replay)
- Supports PITR (when combined with WAL archiving)
- Exact replica of database state
Cons:
- Version-specific (PostgreSQL 14 backup won't restore to PostgreSQL 15)
- All-or-nothing (can't restore single table)
- Requires identical architecture (can't restore from Linux to Windows)
Recommendation: Use both:
- Physical backups for disaster recovery (fast restore, PITR)
- Logical backups for migrations, development, selective restore
Backup Encryption and Security
Backups often contain sensitive data and must be secured.
Encryption at rest:
# Encrypt backup before uploading to S3
pg_dump payments_db | \
gzip | \
openssl enc -aes-256-cbc -salt -pbkdf2 -pass file:/etc/backup-encryption-key | \
aws s3 cp - s3://company-backups/postgres/encrypted/backup_$(date +%Y%m%d).dump.gz.enc
S3 server-side encryption:
# Let AWS handle encryption (simpler, managed keys)
aws s3 cp backup.dump s3://company-backups/postgres/ \
--server-side-encryption AES256 \
--storage-class STANDARD_IA
Access controls:
# S3 bucket policy - only the backup service can write; only the DR restore
# role can read, and only from the corporate network (10.0.0.0/8, i.e. VPN)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ACCOUNT:role/backup-service"},
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::company-backups/postgres/*"
    },
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ACCOUNT:role/dr-restore-service"},
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-backups/postgres/*",
      "Condition": {
        "IpAddress": {"aws:SourceIp": "10.0.0.0/8"}
      }
    }
  ]
}
Backup retention and lifecycle:
# S3 lifecycle policy - automated tiering and deletion
aws s3api put-bucket-lifecycle-configuration \
--bucket company-backups \
--lifecycle-configuration file://lifecycle.json
lifecycle.json:
{
  "Rules": [
    {
      "Id": "postgres-backup-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "postgres/full/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 365}
    }
  ]
}
Multi-Region Failover
Multi-region architecture provides resilience against entire region failures (data center outages, natural disasters, network partitions).
Active-Passive Failover
Architecture: Primary region handles all traffic. Secondary region is on standby, receives replicated data but serves no traffic until failover.
Pros:
- Lower cost (DR resources can be scaled down)
- Simpler management (single active region)
- No split-brain risk (only one region writes)
Cons:
- Slower failover (RTO: 5-30 minutes to promote standby)
- Data loss possible (RPO: seconds to minutes depending on replication lag)
- DR resources idle (not serving production traffic)
Implementation (PostgreSQL streaming replication):
# Primary database (us-east-1) - postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
synchronous_commit = on  # Near-zero RPO requires synchronous_standby_names to be set as well
# Replica database (us-west-2) - postgresql.conf (PostgreSQL 12+ has no
# recovery.conf; create an empty standby.signal file in the data directory)
primary_conninfo = 'host=db-primary.us-east-1.internal port=5432 user=replicator'
primary_slot_name = 'standby_slot'
hot_standby = on  # Allow read queries on replica
Failover process:
#!/bin/bash
# Automated failover to DR region
set -e
# 1. Detect primary failure (in production, require several consecutive
#    failed health checks before failing over, to avoid flapping on transient errors)
if ! pg_isready -h db-primary.us-east-1.internal; then
echo "Primary database unhealthy - initiating failover"
# 2. Promote replica to primary
ssh dr-db.us-west-2.internal "pg_ctl promote -D /var/lib/postgresql/14/main"
# 3. Wait for promotion
sleep 10
# 4. Verify replica is now accepting writes
psql -h dr-db.us-west-2.internal -c "SELECT pg_is_in_recovery();" # Should return false
# 5. Update DNS to point to DR region (Route 53 example)
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch file://dns-failover.json
# 6. Scale up application servers in DR region
aws autoscaling set-desired-capacity \
--auto-scaling-group-name payment-service-dr \
--desired-capacity 10 # Match primary region capacity
# 7. Verify traffic is flowing to DR region
curl https://api.company.com/health
echo "Failover complete - now serving from us-west-2"
else
echo "Primary healthy - no action needed"
fi
DNS failover (Route 53):
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.company.com",
      "Type": "A",
      "SetIdentifier": "Primary",
      "Failover": "PRIMARY",
      "TTL": 60,
      "ResourceRecords": [{"Value": "1.2.3.4"}],
      "HealthCheckId": "abc123"
    }
  }, {
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.company.com",
      "Type": "A",
      "SetIdentifier": "Secondary",
      "Failover": "SECONDARY",
      "TTL": 60,
      "ResourceRecords": [{"Value": "5.6.7.8"}]
    }
  }]
}
Active-Active Failover
Architecture: Both regions handle production traffic simultaneously. Users are routed to nearest region.
Pros:
- Near-zero RTO (traffic automatically reroutes to healthy region)
- Better user experience (users route to nearest region - lower latency)
- Higher utilization (all resources serve production traffic)
Cons:
- Most expensive (full capacity in both regions)
- Complex data consistency (must handle write conflicts)
- Risk of split-brain (network partition causes divergence)
Use case: Mission-critical services where downtime is not acceptable and latency matters (payment processing, authentication).
Write conflict resolution:
When both regions accept writes simultaneously, conflicts can occur:
-- Region 1: User updates email at 10:00:01
UPDATE users SET email = '[email protected]' WHERE id = 123;
-- Region 2: Same user updates email at 10:00:02 (before replication)
UPDATE users SET email = '[email protected]' WHERE id = 123;
-- Conflict! Which email is correct?
Conflict resolution strategies:
- Last-write-wins (LWW): Use timestamp to determine winner
-- Schema includes timestamp for conflict resolution
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    updated_at TIMESTAMP DEFAULT NOW(),
    updated_in_region VARCHAR(50)
);
-- Conflict resolution: keep update with latest timestamp
-- If Region 2 update is 10:00:02 and Region 1 is 10:00:01, Region 2 wins
- Application-specific logic: Domain knowledge determines winner
// Example: For user profiles, prefer update from user's home region
public class ConflictResolver {
    public User resolveConflict(User region1Version, User region2Version, User originalVersion) {
        // If user's home region is us-east-1, prefer changes from that region
        if (originalVersion.getHomeRegion().equals("us-east-1")) {
            return region1Version;
        } else {
            return region2Version;
        }
    }
}
- CRDTs (Conflict-free Replicated Data Types): Mathematically guarantee eventual consistency
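As a taste of the CRDT approach, a grow-only counter (G-counter) gives each region its own slot and merges replicas with an element-wise maximum, so concurrent increments in two regions never conflict and merge order never matters:

```python
class GCounter:
    """Grow-only counter CRDT: each region increments only its own slot;
    merge takes the per-region maximum, so replicas always converge."""
    def __init__(self):
        self.counts = {}  # region -> count

    def increment(self, region, n=1):
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())

# Both regions accept writes concurrently, then replicate to each other
east, west = GCounter(), GCounter()
east.increment("us-east-1", 3)
west.increment("us-west-2", 2)
east.merge(west)
west.merge(east)
# Both replicas converge: east.value() == west.value() == 5
```

Counters, sets, and registers all have CRDT formulations, but they only fit data whose operations commute; arbitrary relational updates (like the email example above) do not.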
Recommendation: Avoid multi-master writes when possible. Use regional affinity (users always write to same region) to minimize conflicts.
Load Balancer Failover
Route traffic away from failed region using health checks:
Global load balancer (AWS Global Accelerator, Cloudflare Load Balancing):
global_load_balancer:
  endpoints:
    - region: us-east-1
      ip: 1.2.3.4
      weight: 50
      health_check:
        path: /health
        interval: 10s
        timeout: 5s
        unhealthy_threshold: 3
    - region: us-west-2
      ip: 5.6.7.8
      weight: 50
      health_check:
        path: /health
        interval: 10s
        timeout: 5s
        unhealthy_threshold: 3
  routing_policy: latency  # Route to region with lowest latency
  failover: automatic      # If health check fails, remove from pool
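The routing policy above boils down to: drop endpoints whose health checks have failed too many times in a row, then route to the lowest-latency survivor. A sketch with hypothetical endpoint data:

```python
def pick_endpoint(endpoints):
    """Mimic the load balancer policy above: exclude endpoints whose
    consecutive failures reached the unhealthy threshold, then route
    to the lowest-latency healthy region."""
    healthy = [e for e in endpoints if e["failures"] < e["unhealthy_threshold"]]
    if not healthy:
        raise RuntimeError("no healthy regions available")
    return min(healthy, key=lambda e: e["latency_ms"])

endpoints = [
    {"region": "us-east-1", "latency_ms": 20, "failures": 3, "unhealthy_threshold": 3},
    {"region": "us-west-2", "latency_ms": 70, "failures": 0, "unhealthy_threshold": 3},
]
print(pick_endpoint(endpoints)["region"])  # us-west-2 (us-east-1 marked unhealthy)
```

Note the nearest region loses to the farther healthy one as soon as it crosses the failure threshold; that automatic demotion is the whole point of health-checked routing.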
Health check endpoint:
@RestController
public class HealthController {

    private DataSource dataSource;  // injected by Spring

    @GetMapping("/health")
    public ResponseEntity<HealthStatus> health() {
        // Comprehensive health check
        boolean databaseHealthy = checkDatabase();
        boolean dependenciesHealthy = checkExternalDependencies();
        boolean resourcesHealthy = checkResourceAvailability();
        if (databaseHealthy && dependenciesHealthy && resourcesHealthy) {
            return ResponseEntity.ok(new HealthStatus("healthy"));
        } else {
            // Fail health check - load balancer will remove this region from pool
            return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
                    .body(new HealthStatus("unhealthy"));
        }
    }

    private boolean checkDatabase() {
        try {
            dataSource.getConnection().close();
            return true;
        } catch (SQLException e) {
            return false;
        }
    }
}
Disaster Recovery Testing
DR plans that aren't tested regularly will fail when needed. Testing verifies your procedures work and trains your team.
Annual DR Test
Simulate complete disaster scenario annually (minimum):
annual_dr_test:
  objective: "Verify complete recovery from primary region failure"
  preparation:
    - schedule: "Schedule test 4 weeks in advance"
    - communication: "Notify all stakeholders (engineering, product, leadership)"
    - maintenance_window: "Schedule during low-traffic period (e.g., Sunday 2-6 AM)"
    - rollback_plan: "Document rollback procedure if test fails"
  test_scenario:
    - simulate: "Primary region (us-east-1) becomes unavailable"
    - method: "Disable primary region in load balancer (do NOT destroy infrastructure)"
  execution:
    - step1: "Trigger failover procedure at 2:00 AM"
    - step2: "Promote DR database to primary (us-west-2)"
    - step3: "Update DNS to point to DR region"
    - step4: "Scale up DR application servers to full capacity"
    - step5: "Verify traffic flowing to DR region"
  validation:
    - functional_tests: "Run automated smoke tests against DR region"
    - performance_tests: "Verify latency and throughput meet SLOs"
    - data_integrity: "Verify recent transactions present in DR database"
    - end_to_end: "Complete manual user journey (login, transaction, logout)"
  metrics:
    - rto_actual: "Measure time from failure to full recovery"
    - rpo_actual: "Measure data loss (transactions missing)"
    - issues_found: "Document any failures or gaps"
  post_test:
    - failback: "Return to primary region (tests failback procedure)"
    - retrospective: "Team meeting to review results"
    - action_items: "Update runbooks based on learnings"
    - next_test: "Schedule next test"
Quarterly Component Tests
Test individual components quarterly without full failover:
quarterly_component_tests:
  database_restore:
    frequency: "Quarterly"
    action: "Restore yesterday's backup to isolated test environment"
    validation: "Query key tables, verify row counts"
    duration: "2 hours"
  dns_failover:
    frequency: "Quarterly"
    action: "Update test subdomain (test.company.com) to point to DR region"
    validation: "Verify DNS propagation, test application access"
    duration: "30 minutes"
  runbook_walkthrough:
    frequency: "Quarterly"
    action: "On-call engineer executes runbook step-by-step (dry run)"
    validation: "Verify all commands work, all access credentials valid"
    duration: "1 hour"
Chaos Engineering for DR
Proactively inject failures to test resilience:
chaos_experiments:
  database_failover:
    description: "Simulate primary database failure during business hours"
    action: "Terminate primary database instance (in test environment)"
    expected_outcome: "Application automatically fails over to replica within 2 minutes"
    frequency: "Monthly"
  region_partition:
    description: "Simulate network partition between regions"
    action: "Block network traffic between us-east-1 and us-west-2"
    expected_outcome: "Each region continues serving traffic independently"
    frequency: "Quarterly"
  slow_replication:
    description: "Simulate replication lag"
    action: "Throttle network bandwidth to replication endpoint"
    expected_outcome: "Monitoring alerts on high replication lag, no user impact"
    frequency: "Quarterly"
Tools: AWS Fault Injection Simulator, Chaos Monkey, Gremlin
Runbooks for Common Disaster Scenarios
Runbooks provide step-by-step procedures for responding to specific disasters.
Runbook: Data Center Failure
# Runbook: Complete Data Center / Region Failure
## Symptoms
- All services in primary region (us-east-1) unreachable
- Health checks failing across all endpoints
- AWS dashboard shows region-wide issues
## Impact
- All users unable to access application
- RTO: 30 minutes
- RPO: 2 minutes (replication lag)
## Prerequisites
- Access to AWS console with cross-region permissions
- PagerDuty access
- Slack access (#incidents channel)
## Procedure
### Phase 1: Confirmation (5 minutes)
**1.1 Verify region is actually down (not network issue)**
```bash
# Test from multiple locations
curl -I https://api.company.com/health # Your application
curl -I https://console.aws.amazon.com # AWS itself
```
**1.2 Check AWS Service Health Dashboard**
- https://health.aws.amazon.com/health/status
- Verify us-east-1 has reported issues
**1.3 Declare incident**
```bash
# Create incident in PagerDuty
pd incident create --title "Region failure: us-east-1" --service payment-api
# Post to Slack
# #incidents: "@here P1 incident: us-east-1 region failure. Initiating DR failover."
```
### Phase 2: Failover to DR Region (20 minutes)
**2.1 Promote DR database (us-west-2)**
```bash
# SSH to DR database server
ssh dr-db.us-west-2.internal
# Promote from replica to primary
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/14/main
# Verify promotion
psql -c "SELECT pg_is_in_recovery();" # Should return 'f' (false)
```
**2.2 Update DNS to DR region**
```bash
# Update Route 53 to point to DR region
aws route53 change-resource-record-sets \
--hosted-zone-id ZONEID \
--change-batch file://failover-dns.json
# Verify DNS update
dig api.company.com # Should show DR region IP
```
**2.3 Scale up DR application servers**
```bash
# Increase capacity to match primary region
aws autoscaling set-desired-capacity \
--auto-scaling-group-name payment-api-dr \
--desired-capacity 20
# Monitor scaling
watch aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names payment-api-dr
```
**2.4 Update configuration**
```bash
# Update feature flags to reflect DR mode
curl -X POST https://featureflags.company.com/api/flags \
-d '{"flag": "use_dr_region", "value": true}'
```
### Phase 3: Validation (5 minutes)
**3.1 Smoke tests**
```bash
# Run automated smoke tests
npm run test:smoke:production
# Expected: All tests pass
```
**3.2 Manual verification**
```bash
# Create test payment
curl -X POST https://api.company.com/payments \
-H "Authorization: Bearer $TEST_TOKEN" \
-d '{"amount": 1.00, "currency": "USD"}'
# Verify in database
psql -h dr-db.us-west-2.internal -c \
"SELECT * FROM payments ORDER BY created_at DESC LIMIT 5;"
```
**3.3 Monitor metrics**
- Check Grafana dashboard: https://grafana.company.com/d/dr-status
- Verify traffic flowing to us-west-2
- Verify error rate < 1%
### Phase 4: Communication
**4.1 Internal notification**
```
#incidents: "Failover complete. Now serving from us-west-2. Monitoring for issues."
```
**4.2 External notification (if downtime exceeded 15 min)**
- Update status page: https://status.company.com
- Email customers (use template: dr-failover-notification.html)
### Phase 5: Monitor and Prepare for Failback
**5.1 Continuous monitoring**
- Watch for elevated errors, latency spikes
- Monitor replication lag (if primary region comes back online)
**5.2 When primary region recovers**
- Do NOT immediately fail back
- Wait 24 hours to ensure primary region is stable
- Plan failback during maintenance window
- Execute failback runbook (separate document)
## Rollback
If DR region has issues:
```bash
# Attempt to bring primary region back online
# OR failover to tertiary region (eu-west-1) if configured
```
## Post-Incident
- [ ] Schedule post-mortem within 48 hours
- [ ] Update RTO/RPO actuals
- [ ] Document any gaps in runbook
- [ ] Verify backups from failed region are intact
## Contacts
- Incident Manager: [On-call via PagerDuty]
- AWS Support: 1-800-xxx-xxxx (Premium Support)
- Database DBA: @database-team (Slack)
Runbook: Ransomware Attack
# Runbook: Ransomware / Data Corruption Event
## Symptoms
- Unexpected file encryption
- Database tables dropped or corrupted
- Ransom note found in file system or database
## Impact
- Potential data loss
- Possible service disruption
- Security incident requiring legal/regulatory notification
## STOP - DO NOT
- [BAD] Pay ransom (this decision requires legal counsel and law enforcement/FBI involvement)
- [BAD] Delete anything (preserve evidence)
- [BAD] Restore from backup immediately (backups may also be infected)
## Procedure
### Phase 1: Contain (Immediate)
**1.1 Isolate affected systems**
```bash
# Quarantine the instance: swap its security groups for an empty
# "isolation" group with no inbound or outbound rules, cutting it off
# from the network without powering it down (preserves memory/disk state).
# sg-isolation-quarantine is a placeholder for your pre-created group.
aws ec2 modify-instance-attribute \
  --instance-id i-affected123 \
  --groups sg-isolation-quarantine
```
**1.2 Preserve evidence**
```bash
# Create snapshots before ANY changes
aws ec2 create-snapshot --volume-id vol-affected123 \
--description "Evidence - ransomware incident $(date)"
```
**1.3 Escalate immediately**
- Notify security team: @security-team
- Notify legal: [email protected]
- Create P1 incident in PagerDuty
### Phase 2: Assess (30 minutes)
**2.1 Determine scope**
- Which systems affected?
- When did encryption/corruption start?
- What data is impacted?
**2.2 Find clean backup**
```bash
# List backups chronologically
aws s3 ls s3://company-backups/postgres/full/ | sort -r
# Verify backup is clean (before infection)
# Restore to isolated environment and verify
```
### Phase 3: Recovery
**3.1 Restore from clean backup**
```bash
# Use backup from BEFORE ransomware infection
# Identified clean backup: backup_20240114_110000.dump (day before infection)
pg_restore --dbname=postgres --clean /backups/backup_20240114_110000.dump
```
**3.2 Replay clean transactions**
- If backup is 24 hours old, identify legitimate transactions from corrupted database
- Manually re-enter or use WAL replay (if WAL is clean)
### Phase 4: Security Remediation
**4.1 Rotate all credentials**
```bash
# Change all passwords, API keys, database credentials
# Assumption: attacker may have exfiltrated credentials
```
**4.2 Patch vulnerabilities**
- Identify how ransomware gained access
- Apply security patches
- Review firewall rules, access controls
### Phase 5: Post-Incident
**5.1 Regulatory notification**
- GDPR: 72-hour notification if personal data affected
- PCI-DSS: Immediate notification if payment data affected
**5.2 Customer communication**
- Determine which users' data was accessed/encrypted
- Send notification (legal review required)
**5.3 Post-mortem and prevention**
- How did ransomware gain access?
- Why weren't backups isolated?
- Implement immutable backups (S3 Object Lock)
Communication Plans
Clear communication during disasters minimizes panic and coordinates response.
Internal Communication
Stakeholders to notify:
| Stakeholder | When to Notify | Channel | Information Needed |
|---|---|---|---|
| On-call engineers | Immediately (auto) | PagerDuty | Alert details, runbook link |
| Engineering team | Within 5 min (P1/P2) | Slack #incidents | What's wrong, who's responding, ETA |
| Engineering leadership | Within 15 min (P1) | Slack + Email | Business impact, customer effect, recovery plan |
| Product team | Within 30 min | Slack #product | User-facing features affected, customer impact |
| Customer support | Within 30 min (if users affected) | Slack #support | What to tell customers, expected resolution |
| Executive team | Within 1 hour (P1) or next business day (P2/P3) | Email | Business impact, revenue effect, recovery status |
| Legal/compliance | Immediately (data breach) or next business day (compliance-related) | Email + Phone | Regulatory implications, notification requirements |
Communication template:
## Incident Update: [Title]
**Status**: [Investigating / Mitigating / Resolved]
**Severity**: [P1 / P2 / P3]
**Started**: [Timestamp]
**Last Updated**: [Timestamp]
**Impact**:
- [User-facing description of what's broken]
- [Estimated number of users affected]
- [Business impact, e.g., "Payment processing unavailable"]
**Actions Taken**:
- [Step 1 completed]
- [Step 2 in progress]
**Next Steps**:
- [What we're doing next]
- [ETA for resolution or next update]
**Who's Responding**:
- Incident Commander: @alice
- On-call: @bob
- Database: @carol
External Communication
Customer-facing status page:
## Payment Processing - Degraded Performance
**Current Status**: Investigating
**Started**: Jan 15, 2024 14:30 UTC
**Last Updated**: Jan 15, 2024 14:45 UTC
We are currently investigating elevated latency for payment processing. Some payment requests may take longer than usual or fail. Our team is actively working on a resolution.
**Next Update**: 15:00 UTC or when resolved
When to post external updates:
- P1 incidents affecting >10% users: Update within 15 minutes, every 30 minutes until resolved
- P2 incidents affecting >5% users: Update within 30 minutes, hourly until resolved
- P3 incidents: No external update unless duration exceeds 4 hours
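This policy can be encoded so incident tooling reminds responders when the next external update is due. Thresholds come from the bullets above; the P3 repeat interval is an assumption, since the policy only sets a 4-hour bar:

```python
def external_update_policy(severity, pct_users_affected, duration_hours=0):
    """Encode the external-update rules above.
    Returns (first_update_minutes, repeat_minutes), or None when no
    external update is required."""
    if severity == "P1" and pct_users_affected > 10:
        return (15, 30)
    if severity == "P2" and pct_users_affected > 5:
        return (30, 60)
    if severity == "P3" and duration_hours > 4:
        return (60, 60)  # assumed cadence: hourly once the 4-hour bar is crossed
    return None
```

Wiring this into the incident bot keeps update timing consistent regardless of who is incident commander.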
Post-resolution notification:
## [Resolved] Payment Processing - Degraded Performance
**Status**: Resolved
**Duration**: 47 minutes (14:30 - 15:17 UTC)
Payment processing has been fully restored. All payments are now processing normally.
**What Happened**:
Database connection pool exhaustion caused payment requests to queue. We increased connection pool size and restarted affected services.
**Impact**:
Approximately 15% of payment requests experienced elevated latency (5-10 seconds) or timeouts. No data was lost and all failed requests can be retried.
**Prevention**:
We are implementing automatic connection pool scaling and improved monitoring to detect this scenario earlier.
We apologize for the disruption.
Further Reading
Internal Documentation
- Monitoring and Alerting Strategy - SLI/SLO, error budgets, on-call practices
- Database Design - Replication, backup strategies, ACID guarantees
- Incident Post-Mortems - Blameless retrospectives, learning from incidents
- Spring Boot Resilience - Circuit breakers, retries, timeouts
External Resources
- AWS Disaster Recovery Whitepaper - Comprehensive DR strategies for AWS
- PostgreSQL High Availability - Replication, failover, backup strategies
- Google SRE Book - Disaster Recovery - Data integrity and disaster recovery best practices
- AWS Well-Architected Framework - Reliability Pillar - Backup, recovery, testing
- NIST Contingency Planning Guide - Comprehensive contingency planning framework
Summary
Disaster recovery is insurance - you pay the cost hoping you never need it, but when disaster strikes, it's invaluable.
Key principles:
- Define targets first: RTO and RPO drive all DR decisions and investments
- Test regularly: Untested DR plans will fail when needed
- Automate recovery: Manual procedures are slow and error-prone under pressure
- Document thoroughly: Runbooks must be step-by-step and tested
- Communicate clearly: Internal and external stakeholders need different information
- Learn from incidents: Every disaster improves your DR capability
Implementation roadmap:
- Define RTO/RPO for each service tier (critical, important, non-critical)
- Implement backups matching RPO requirements (daily, hourly, continuous)
- Test backup restores monthly (automated) and quarterly (full drill)
- Build runbooks for top 5 most likely disaster scenarios
- Simulate disasters annually with full team participation
- Iterate and improve based on tests and real incidents
Remember: The best disaster recovery plan is one that's never needed, but when it is, executes flawlessly because it's been tested and refined over time.