Chaos Engineering

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. Rather than waiting for failures to occur naturally, chaos engineering proactively introduces failures in a controlled manner to identify weaknesses before they cause incidents.

Platform Applicability

Chaos engineering applies to backend services and distributed systems. This guide focuses on backend implementation patterns.

Overview

Systems fail. Networks experience latency, services crash, databases become unavailable, and infrastructure degrades. Traditional testing validates that systems work under ideal conditions, but production is rarely ideal. Chaos engineering answers a critical question: "What happens when things go wrong?"

The practice emerged from Netflix's need to ensure service reliability despite frequent failures in cloud infrastructure. By deliberately injecting failures into production systems, they discovered and fixed issues before customers encountered them. This proactive approach to reliability has become essential for distributed systems.

Why chaos engineering matters:

Modern distributed systems are too complex to reason about completely. A typical microservices architecture might have dozens of services, each with multiple instances, running across multiple availability zones, with various caching layers, message queues, and databases. Predicting all failure modes through static analysis or testing is impossible. Chaos engineering provides empirical evidence of how your system actually behaves under failure conditions.

What chaos engineering is not:

  • Not random destruction or "breaking things in production for fun"
  • Not a substitute for testing or monitoring
  • Not about causing customer-facing incidents
  • Not about finding someone to blame when failures are discovered

Chaos engineering is a scientific, controlled approach to building confidence in system resilience.

Core Principles

Chaos engineering operates on a scientific method applied to system reliability. Understanding these principles guides how experiments are designed, executed, and evaluated. The five core principles form the foundation of effective chaos engineering practice.

Build a Hypothesis Around Steady State

Before introducing chaos, establish what "normal" looks like for your system. Steady state is defined by measurable outputs that indicate normal system behavior, not internal system attributes. Focus on metrics that represent user-facing behavior: response times, success rates, throughput, error rates.

Rather than hypothesizing "the database will stay online during network partition," frame it as "API response times will remain under 200ms and error rates below 0.1% even if the database connection experiences intermittent failures." This framing focuses on observable user impact rather than internal component state.

Why steady state matters:

Systems are complex and often have redundancy, failover mechanisms, and retry logic that masks internal failures from users. By focusing on steady state behavior, you measure what actually matters: whether users can complete their tasks successfully. A database might briefly fail, but if connection pooling, retries, and circuit breakers prevent user impact, the system remains in steady state.

Understanding steady state requires baseline metrics. Before running experiments, collect at least a week of production metrics to understand normal variation. Systems exhibit daily and weekly patterns - traffic spikes at certain times, batch jobs run overnight, deployments occur during change windows. Your steady state definition must account for this natural variation.
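
One way to encode that tolerance for natural variation is a simple baseline band. A minimal sketch, assuming a plain array of historical samples (the function names and the 20% default tolerance are illustrative, not from any specific monitoring tool):

```typescript
// Derive an acceptable band around the historical mean of a metric.
interface Baseline {
  mean: number;
  lower: number; // acceptable floor
  upper: number; // acceptable ceiling
}

function computeBaseline(samples: number[], tolerance = 0.2): Baseline {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  return { mean, lower: mean * (1 - tolerance), upper: mean * (1 + tolerance) };
}

function withinBaseline(value: number, baseline: Baseline): boolean {
  return value >= baseline.lower && value <= baseline.upper;
}
```

In practice you would compute one band per hour-of-week bucket, so that overnight batch windows and weekday traffic peaks each get their own baseline rather than a single global average.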

Defining steady state metrics:

public class SteadyStateMetrics {
    // User-facing metrics that define system health
    private double requestSuccessRate;   // Should be > 99.9%
    private Duration p95ResponseTime;    // Should be < 200ms
    private double throughputRps;        // Should be within 20% of baseline
    private double errorRate;            // Should be < 0.1%
    private double baselineThroughput;   // Rolling baseline used for comparison

    public boolean isInSteadyState() {
        return requestSuccessRate > 0.999
            && p95ResponseTime.toMillis() < 200
            && errorRate < 0.001
            && throughputRps > baselineThroughput * 0.8;
    }
}

This code demonstrates how to programmatically verify steady state. The metrics are objective, measurable, and focused on user experience rather than internal system state.

Vary Real-World Events

Chaos experiments should reflect failures that actually occur in production environments. Synthetic failures that never happen in practice provide little value. Focus on events that have occurred or are likely to occur based on the realities of distributed systems.

Common real-world events:

  • Network issues: Latency spikes (congestion, distance), packet loss (faulty hardware, wireless), network partitions (switch failures, routing issues), DNS failures (misconfiguration, resolver issues)
  • Resource exhaustion: CPU saturation (traffic spikes, inefficient code), memory leaks (unbounded caches, resource leaks), disk space depletion (log growth, data accumulation), connection pool exhaustion (slow queries, connection leaks)
  • Service failures: Dependent service unavailability (crashes, deployments), slow responses (database contention, CPU saturation), error responses (bugs, capacity limits), cascading failures (timeout propagation)
  • Infrastructure failures: Instance termination (hardware failure, autoscaling), availability zone outages (power, cooling, network), region failures (natural disasters, large-scale outages), deployment failures (configuration errors, version incompatibilities)

These events occur due to hardware failures, software bugs, misconfigurations, capacity limits, and operational errors. By systematically introducing these failures, you validate that your system handles them gracefully.

Example hypothesis:

"When the payment service experiences 500ms latency, checkout completion rates will remain above 95% because the frontend implements proper timeout handling and user feedback."

This hypothesis is specific, measurable, and based on a real-world failure mode (downstream service latency). The success criterion (95% completion rate) is tied to business impact, not technical metrics.

Run Experiments in Production

The most valuable chaos experiments run in production because that's where real traffic patterns, data volumes, and system interactions exist. Staging environments approximate production but never fully replicate it - different data distribution, lower traffic volume, simplified infrastructure, and different usage patterns all mean that failure modes discovered in staging may not reflect production behavior.

Why production matters:

Production has characteristics that cannot be replicated elsewhere:

  • Real data: Production data distributions often trigger edge cases that test data doesn't. A specific customer record might trigger a query that performs poorly, or certain transaction patterns might cause deadlocks.
  • Actual scale: Concurrency bugs, race conditions, and resource exhaustion often only appear at production traffic volumes.
  • True dependencies: Production dependencies behave differently than staging - caches have different hit rates, databases have different query patterns, external APIs have rate limits.
  • Real user behavior: Users do unexpected things. They retry failed operations, use features in unusual combinations, and have unpredictable access patterns.

However, running experiments in production requires careful risk management. Start with minimal blast radius (affecting only a small percentage of traffic or specific regions), comprehensive monitoring to detect unexpected impacts, and immediate rollback capability.

Risk mitigation strategies:

  1. Validate the experiment in staging first to catch obvious failures cheaply.
  2. Start in production at minimal blast radius: a small percentage of traffic or a single region.
  3. Monitor steady-state metrics continuously, with automated abort thresholds.
  4. Have an immediate rollback or kill switch ready before injecting anything.
  5. Expand scope only after the smaller experiment has run cleanly.

This workflow ensures that production experiments are conducted safely. Each step builds confidence before expanding scope.

Production experiments reveal issues that only appear under real conditions: specific data patterns that trigger bugs, race conditions that occur at production scale, or dependencies that behave differently under actual load. A database query might perform fine with 1,000 test records but become problematic with 10 million production records. These insights are invaluable.
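
The expand-only-while-healthy discipline described above can be sketched as a small guard that widens the blast radius only while observed metrics stay inside abort thresholds. A hedged illustration; the doubling policy and all names are assumptions, not a real tool's API:

```typescript
// Decide the next stage of a chaos rollout from the current stage's metrics.
interface ExperimentState {
  trafficPercent: number; // share of traffic currently receiving chaos
  errorRate: number;      // error rate observed during this stage
  maxErrorRate: number;   // abort threshold
}

function nextTrafficPercent(state: ExperimentState): number {
  if (state.errorRate > state.maxErrorRate) {
    return 0; // abort: stop injecting chaos entirely
  }
  // Expand cautiously: double the affected traffic, capped at 100%
  return Math.min(state.trafficPercent * 2, 100);
}
```

The key property is asymmetry: expansion is gradual, but the abort path drops straight to zero.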

Automate Experiments to Run Continuously

Manual chaos experiments provide value but don't scale. Continuous chaos engineering integrates experiments into regular operations - running automatically during deployments, on schedules, or continuously at low levels.

Why automation matters:

Systems evolve constantly. A system that handled database failures gracefully six months ago might not after a refactoring that changed connection pool configuration. New features introduce new dependencies. Performance optimizations change retry behavior. Without continuous validation, resilience degrades silently.

This integration treats chaos experiments like automated tests. Just as you wouldn't deploy code without running tests, mature chaos engineering practices don't deploy without validating resilience.

Continuous chaos implementation:

# .gitlab-ci.yml - Chaos testing in CI/CD
chaos-validation:
  stage: post-deploy
  image: chaostoolkit/chaostoolkit
  script:
    # Run lightweight chaos experiments against newly deployed service
    - chaos run experiments/latency-tolerance.json
    - chaos run experiments/error-handling.json
    - chaos run experiments/resource-limits.json

  # Block the pipeline if chaos experiments fail
  when: on_success
  allow_failure: false

  only:
    - staging
    - production

# Schedule continuous chaos in production
scheduled-chaos:
  stage: chaos
  image: chaostoolkit/chaostoolkit
  script:
    - chaos run experiments/production-resilience.json

  # Run daily during business hours
  only:
    - schedules

This configuration demonstrates two automation patterns: chaos validation after deployments (ensuring new code maintains resilience) and scheduled chaos experiments (detecting resilience degradation over time). Both patterns catch issues before they impact customers.

Minimize Blast Radius

Every chaos experiment carries risk of causing actual user impact. Minimizing blast radius limits this risk while still providing valuable learnings. The goal is to learn as much as possible while affecting as few users as possible.

Techniques to limit blast radius:

  • Traffic percentage: Inject failures affecting only 1-5% of requests initially
  • Geographic isolation: Run experiments in a single region or availability zone
  • User segmentation: Target internal users or beta testers first
  • Service isolation: Test failure in non-critical services before critical ones
  • Time-boxing: Run experiments for short durations (5-15 minutes) initially

As confidence in a specific experiment grows, gradually expand the blast radius. An experiment that successfully ran at 5% traffic for weeks might be safe to expand to 20%. However, never assume safety - always monitor metrics and maintain abort capabilities.

Implementation with feature flags:

// Control chaos experiment scope with feature flags
interface ChaosConfig {
  enabled: boolean;
  trafficPercentage: number;
  targetRegions: string[];
  targetUserSegments: string[];
  duration: number;
  abortThresholds: {
    errorRate: number;
    p95Latency: number;
  };
}

class ChaosController {
  constructor(private config: ChaosConfig) {}

  shouldInjectChaos(request: Request): boolean {
    // Don't inject if chaos is disabled
    if (!this.config.enabled) return false;

    // Check region targeting
    if (!this.config.targetRegions.includes(request.region)) {
      return false;
    }

    // Check user segment targeting
    if (!this.isTargetUser(request.user)) {
      return false;
    }

    // Apply traffic percentage
    return Math.random() < this.config.trafficPercentage / 100;
  }

  private isTargetUser(user: User): boolean {
    return this.config.targetUserSegments.some(segment =>
      user.segments.includes(segment)
    );
  }
}

This implementation provides fine-grained control over which requests experience chaos injection. By combining multiple limiting factors (region, user segment, percentage), you can target experiments precisely and minimize risk.

Failure Injection Techniques

Failure injection introduces specific types of failures into your system to observe behavior. Different failure types reveal different resilience characteristics. Understanding how to inject failures and what to observe is essential for effective chaos engineering.

Latency Injection

Latency injection adds delays to service responses or network calls. This simulates slow downstream services, network congestion, or overloaded systems. Latency is particularly insidious because systems often handle complete failures (timeouts, errors) better than slow responses.

Why latency is more dangerous than failures:

When a service is completely down, clients typically fail fast (connection refused) and can quickly retry or use fallbacks. When a service is slow, clients may wait, consuming resources (threads, connections) while waiting for responses. This can cascade - if Service A waits 30 seconds for Service B, and Service A receives 100 requests per second, it needs 3,000 concurrent threads to handle the load. Eventually, Service A exhausts its thread pool and starts failing.

This cascading resource exhaustion is often more damaging than the initial slow service. One slow database query can bring down an entire application tier if thread pools exhaust. Understanding how your system handles latency is critical for resilience.
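
The arithmetic in the Service A example is Little's Law: requests in flight equal arrival rate multiplied by wait time. A minimal sketch for estimating whether a thread pool survives a latency spike (function names are illustrative):

```typescript
// Little's Law: concurrent requests in flight = arrival rate * wait time.
// If this exceeds the pool size, requests queue up and the tier stalls.
function requiredConcurrency(requestsPerSecond: number, latencySeconds: number): number {
  return requestsPerSecond * latencySeconds;
}

function poolExhausted(rps: number, latencySec: number, poolSize: number): boolean {
  return requiredConcurrency(rps, latencySec) > poolSize;
}
```

At 100 requests per second, a 30-second wait demands 3,000 concurrent threads, while a 200ms response needs only 20: the same traffic, two orders of magnitude apart in resource cost.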

Spring Boot

Implement circuit breakers and timeouts using Resilience4j. See Spring Boot Resilience for patterns.

Implementation example with Spring Boot:

@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class ChaosLatencyFilter extends OncePerRequestFilter {

    @Value("${chaos.latency.enabled:false}")
    private boolean chaosEnabled;

    @Value("${chaos.latency.probability:0.0}")
    private double probability;

    @Value("${chaos.latency.duration:1000}")
    private int latencyMs;

    private final Random random = new Random();

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {

        if (chaosEnabled && random.nextDouble() < probability) {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        filterChain.doFilter(request, response);
    }
}

This filter runs at the highest precedence, injecting latency before request processing begins. The configuration uses probabilistic injection - affecting a configurable percentage of requests rather than all requests. This mimics real-world latency which is often intermittent.

Configuration:

# Enable latency injection for 10% of requests with 2-second delay
chaos:
  latency:
    enabled: true
    probability: 0.1
    duration: 2000

What to observe during latency injection:

  • Does your application implement proper timeouts? Check if requests eventually fail rather than waiting indefinitely.
  • Do timeout values cascade appropriately? Client timeout should be less than server timeout to prevent thread exhaustion.
  • Does the system gracefully degrade or completely fail? Some features might become unavailable while core functionality continues.
  • Are users shown meaningful feedback during slow operations? Loading spinners, progress indicators, or timeout messages improve UX.
  • Do connection pools get exhausted? Monitor active connections - if they grow unbounded, you have a resource leak.
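
The timeout-cascade point above can be made concrete by deriving each hop's timeout from the budget it inherits, reserving headroom so that inner calls give up before their callers do. A sketch under stated assumptions; the 50ms headroom and function name are illustrative, not a standard API:

```typescript
// Split an end-to-end latency budget across a chain of downstream calls.
// Each hop inherits the remaining budget minus a fixed handling overhead,
// so timeouts shrink as calls go deeper in the chain.
function hopTimeouts(totalBudgetMs: number, hops: number, overheadMs = 50): number[] {
  const timeouts: number[] = [];
  let remaining = totalBudgetMs;
  for (let i = 0; i < hops; i++) {
    remaining -= overheadMs; // reserve time for the caller to handle a timeout
    timeouts.push(remaining);
  }
  return timeouts;
}
```

With a 3-second budget over three hops this yields 2950ms, 2900ms, and 2850ms: every caller learns about a downstream timeout while it still has time to react.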

Protecting against latency with timeouts:

@Service
public class PaymentService {

    private final TimeLimiter timeLimiter = TimeLimiter.of(
        TimeLimiterConfig.custom()
            .timeoutDuration(Duration.ofSeconds(3))
            .build()
    );

    private final PaymentClient paymentClient;

    public PaymentService(PaymentClient paymentClient) {
        this.paymentClient = paymentClient;
    }

    public PaymentResult processPayment(Payment payment) throws Exception {
        // Blocks for at most 3 seconds, then throws a TimeoutException
        return timeLimiter.executeFutureSupplier(() ->
            CompletableFuture.supplyAsync(() ->
                paymentClient.charge(payment)
            )
        );
    }
}

The timeout prevents indefinite waiting when the payment service experiences latency. After 3 seconds, the operation fails with a timeout exception, allowing the caller to handle the failure gracefully (retry, fallback, or user notification). This prevents resource exhaustion from slow downstream services.

Error Injection

Error injection returns error responses (HTTP 500, 503, timeouts) from services or API calls. This validates that your application handles errors gracefully and doesn't propagate failures in unexpected ways.

Types of errors to inject:

  • HTTP errors: 500 Internal Server Error, 503 Service Unavailable, 429 Too Many Requests, 404 Not Found
  • Network errors: Connection refused, connection timeout, read timeout, host unreachable
  • Database errors: Connection failures, query timeouts, deadlocks, constraint violations
  • Exceptions: NullPointerException, IllegalStateException, IOException, RuntimeException

Each error type reveals different handling patterns. A 503 might trigger retry logic, while a 404 should not be retried. Database deadlocks should be retried with backoff, while constraint violations indicate data problems.
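
That per-error-type guidance can be captured in a small retry classifier. A sketch with hypothetical types; the mapping mirrors the examples in the paragraph above:

```typescript
// Classify failures into "safe to retry" vs "do not retry".
type Failure =
  | { kind: "http"; status: number }
  | { kind: "db"; error: "deadlock" | "constraint_violation" };

function shouldRetry(f: Failure): boolean {
  if (f.kind === "http") {
    // 503 and 429 are transient; 404 and other 4xx indicate caller errors
    return f.status === 503 || f.status === 429;
  }
  // Deadlocks are transient (retry with backoff);
  // constraint violations indicate data problems and must not be retried
  return f.error === "deadlock";
}
```

Keeping this decision in one place prevents the common bug of blanket retries that hammer a service with requests it can never satisfy.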

TypeScript example with a proxy layer:

interface ChaosConfig {
  enabled: boolean;
  errorRate: number;
  errorType: 'timeout' | 'server_error' | 'not_found';
}

class ChaosHttpClient {
  constructor(
    private baseClient: HttpClient,
    private config: ChaosConfig
  ) {}

  async get<T>(url: string): Promise<T> {
    if (this.config.enabled && Math.random() < this.config.errorRate) {
      return this.injectError();
    }

    return this.baseClient.get<T>(url);
  }

  private async injectError(): Promise<never> {
    switch (this.config.errorType) {
      case 'timeout':
        // Simulate timeout by never resolving
        return new Promise(() => {});

      case 'server_error':
        throw new HttpError(500, 'Internal Server Error (Chaos)');

      case 'not_found':
        throw new HttpError(404, 'Not Found (Chaos)');

      default:
        throw new Error('Unknown error type');
    }
  }
}

This wraps the standard HTTP client, transparently injecting errors based on configuration. The timeout simulation is particularly interesting - by returning a promise that never resolves, it tests whether the caller has proper timeout handling.

Using error injection with circuit breakers:

const circuitBreaker = new CircuitBreaker({
  failureThreshold: 5,
  timeout: 3000,
  resetTimeout: 30000,
});

const chaosClient = new ChaosHttpClient(httpClient, {
  enabled: true,
  errorRate: 0.15,
  errorType: 'server_error',
});

async function fetchUserData(userId: string): Promise<User | null> {
  try {
    return await circuitBreaker.execute(() =>
      chaosClient.get<User>(`/api/users/${userId}`)
    );
  } catch (error) {
    if (error instanceof CircuitBreakerOpenError) {
      // Circuit breaker is open, return cached data or default
      return getCachedUser(userId);
    }

    // Log error and return null (graceful degradation)
    logger.error('Failed to fetch user data', { userId, error });
    return null;
  }
}

The circuit breaker tracks failures. After 5 consecutive failures, it opens the circuit and fails fast without calling the downstream service. This prevents cascading failures. The timeout (3 seconds) ensures that slow responses don't consume resources. The combination of chaos injection and circuit breaker demonstrates resilience in action.

What to observe during error injection:

  • Are errors caught and handled appropriately? Uncaught exceptions can crash applications.
  • Do error messages expose sensitive information? Stack traces in responses can leak implementation details.
  • Does the UI show user-friendly error messages? Users shouldn't see "NullPointerException" or technical jargon.
  • Are retries implemented with exponential backoff? Immediate retries can overwhelm recovering services.
  • Do circuit breakers open after repeated failures? This prevents cascading failures.
  • Are fallback mechanisms (caching, default values) working? Graceful degradation maintains user experience.
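
The exponential-backoff point in the checklist above can be sketched as a delay calculator with full jitter; the parameter defaults are illustrative:

```typescript
// Exponential backoff with full jitter: the delay grows as 2^attempt,
// is capped, and is then randomized to spread retries from many clients apart.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 10000): number {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exponential;
}
```

The cap keeps late retries from waiting absurdly long, and the jitter prevents the "thundering herd" where every client retries at the same instant against a service that is still recovering.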

Service Unavailability

Complete service unavailability simulates scenarios where a dependency is completely down - database offline, payment gateway unreachable, authentication service crashed.

This differs from error injection because the service doesn't respond at all (connection refused) rather than responding with an error. Applications must handle this differently: connection errors typically trigger immediate retry or failover, whereas error responses might indicate a bug that shouldn't be retried immediately.

Simulating unavailability with network policies (Kubernetes):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-isolate-payment-service
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
    - Egress
  ingress: []  # Block all ingress traffic
  egress: []   # Block all egress traffic

Applying this network policy makes the payment service unreachable from other services. Network connections will fail with "connection refused" or timeout errors. This simulates complete service unavailability - the most severe type of failure.

Application handling with graceful degradation:

@Service
public class OrderService {

    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

    private final PaymentServiceClient paymentClient;
    private final CircuitBreaker circuitBreaker;
    private final OrderRepository orderRepository;

    public OrderService(PaymentServiceClient paymentClient,
                        CircuitBreaker circuitBreaker,
                        OrderRepository orderRepository) {
        this.paymentClient = paymentClient;
        this.circuitBreaker = circuitBreaker;
        this.orderRepository = orderRepository;
    }

    public Order createOrder(OrderRequest request) {
        Order order = buildOrder(request);

        try {
            // Attempt payment processing
            PaymentResult result = circuitBreaker.executeSupplier(() ->
                paymentClient.processPayment(order.getPaymentDetails())
            );

            order.setPaymentStatus(PaymentStatus.COMPLETED);
            order.setPaymentId(result.getPaymentId());

        } catch (CallNotPermittedException e) {
            // Circuit breaker is open - payment service is down
            order.setPaymentStatus(PaymentStatus.PENDING);
            queuePaymentForRetry(order);

            logger.warn("Payment service unavailable, queued for retry: orderId={}",
                order.getId());

        } catch (Exception e) {
            // Unexpected error during payment
            order.setPaymentStatus(PaymentStatus.FAILED);

            logger.error("Payment processing failed: orderId={}", order.getId(), e);
        }

        return orderRepository.save(order);
    }
}

The application gracefully degrades: orders are created with PENDING payment status and queued for later processing rather than completely failing the order creation. This maintains core functionality (order placement) even when payment processing is unavailable. When the payment service recovers, a background job processes pending payments.
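
The background drain pass described above can be sketched as follows; a minimal TypeScript illustration in which all types and names are hypothetical:

```typescript
// Reprocess orders left in PENDING state once the payment dependency recovers.
type PaymentStatus = "PENDING" | "COMPLETED" | "FAILED";

interface Order {
  id: string;
  status: PaymentStatus;
}

function drainPendingPayments(
  orders: Order[],
  charge: (order: Order) => boolean, // true if the payment service accepted it
): Order[] {
  return orders.map(order => {
    if (order.status !== "PENDING") return order;
    // Orders that still fail remain PENDING and are retried on the next pass
    return { ...order, status: charge(order) ? "COMPLETED" : "PENDING" };
  });
}
```

A real job would also bound retry age and alert on orders that stay pending too long, so deferred work cannot accumulate silently.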

What to observe during unavailability:

  • Does the application fail fast or hang waiting for unavailable services? Connection timeouts should be aggressive (1-3 seconds).
  • Are there proper timeout configurations at all layers? Application code, connection pools, load balancers, and proxies all need timeouts.
  • Do asynchronous processing mechanisms (queues, scheduled jobs) handle deferred operations? Some operations can be retried later.
  • Is the user experience acceptable when services are unavailable? Users should understand what happened and what to expect.

Resource Exhaustion

Resource exhaustion simulates running out of critical resources: CPU saturation, memory depletion, disk space, file descriptors, or database connections. These failures often occur gradually in production as traffic grows or resource leaks accumulate.

CPU exhaustion simulation:

// Chaos agent that consumes CPU
class CpuChaos {
private isRunning = false;

start(durationMs: number) {
this.isRunning = true;
const endTime = Date.now() + durationMs;

// Spin in tight loop consuming CPU
while (Date.now() < endTime && this.isRunning) {
// Intensive computation
Math.sqrt(Math.random());
}
}

stop() {
this.isRunning = false;
}
}

While this demonstrates the concept, production implementations should use tools like stress-ng or container resource limits:

# docker-compose.yml with CPU limits
services:
  api:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '0.5'   # Limit to 50% of one CPU core
          memory: 512M

Restricting CPU and memory simulates resource constraints. This reveals how applications behave under resource pressure - do they gracefully degrade, or do they become unresponsive?

Memory exhaustion simulation:

// Memory leak simulation for chaos testing
@RestController
@RequestMapping("/chaos")
public class MemoryChaosController {

private final List<byte[]> memoryHog = new ArrayList<>();

@PostMapping("/memory-leak")
public ResponseEntity<String> triggerMemoryLeak(@RequestParam int sizeMb) {
// Allocate memory that won't be garbage collected
byte[] allocation = new byte[sizeMb * 1024 * 1024];
memoryHog.add(allocation);

return ResponseEntity.ok("Allocated " + sizeMb + "MB");
}

@PostMapping("/clear-memory")
public ResponseEntity<String> clearMemory() {
memoryHog.clear();
System.gc();
return ResponseEntity.ok("Memory cleared");
}
}

This endpoint intentionally creates a memory leak by holding references to large byte arrays. In production, memory leaks occur accidentally through unbounded caches, event listeners that aren't cleaned up, or thread-local variables that aren't cleared. This simulation helps validate that monitoring detects memory issues and that applications handle OutOfMemoryError gracefully.

Connection pool exhaustion:

@Configuration
public class DatabaseConfig {

    @Bean
    public HikariConfig hikariConfig() {
        HikariConfig config = new HikariConfig();

        // Configure connection pool
        config.setMaximumPoolSize(10);           // Small pool to test exhaustion
        config.setConnectionTimeout(3000);       // Fast failure
        config.setLeakDetectionThreshold(60000); // Detect leaked connections

        return config;
    }
}

Setting a small pool size makes it easy to exhaust connections during testing. The leak detection threshold logs warnings when connections are held for more than 60 seconds, helping identify connection leaks. The connection timeout ensures that when the pool is exhausted, callers fail fast rather than waiting indefinitely.

What to observe during resource exhaustion:

  • Does the application handle resource limits gracefully? It should fail specific operations rather than crashing entirely.
  • Are there proper monitoring alerts for resource usage? Alerts should fire before resources are completely exhausted.
  • Do resource limits cause cascading failures? One exhausted thread pool shouldn't bring down unrelated features.
  • Are there memory leaks or connection leaks that become apparent under stress? Production workloads often expose leaks that don't appear in testing.
  • Does autoscaling respond appropriately to resource pressure? Infrastructure should scale before resources exhaust.

Chaos Engineering Tools

Multiple tools exist for implementing chaos engineering, ranging from simple failure injection libraries to comprehensive chaos platforms. Choosing the right tool depends on your infrastructure, team expertise, and chaos maturity.

Chaos Monkey

Chaos Monkey, developed by Netflix, randomly terminates instances in production to ensure that services can tolerate instance failures. The philosophy is that if your architecture can't handle random instance termination, it's not resilient enough for production.

How it works:

Chaos Monkey runs on a schedule (e.g., weekdays during business hours) and randomly selects instances from specified Auto Scaling Groups (ASGs) to terminate. Services must automatically recover through load balancer health checks removing failed instances, Auto Scaling launching replacement instances, and service discovery updating available instance lists.

# Chaos Monkey configuration
chaos:
monkey:
enabled: true
schedule: "0 9-17 * * MON-FRI" # 9 AM - 5 PM weekdays
leashed: false # false = actually terminate instances

termination:
probability: 1.0 # Probability of terminating when scheduled
max-per-asg: 1 # Maximum instances to terminate per ASG

asgs:
- name: api-production-asg
enabled: true
- name: worker-production-asg
enabled: true

Running Chaos Monkey during business hours ensures that engineers are available to respond if terminations cause unexpected issues. The "leashed" flag controls whether terminations actually occur or are just simulated - useful for initial testing.

Implementing Chaos Monkey-like behavior in Kubernetes:

For Kubernetes environments, use tools like kube-monkey or chaos-mesh:

# kube-monkey configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config
data:
  config.toml: |
    [kubemonkey]
    run_hour = 10
    start_hour = 9
    end_hour = 17
    blacklisted_namespaces = ["kube-system", "kube-public"]
    whitelisted_namespaces = ["production"]

Mark deployments for chaos testing:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  labels:
    kube-monkey/enabled: "enabled"
    kube-monkey/mtbf: "3"        # Mean time between failures (days)
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"  # Kill 1 pod
spec:
  replicas: 3
  template:
    # ... pod spec

The MTBF (mean time between failures) controls how frequently pods are killed on average. A value of 3 means pods are killed approximately every 3 days. With multiple pods, you ensure that some instances are always available.

Gremlin

Gremlin is a commercial chaos engineering platform providing a UI and API for running controlled chaos experiments across infrastructure, networks, and applications.

Key features:

  • Resource attacks: CPU, memory, disk, I/O exhaustion
  • State attacks: Shutdown, reboot, process kill, time travel (clock skew)
  • Network attacks: Latency, packet loss, DNS failures, blackhole
  • Blast radius control: Target specific hosts, containers, or percentages
  • Safety controls: Halt conditions, scheduling, approvals

Example Gremlin attack via API:

# Inject 500ms latency to 25% of requests to payment service
curl -X POST https://api.gremlin.com/v1/attacks/new \
-H "Authorization: Bearer $GREMLIN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"target": {
"type": "Random",
"percent": 25,
"tags": {
"service": "payment-service",
"env": "production"
}
},
"command": {
"type": "latency",
"args": ["-l", "500", "-p", "^80$"]
},
"length": 600
}'

This targets 25% of instances tagged with service=payment-service and env=production, injecting 500ms latency on port 80 for 600 seconds (10 minutes). The API-driven approach enables integration with CI/CD pipelines.

Integration in CI/CD:

# .gitlab-ci.yml
chaos-test:
  stage: test
  image: gremlin/gremlin
  script:
    - gremlin init --api-key $GREMLIN_API_KEY
    - |
      # Run latency attack for 5 minutes
      gremlin attack network-latency \
        --tag service=api-staging \
        --delay 200 \
        --length 300
    - |
      # Monitor application metrics during attack
      ./scripts/check-metrics.sh

  only:
    - staging

The pipeline runs chaos experiments in staging before production deployments, ensuring that new code maintains resilience.

Chaos Toolkit

Chaos Toolkit is an open-source framework for declaring and running chaos experiments in a standardized format. Experiments are defined in JSON or YAML and can be version-controlled alongside application code.

Experiment definition:

{
  "title": "Payment service handles database failures gracefully",
  "description": "Verify that payment processing degrades gracefully when database becomes unavailable",
  "steady-state-hypothesis": {
    "title": "Application is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "api-is-responding",
        "tolerance": {
          "type": "http",
          "status": 200
        },
        "provider": {
          "type": "http",
          "url": "https://api.example.com/health"
        }
      },
      {
        "type": "probe",
        "name": "payment-success-rate-is-high",
        "tolerance": {
          "type": "range",
          "range": [0.95, 1.0]
        },
        "provider": {
          "type": "python",
          "module": "metrics",
          "func": "get_payment_success_rate",
          "arguments": {"minutes": 5}
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-database-connection",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "stop_instance",
        "arguments": {
          "instance_id": "prod-payment-db"
        }
      }
    },
    {
      "type": "probe",
      "name": "wait-for-recovery",
      "provider": {
        "type": "process",
        "path": "sleep",
        "arguments": "60"
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restart-database",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "start_instance",
        "arguments": {
          "instance_id": "prod-payment-db"
        }
      }
    }
  ]
}

The experiment structure is declarative: define steady state, specify actions to take, and define rollback procedures. Chaos Toolkit verifies steady state before and after the experiment, ensuring that the system recovers.

Running the experiment:

chaos run experiment.json

What this experiment does:

  1. Verifies steady state (API healthy, payment success rate > 95%)
  2. Stops the database instance
  3. Waits 60 seconds
  4. Verifies steady state again (should still be healthy due to queueing, caching, or graceful degradation)
  5. Restarts the database (rollback)

The declarative format makes experiments reproducible and version-controllable. You can track changes to experiments over time, review them in pull requests, and run them automatically in CI/CD.
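To make the tolerance semantics concrete, here is a small illustrative Java sketch of how a range tolerance like `[0.95, 1.0]` gates an experiment. This is a hypothetical re-implementation for explanation only; Chaos Toolkit itself is written in Python and its actual logic differs.

```java
// Hypothetical sketch of a Chaos Toolkit-style range tolerance check.
// Only the gating logic is illustrated here.
public class SteadyStateCheck {

    /** Returns true if the observed value falls inside the tolerated range. */
    public static boolean withinRange(double value, double low, double high) {
        return value >= low && value <= high;
    }

    public static void main(String[] args) {
        double successRate = 0.97; // e.g. the payment success rate probe result
        if (withinRange(successRate, 0.95, 1.0)) {
            System.out.println("steady state holds, experiment may proceed");
        } else {
            System.out.println("steady state violated, abort before injecting failure");
        }
    }
}
```

The check runs both before and after the method: if the hypothesis already fails before injection, there is no point injecting failure, because the system is unhealthy to begin with.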

LitmusChaos

LitmusChaos is a Kubernetes-native chaos engineering platform. Chaos experiments are defined as Kubernetes Custom Resources, making them declarative and version-controlled.

Pod delete experiment:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-service-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment

  chaosServiceAccount: litmus-admin

  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"

            - name: CHAOS_INTERVAL
              value: "10"

            - name: FORCE
              value: "false"

            - name: PODS_AFFECTED_PERC
              value: "25" # Kill 25% of pods
This defines a chaos experiment that deletes 25% of payment service pods every 10 seconds for 60 seconds total. The experiment runs as a Kubernetes job, making it fully integrated with Kubernetes workflows.
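The percentage-based selection that PODS_AFFECTED_PERC drives can be sketched as follows. This is an illustrative Java sketch of the selection idea, not LitmusChaos's actual implementation, which may round and sample differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: pick roughly N% of pods as chaos victims.
public class PodSelector {

    /** Selects about `percent` of the given pods, always at least one. */
    public static List<String> selectVictims(List<String> pods, int percent) {
        int count = Math.max(1, (int) Math.round(pods.size() * percent / 100.0));
        List<String> shuffled = new ArrayList<>(pods);
        Collections.shuffle(shuffled); // random victims each interval
        return shuffled.subList(0, count);
    }

    public static void main(String[] args) {
        List<String> pods = List.of("payment-0", "payment-1", "payment-2", "payment-3");
        // 25% of 4 pods -> 1 victim per chaos interval
        System.out.println(selectVictims(pods, 25));
    }
}
```

Randomizing victims each interval matters: it exercises different replicas and avoids repeatedly killing the same pod while the others never see failure.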

Network chaos experiment:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-network-chaos
spec:
  appinfo:
    appns: production
    applabel: app=api
    appkind: deployment

  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: "2000" # 2 second latency

            - name: TARGET_CONTAINER
              value: "api-container"

            - name: TOTAL_CHAOS_DURATION
              value: "300" # 5 minutes

Apply the ChaosEngine to start the experiment:

kubectl apply -f chaos-experiment.yaml

Monitor experiment progress:

kubectl describe chaosengine payment-service-chaos -n production

LitmusChaos integrates naturally with Kubernetes workflows. Chaos experiments can be triggered by deployments, run on schedules via CronJobs, or executed manually. Results are stored as Kubernetes resources, making them queryable and auditable.

AWS Fault Injection Simulator

AWS Fault Injection Simulator (FIS) is a managed service for running fault injection experiments on AWS infrastructure. It provides safe, controlled chaos experiments with built-in safety mechanisms.

Experiment template for AZ failure:

{
  "description": "Simulate availability zone failure",
  "targets": {
    "myInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "production",
        "Service": "api"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "myInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789:alarm:HighErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISExperimentRole"
}

Start experiment via AWS CLI:

aws fis start-experiment \
  --experiment-template-id EXT123456 \
  --tags Service=api,Experiment=az-failure

The experiment stops every EC2 instance in us-east-1a that matches the target tags (Environment=production, Service=api). The stop condition (a CloudWatch alarm on error rate) automatically halts the experiment if error rates exceed acceptable thresholds. This safety mechanism prevents runaway chaos experiments from causing extended outages.

Game Days and Fire Drills

Game days are scheduled events where teams run chaos experiments and simulate incident response. They build confidence, identify gaps in runbooks, and train team members in incident handling. Unlike automated chaos experiments that run continuously, game days are deliberate learning exercises.

Planning a Game Day

1. Define objectives:

  • Test specific failure scenarios (database failure, region outage, deployment rollback)
  • Validate incident response procedures (communication, escalation, decision-making)
  • Train new team members on incident handling
  • Identify gaps in monitoring and alerting
  • Practice using disaster recovery procedures

Clear objectives keep game days focused. "Test database failover" is specific; "test resilience" is too vague.

2. Scope the experiment:

  • Choose services to target (start with non-critical services)
  • Define blast radius (staging vs. production, percentage of traffic)
  • Set time limits (1-2 hours typically)
  • Prepare rollback procedures (detailed step-by-step)
  • Identify success criteria (what are we hoping to learn?)

3. Communicate:

  • Notify all stakeholders (engineering, product, customer support) at least 1 week in advance
  • Schedule during business hours with team availability
  • Prepare incident communication templates
  • Set up dedicated Slack channel or war room for real-time coordination
  • Brief customer support on what to expect and how to respond to questions

Communication prevents confusion. Without it, game days can be mistaken for real incidents.

4. Prepare monitoring:

  • Ensure all dashboards are accessible (share links in advance)
  • Set up real-time metric monitoring (error rates, latency, throughput)
  • Configure alerts (but mark them as game day to avoid on-call escalation)
  • Prepare log queries for investigation
  • Set up recording (screen capture, metrics) for post-game review

Example game day schedule:

Game Day: Database Failover Test
Duration: 2 hours
Team: Backend engineers, SREs, DBA
Participants: 8 engineers + 1 observer

09:00 - Kickoff (15 minutes)
- Review objectives and success criteria
- Confirm all participants and roles
- Verify monitoring and dashboards
- Review communication procedures

09:15 - Baseline metrics (15 minutes)
- Record current system state
- Verify steady state hypothesis
- Take metric snapshots for comparison

09:30 - Inject failure (5 minutes)
- Game master stops primary database instance
- Observer starts timeline documentation

09:35 - Monitor and respond (25 minutes)
- Track application metrics (error rate, latency)
- Identify impact on user experience
- Execute incident response procedures
- Practice incident communication

10:00 - Restore service (30 minutes)
- Verify failover to secondary complete
- Restore primary for failback test
- Validate data consistency

10:30 - Failback test (30 minutes)
- Switch back to primary database
- Verify zero data loss
- Confirm all services healthy

11:00 - Retrospective (1 hour)
- What went well?
- What surprised us?
- What needs improvement?
- Action items and owners
- Schedule follow-up

The schedule provides structure while allowing flexibility. Timing may shift based on what's discovered, but having a plan keeps the event focused.

Running the Game Day

Roles:

  • Game Master: Orchestrates the event, injects failures, ensures safety, can abort at any time
  • Incident Commander: Leads incident response, coordinates team, makes decisions
  • Engineers: Investigate issues, implement fixes, execute runbooks
  • Observer: Takes notes, documents timeline, captures screenshots, doesn't participate in response
  • Communications: Updates stakeholders, prepares status updates, monitors customer support channels

Clearly defined roles prevent confusion. The observer role is critical - someone focused solely on documentation captures details that responders miss.

During the event:

The game master controls the experiment, but the incident commander leads the response, coordinating communication and decision-making just as in a real incident.

Capture observations:

  • Timestamp when failure was injected
  • Time to detect the issue (first alert)
  • Time to identify root cause
  • Time to begin remediation
  • Time to full recovery
  • User impact metrics (error rate, affected users, revenue impact)
  • Gaps identified (missing alerts, unclear runbooks, insufficient permissions, missing documentation)
  • Surprising behaviors (unexpected failures, cascading issues, successful mitigations)

The observer documents everything in real-time. This documentation is invaluable for the retrospective and for improving incident response procedures.
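The timing observations above reduce to simple deltas between recorded timestamps. A minimal Java sketch, using times that mirror the example schedule (class and variable names here are hypothetical):

```java
import java.time.Duration;
import java.time.LocalTime;

// Computes the core game day timing metrics from observed timestamps.
public class GameDayTimeline {

    public static long minutesBetween(LocalTime from, LocalTime to) {
        return Duration.between(from, to).toMinutes();
    }

    public static void main(String[] args) {
        LocalTime injected  = LocalTime.of(9, 30); // game master injects failure
        LocalTime detected  = LocalTime.of(9, 32); // first alert fires
        LocalTime recovered = LocalTime.of(9, 45); // metrics back to baseline

        System.out.println("time to detect:  " + minutesBetween(injected, detected) + " min");
        System.out.println("time to recover: " + minutesBetween(injected, recovered) + " min");
    }
}
```

Recording exact timestamps during the event lets the retrospective report objective numbers (detection time, time to recovery) rather than impressions.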

Post-Game Day Actions

The true value of game days comes from actionable improvements identified during the exercise. Without follow-through, game days are just expensive tests.

Post-mortem structure:

# Game Day Post-Mortem: Database Failover

## Date: 2024-01-15
## Duration: 2 hours
## Participants: 8 engineers

## Objective
Test automatic database failover and validate incident response procedures.

## What Happened
- 09:30: Primary database instance stopped
- 09:31: Application error rate increased to 45%
- 09:32: Alerts fired (database connection errors)
- 09:33: Team identified database issue via logs
- 09:35: Initiated manual failover process
- 09:40: Failover completed, error rate decreased
- 09:45: Error rate returned to baseline (< 0.1%)

## What Went Well
- Alerts fired within expected timeframe (2 minutes)
- Team quickly identified root cause (logs clearly showed database connectivity issues)
- Failover procedure worked as documented
- No data loss occurred
- Communication was clear and timely

## What Could Be Improved
- Failover took 10 minutes (target: 5 minutes) - manual steps slowed response
- Customer support wasn't notified proactively - they learned from customers
- Dashboard didn't clearly show database status - required checking multiple sources
- Runbook had outdated commands - required improvisation
- One engineer lacked necessary permissions - delayed response

## Action Items
1. [Owner: DBA Team] Implement automatic database failover - Due: Feb 1
2. [Owner: Platform] Add database status to main dashboard - Due: Jan 22
3. [Owner: SRE] Update runbook with current commands and screenshots - Due: Jan 20
4. [Owner: IC] Create communication template for database incidents - Due: Jan 18
5. [Owner: All] Audit permissions for incident responders - Due: Jan 25
6. [Owner: All] Schedule follow-up game day to test automatic failover - Due: Mar 1

## Metrics
- Detection time: 2 minutes
- Root cause identification: 3 minutes
- Time to remediation: 5 minutes
- Full recovery: 15 minutes
- Error rate peak: 45%
- User impact: ~500 users experienced errors

Track action items like any other engineering work - assign owners, set deadlines, and review progress in team meetings. Without accountability, action items become aspirational rather than actual improvements.

Observability Requirements

Chaos experiments are only valuable if you can observe their impact. Comprehensive observability is a prerequisite for chaos engineering. Without it, you're flying blind - unable to tell if experiments succeed or cause damage.

See Logging, Metrics, and Tracing for detailed observability implementation.

Metrics to Monitor

Application metrics:

  • Request rate (requests per second) - sudden drops indicate failures
  • Error rate (percentage of failed requests) - primary chaos experiment indicator
  • Response time (p50, p95, p99 percentiles) - p99 reveals tail latency
  • Availability (percentage of successful requests) - overall health indicator

Infrastructure metrics:

  • CPU utilization - spikes indicate resource pressure
  • Memory usage - growth indicates leaks
  • Disk I/O - saturation causes performance degradation
  • Network throughput - congestion affects distributed systems
  • Database connection pool usage - exhaustion causes failures

Business metrics:

  • Transaction completion rate - revenue impact of failures
  • User signups/conversions - impact on business objectives
  • Feature usage - which features degrade during failures
  • Customer support ticket volume - user-facing impact
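Two of these metrics, error rate and p95 latency, can be computed directly from raw samples. A simplified Java sketch (real systems derive these from a metrics backend such as Prometheus; the nearest-rank percentile below is one of several common definitions):

```java
import java.util.Arrays;

// Simplified error-rate and nearest-rank p95 calculations over raw samples.
public class MetricsSketch {

    public static double errorRate(int failed, int total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    /** Nearest-rank p95: sort samples, take the value at ceil(0.95 * n). */
    public static double p95(double[] latenciesMs) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.95 * sorted.length) - 1;
        return sorted[rank];
    }

    public static void main(String[] args) {
        System.out.println("error rate: " + errorRate(3, 100));
        double[] samples = {10, 12, 15, 14, 11, 13, 500, 12, 11, 10};
        // A single slow outlier dominates the tail, which averages would hide
        System.out.println("p95: " + p95(samples) + " ms");
    }
}
```

This is why the metric lists emphasize percentiles over averages: one slow request in ten is invisible in the mean but obvious at p95.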

A chaos dashboard should group these three metric categories side by side, providing immediate visibility into experiment impact across technical and business dimensions. During chaos experiments, watch all three categories - a successful experiment maintains business metrics despite infrastructure failures.

Distributed Tracing

Distributed tracing shows how a request flows through multiple services, which is invaluable for understanding how failures propagate. When a chaos experiment causes unexpected impact, traces reveal exactly where failures occurred and how they cascaded.

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    @Autowired
    private Tracer tracer;

    public Order createOrder(OrderRequest request) {
        Span span = tracer.spanBuilder("createOrder").startSpan();

        try (Scope scope = span.makeCurrent()) {
            span.setAttribute("order.id", request.getOrderId());
            span.setAttribute("user.id", request.getUserId());

            // Process order...
            Order order = processOrder(request);

            span.setStatus(StatusCode.OK);
            return order;

        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Order creation failed");
            throw e;

        } finally {
            span.end();
        }
    }
}

During chaos experiments, traces reveal which service actually failed, how long requests waited before timing out, which fallback paths were taken, and where retries occurred. This visibility is essential for understanding complex failure modes in distributed systems.

Alerting Configuration

Configure alerts to detect when chaos experiments cause unexpected impact. These alerts act as automated abort conditions.

# Prometheus alerting rules for chaos experiments
groups:
  - name: chaos-experiment-alerts
    interval: 10s
    rules:
      - alert: HighErrorRateDuringChaos
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1m])
            / rate(http_requests_total[1m])
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeded 5% during chaos experiment"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: SlowResponseTimeDuringChaos
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[1m])
          ) > 2.0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 response time exceeded 2 seconds"
          description: "P95 latency is {{ $value | humanizeDuration }}"
These alerts fire when error rates exceed 5% or P95 latency exceeds 2 seconds. During chaos experiments, these thresholds act as abort conditions - if metrics breach them, immediately stop the experiment and investigate. The 2-minute duration prevents alerts from firing on transient spikes.
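The `for: 2m` clause means the condition must hold for the entire window before the alert fires. A simplified Java model of that semantics (Prometheus tracks pending alerts differently internally, but the observable effect is the same):

```java
// Simplified model of Prometheus's `for:` duration: the alert fires only if
// every evaluation in the window breaches the threshold.
public class SustainedThreshold {

    public static boolean shouldFire(double[] errorRates, double threshold) {
        for (double rate : errorRates) {
            if (rate <= threshold) {
                return false; // one healthy evaluation resets the pending alert
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] transientSpike = {0.08, 0.02, 0.01}; // recovers quickly
        double[] sustained      = {0.08, 0.07, 0.09}; // breached throughout
        System.out.println("transient spike fires: " + shouldFire(transientSpike, 0.05));
        System.out.println("sustained breach fires: " + shouldFire(sustained, 0.05));
    }
}
```

This is the trade-off behind the 2-minute window: a longer `for:` duration suppresses noise from transient spikes but delays the abort signal by the same amount.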

Rollback and Recovery Procedures

Every chaos experiment must have a clearly defined rollback procedure. Never inject failure without knowing how to restore normal operation. Rollback procedures are your safety net - they prevent chaos experiments from becoming chaos incidents.

Automated Rollback

Implement automated rollback when metrics exceed thresholds:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class ChaosOrchestrator {

    private static final Logger logger = LoggerFactory.getLogger(ChaosOrchestrator.class);

    @Autowired
    private MetricsService metricsService;

    @Autowired
    private ChaosInjector chaosInjector;

    @Autowired
    private NotificationService notificationService;

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    public void startExperiment(ChaosExperiment experiment) {
        // Start chaos injection
        chaosInjector.inject(experiment);

        // Monitor metrics every 10 seconds
        ScheduledFuture<?> monitor = scheduler.scheduleAtFixedRate(() -> {
            MetricSnapshot metrics = metricsService.getCurrentMetrics();

            if (metrics.getErrorRate() > experiment.getAbortThreshold()) {
                logger.error("Error rate exceeded threshold, aborting experiment");
                abortExperiment(experiment);
            } else if (metrics.getP95Latency() > experiment.getLatencyThreshold()) {
                logger.error("Latency exceeded threshold, aborting experiment");
                abortExperiment(experiment);
            }
        }, 0, 10, TimeUnit.SECONDS);

        // Automatically stop after the configured duration
        scheduler.schedule(() -> {
            stopExperiment(experiment);
            monitor.cancel(false);
        }, experiment.getDurationMinutes(), TimeUnit.MINUTES);
    }

    private void stopExperiment(ChaosExperiment experiment) {
        chaosInjector.stop(experiment);
        logger.info("Chaos experiment completed normally");
    }

    private void abortExperiment(ChaosExperiment experiment) {
        chaosInjector.stop(experiment);
        notificationService.alert("Chaos experiment aborted due to threshold breach");
    }
}

This orchestrator monitors metrics continuously and automatically aborts experiments when thresholds are breached. The automatic time limit ensures experiments don't run indefinitely if monitoring fails.

Manual Recovery

For experiments that can't be automatically rolled back, maintain clear runbooks:

# Rollback Procedure: Database Chaos Experiment

## Stop Criteria
- Error rate > 5%
- P95 latency > 3 seconds
- User complaints received
- Duration exceeded 15 minutes
- Any unexpected behavior

## Rollback Steps

### 1. Stop Chaos Injection
```bash
kubectl delete chaosengine database-chaos -n production
```

### 2. Verify Chaos Stopped
```bash
kubectl get pods -n production | grep chaos
# Should show no chaos pods
```

### 3. Restart Affected Services
```bash
kubectl rollout restart deployment/api-service -n production
kubectl rollout restart deployment/worker-service -n production
```

### 4. Monitor Recovery
- Watch error rate and latency dashboards until metrics return to baseline

### 5. Notify Stakeholders
- Post in #incidents Slack channel
- Update status page if customer-facing
- Notify customer support team

### 6. Escalation
If recovery doesn't complete within 5 minutes:
- Page on-call SRE: @sre-oncall
- Initiate incident response procedure
- Document timeline for post-mortem

Clear, step-by-step procedures prevent panic during rollback. Include exact commands, expected outputs, and escalation paths.

### Testing Recovery Procedures

Recovery procedures are only valuable if they work. Periodically test recovery without actually causing an incident:

```bash
# Dry-run rollback script
./rollback-chaos.sh --dry-run --experiment database-chaos

# Output:
# [DRY RUN] Would delete chaosengine: database-chaos
# [DRY RUN] Would restart deployment: api-service
# [DRY RUN] Would restart deployment: worker-service
# [DRY RUN] Would post notification to Slack
```

This validates that scripts work, credentials are correct, and team members know how to execute procedures. Run dry-run tests quarterly to ensure procedures stay current as infrastructure evolves.

Anti-Patterns

Running Chaos Without Observability

Running chaos experiments without comprehensive observability is like turning off the lights before walking through an unfamiliar room. You might reach the other side, but you'll bump into things along the way and won't learn anything about the room's layout.

Why it's problematic: Without metrics, logs, and traces, you can't tell if experiments succeed or cause damage. You might inject latency and see no obvious issues, but miss that error rates doubled or users abandoned transactions.

Instead: Establish comprehensive observability before chaos engineering. See Observability Overview for implementation patterns.

Starting with Production

Beginning chaos engineering in production without testing in lower environments is reckless. While production is the ultimate validation environment, it's not the place to learn how your chaos tooling works.

Why it's problematic: You might misconfigure chaos injection, causing more damage than intended. Your rollback procedures might not work. Your team might not know how to respond.

Instead: Start in development or staging environments. Validate experiment design, test rollback procedures, and train team members before moving to production.

Ignoring Action Items

Running game days and chaos experiments without acting on findings wastes time and money. If you discover gaps but don't fix them, the next real incident will hit the same gaps.

Why it's problematic: Teams become cynical about chaos engineering if it's just "breaking things without improving anything." Unaddressed gaps remain vulnerabilities.

Instead: Treat action items from chaos experiments like any other engineering work. Assign owners, set deadlines, track progress, and review in team meetings.

Chaos for Chaos Sake

Running chaos experiments without clear objectives or hypotheses doesn't build confidence - it builds confusion. Random destruction isn't engineering.

Why it's problematic: Without objectives, you don't know what success looks like. You can't tell if experiments reveal issues or if everything is fine.

Instead: Define clear hypotheses before experiments. "We believe error rates will remain below 1% even if one database replica fails." This makes success measurable.

Not Communicating

Running production chaos experiments without notifying stakeholders causes panic. Customer support receives complaints about errors, product managers wonder why metrics dropped, and executives question system stability.

Why it's problematic: Chaos experiments can be mistaken for real incidents. This erodes trust in engineering teams and systems.

Instead: Communicate widely about planned experiments. Notify engineering, product, customer support, and executives at least 1 week in advance for production experiments.

Unbounded Blast Radius

Running chaos experiments without blast radius limits risks actual customer impact. Starting with 100% of production traffic is reckless.

Why it's problematic: If experiments reveal unexpected issues, many users are affected before you can abort.

Instead: Start with 1-5% of traffic, expand gradually based on confidence. Use geographic isolation, user segmentation, and time limits.
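One common way to implement percentage-based targeting is consistent bucketing: hash each user id into a fixed number of buckets and include only the first N. A hypothetical Java sketch (real chaos tools and feature-flag systems implement this more robustly, e.g. with seeded hashing per experiment):

```java
// Consistent percentage-based targeting: the same users are always in or out,
// so the blast radius stays bounded and observations stay comparable.
public class BlastRadius {

    public static boolean inExperiment(String userId, int percent) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percent;
    }

    public static void main(String[] args) {
        int targeted = 0;
        for (int i = 0; i < 10_000; i++) {
            if (inExperiment("user-" + i, 5)) {
                targeted++;
            }
        }
        System.out.println(targeted + " of 10000 users fall in the experiment group");
    }
}
```

Because `bucket < 5` is a subset of `bucket < 10`, raising the percentage keeps the originally targeted users in the experiment and only adds new ones, which makes gradual expansion safe and comparable.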

Best Practices Summary

  1. Start small - Begin with non-production environments and minimal blast radius; gradually expand scope as confidence grows
  2. Define success criteria - Establish clear steady-state metrics before experiments; know what "success" looks like
  3. Automate experiments - Integrate chaos into CI/CD pipelines; run continuously, not just during game days
  4. Monitor comprehensively - Ensure observability is in place before chaos engineering; you can't improve what you can't measure
  5. Always have rollback - Never inject failure without clear recovery procedures; test rollback procedures regularly
  6. Learn and improve - Document findings, create action items, and track improvements; chaos engineering is about learning, not just breaking things
  7. Communicate widely - Notify stakeholders of planned experiments; avoid surprise chaos that looks like actual incidents
  8. Limit blast radius - Use percentage-based targeting, time limits, and geographic isolation to minimize risk
  9. Build blameless culture - Focus on system improvement, not individual mistakes; chaos reveals weaknesses in systems, not people

Further Reading

Foundational Concepts

Industry Practice

Tools and Implementation