Spring Boot Resilience Patterns
Building fault-tolerant Spring Boot applications that gracefully handle failures in external dependencies.
Overview
Distributed systems fail. Networks are unreliable, services go down, databases become overloaded. Resilience patterns help your application survive these failures without cascading outages or poor user experience.
This guide covers practical resilience patterns using Resilience4j: circuit breakers, retries, timeouts, rate limiting, and bulkheads. Each pattern solves a specific failure scenario.
Core Principles
- Fail fast: Don't wait for timeouts, detect failures quickly
- Isolate failures: One failing dependency shouldn't bring down the whole system
- Graceful degradation: Provide reduced functionality instead of complete failure
- Self-healing: Automatically recover when dependencies come back online
- Avoid retry storms: Back off and give failing services time to recover
Understanding the Problem
Without resilience patterns, a single slow/failing dependency can bring down your entire application:
Your Service → Slow Database
↓
Thread pool exhausted waiting for DB responses
↓
All requests blocked (even ones not using the DB)
↓
Complete service outage
Resilience patterns prevent this cascade by limiting blast radius and enabling recovery.
Setup
build.gradle:
dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-aop' // Required
    implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-circuitbreaker:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-retry:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-timelimiter:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-ratelimiter:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-bulkhead:2.3.0'
}
Why AOP is required: Resilience4j uses Spring AOP to intercept method calls and wrap them with resilience logic. Without it, the annotations won't work. Note that Spring AOP only intercepts calls that go through the bean's proxy: self-invocation (a method calling an annotated method on the same bean) bypasses the annotations entirely.
Circuit Breaker Pattern
The Problem It Solves
When a dependency is down, you keep calling it and waiting for timeouts. This wastes resources and delays failure detection:
Payment Service calls Account Service (down)
→ Wait 5 seconds for timeout
→ Retry 3 times = 15 seconds wasted per request
→ Hundreds of threads blocked waiting
→ System grinds to halt
How Circuit Breakers Work
A circuit breaker tracks failures and "opens" after a threshold, immediately failing requests without calling the dependency:
States explained:
- Closed: Everything working, calls pass through
- Open: Too many failures, calls fail immediately (no waiting)
- Half-Open: Testing if dependency recovered (allow limited calls)
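The three states can be sketched as a tiny, library-free state machine. This is an illustration only: the class name and the consecutive-failure counter are invented for the example, whereas Resilience4j tracks a sliding window of call outcomes and failure rates.

```java
import java.time.Duration;
import java.time.Instant;

// Toy circuit breaker: opens after N consecutive failures,
// half-opens once the cool-down period has elapsed.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    ToyCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    synchronized boolean allowCall(Instant now) {
        if (state == State.OPEN && now.isAfter(openedAt.plus(openDuration))) {
            state = State.HALF_OPEN;          // cool-down elapsed: probe the dependency
        }
        return state != State.OPEN;           // open circuit rejects calls immediately
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;                 // recovery confirmed
    }

    synchronized void recordFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;               // stop calling the dependency
            openedAt = now;
        }
    }

    synchronized State state() { return state; }
}
```

Driving it through the full lifecycle (fail, open, cool down, probe, recover) shows the transitions the bullet list describes.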
Configuration
application.yml:
resilience4j:
  circuitbreaker:
    configs:
      default:
        sliding-window-size: 10                          # Track last 10 calls
        failure-rate-threshold: 50                       # Open if ≥50% fail
        wait-duration-in-open-state: 30s                 # Wait 30s before testing recovery
        permitted-number-of-calls-in-half-open-state: 3  # Test with 3 calls
        slow-call-duration-threshold: 2s                 # Calls >2s counted as slow
        slow-call-rate-threshold: 50                     # Open if ≥50% slow
    instances:
      paymentGateway:
        base-config: default
        failure-rate-threshold: 60                       # Override: tolerate more failures for this service
What each setting means:
- sliding-window-size: how many recent calls to track (rolling window)
- failure-rate-threshold: percentage of failures needed to open the circuit (50 = half)
- wait-duration-in-open-state: how long to stay open before testing recovery
- permitted-number-of-calls-in-half-open-state: number of trial calls allowed while half-open
- slow-call-duration-threshold: calls slower than this (2s) count as slow
- slow-call-rate-threshold: the circuit also opens if too many calls are slow
Basic Usage
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "processPaymentFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        log.info("Calling payment gateway: amount={}", request.amount());
        // This call is protected by the circuit breaker.
        // If the gateway is down, the circuit opens and this method isn't invoked.
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    // Fallback method - called when the circuit is open or the call fails
    private PaymentResult processPaymentFallback(PaymentRequest request, Exception ex) {
        log.error("Payment gateway unavailable, using fallback: {}", ex.getMessage());
        // Option 1: Return a cached/default response
        return new PaymentResult(null, PaymentStatus.PENDING, "Payment queued");
        // Option 2: Throw a business exception
        // throw new PaymentGatewayUnavailableException("Gateway down", ex);
    }
}
How this works:
- First few calls to gateway fail
- After failure threshold (50%), circuit opens
- Next calls don't reach gateway - immediately return fallback
- After 30 seconds, circuit goes half-open
- If test calls succeed, circuit closes (back to normal)
- If test calls fail, circuit stays open for another 30 seconds
Fallback method signature: Must match original method parameters + Exception at end.
Monitoring Circuit State
@Component
@RequiredArgsConstructor
@Slf4j
public class CircuitBreakerMonitor {

    private final CircuitBreakerRegistry circuitBreakerRegistry;

    @PostConstruct
    public void registerEventListeners() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.getEventPublisher()
            .onStateTransition(event -> {
                // Circuit state changed (CLOSED → OPEN, etc.)
                log.warn("Circuit breaker state transition: {} -> {}",
                        event.getStateTransition().getFromState(),
                        event.getStateTransition().getToState());
                // Alert the operations team (alertOps is an application-specific notifier, not shown)
                if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
                    alertOps("Payment gateway circuit opened - service may be down");
                }
            })
            .onFailureRateExceeded(event ->
                log.error("Failure rate exceeded: {}%", event.getFailureRate()))
            .onSlowCallRateExceeded(event ->
                log.warn("Slow call rate exceeded: {}%", event.getSlowCallRate()));
    }
}
Why monitor state transitions: Circuit opening means a dependency is failing. This is an operational alert - someone needs to investigate.
Retry Pattern
The Problem It Solves
Network glitches cause transient failures - temporary errors that resolve themselves:
Call fails due to network blip
↓
Without retry: User sees error
↓
With retry: Second call succeeds, user doesn't notice
How Retries Work
Automatically retry failed calls with exponential backoff to avoid overwhelming the failing service:
First attempt: Fails
Wait 1 second
Second attempt: Fails
Wait 2 seconds (exponential backoff)
Third attempt: Succeeds! ✓
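The delay sequence above can be computed directly. This small helper (`BackoffSchedule` is a made-up name for illustration, not Resilience4j API) mirrors the effect of an initial wait and an exponential multiplier:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Computes the wait before each retry: initialWait * multiplier^(retry - 1).
class BackoffSchedule {
    static List<Duration> delays(Duration initialWait, double multiplier, int maxAttempts) {
        List<Duration> delays = new ArrayList<>();
        double waitMillis = initialWait.toMillis();
        // maxAttempts includes the first call, so there are maxAttempts - 1 retries
        for (int retry = 1; retry < maxAttempts; retry++) {
            delays.add(Duration.ofMillis((long) waitMillis));
            waitMillis *= multiplier;         // exponential growth per retry
        }
        return delays;
    }
}
```

With an initial wait of 1s, multiplier 2, and 3 attempts, the waits are 1s then 2s, matching the diagram.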
Configuration
application.yml:
resilience4j:
  retry:
    configs:
      default:
        max-attempts: 3                      # Try 3 times total
        wait-duration: 1s                    # Initial wait between retries
        enable-exponential-backoff: true     # Increase wait each retry
        exponential-backoff-multiplier: 2    # Double wait time each retry
        retry-exceptions:                    # Only retry these exceptions
          - java.net.SocketTimeoutException
          - java.io.IOException
          - org.springframework.web.client.ResourceAccessException
        ignore-exceptions:                   # Never retry these
          - com.bank.payments.exception.ValidationException
          - java.lang.IllegalArgumentException
    instances:
      paymentGateway:
        base-config: default
        max-attempts: 5                      # More retries for payment gateway
        wait-duration: 2s                    # Longer initial wait
Why exponential backoff: Prevents retry storms. If service is overloaded, constant retries make it worse. Exponential backoff gives it time to recover.
Why ignore-exceptions: Some errors are permanent (bad request, validation failure). Retrying them wastes time and resources.
Usage with Circuit Breaker
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @Retry(name = "paymentGateway")          // Outermost aspect: retries wrap the circuit breaker
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "fallback")
    public PaymentResult processPayment(PaymentRequest request) {
        log.info("Attempting payment gateway call (will retry if it fails)");
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    private PaymentResult fallback(PaymentRequest request, Exception ex) {
        log.error("All retries exhausted: {}", ex.getMessage());
        return new PaymentResult(null, PaymentStatus.FAILED, "Service unavailable");
    }
}
Order matters: with Resilience4j's default aspect order, @Retry is the outermost aspect and wraps @CircuitBreaker:
- @Retry re-invokes the call on transient failures
- Each attempt passes through @CircuitBreaker, so every failed attempt is recorded in its sliding window
- After enough recorded failures, the circuit opens
- While the circuit is open, attempts fail immediately with CallNotPermittedException; since that exception is not in retry-exceptions, remaining retries fail fast without reaching the gateway
Custom Retry Logic
Sometimes you need different retry strategies:
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentService {

    private final RetryRegistry retryRegistry;

    public PaymentResult processWithCustomRetry(PaymentRequest request) {
        // Create a custom retry for this specific call.
        // Note: this registers one entry per customer id in the registry;
        // prefer a shared name unless you really need per-customer state.
        Retry retry = retryRegistry.retry("payment-" + request.customerId(),
                RetryConfig.custom()
                        .maxAttempts(5)
                        .waitDuration(Duration.ofMillis(500))
                        // Custom logic: only retry on timeout, not business errors
                        .retryOnException(ex -> ex instanceof SocketTimeoutException)
                        .build());

        // The config builder has no onRetry hook; attempt listeners are
        // registered on the event publisher instead
        retry.getEventPublisher().onRetry(event ->
                log.warn("Retry attempt {}: {}",
                        event.getNumberOfRetryAttempts(),
                        event.getLastThrowable().getMessage()));

        // Wrap the call in retry logic (executeSupplier avoids the checked
        // exception that executeCallable declares)
        return retry.executeSupplier(() -> callPaymentGateway(request));
    }
}
Timeout Pattern
The Problem It Solves
Calls to external services can hang indefinitely, blocking threads:
Database query stuck
↓
Thread waits forever
↓
Thread pool exhausted
↓
New requests can't get threads
↓
Service appears down
Configuration
application.yml:
resilience4j:
  timelimiter:
    configs:
      default:
        timeout-duration: 5s           # Max time to wait
        cancel-running-future: true    # Cancel the future on timeout
    instances:
      paymentGateway:
        timeout-duration: 10s          # Longer timeout for payments
Usage
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @TimeLimiter(name = "paymentGateway")
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "fallback")
    public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest request) {
        // TimeLimiter requires a CompletableFuture return type
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment with timeout protection");
            return restClient.post()
                    .uri("/api/payments")
                    .body(request)
                    .retrieve()
                    .body(PaymentResult.class);
        });
    }

    private CompletableFuture<PaymentResult> fallback(PaymentRequest request, Exception ex) {
        log.error("Payment timed out or failed: {}", ex.getMessage());
        return CompletableFuture.completedFuture(
                new PaymentResult(null, PaymentStatus.FAILED, "Request timed out"));
    }
}
Why CompletableFuture: TimeLimiter needs async execution to enforce timeouts. It can't timeout synchronous blocking calls.
What cancel-running-future does: when the timeout fires, the TimeLimiter calls cancel on the returned future. Be aware that CompletableFuture.cancel does not interrupt the thread actually running the task, so the underlying work may keep going in the background; interruption only takes effect for future types that propagate it (for example, futures returned by an ExecutorService submission).
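This timeout behavior can be seen with the JDK alone: CompletableFuture.orTimeout (Java 9+) fails the future after the deadline, but the underlying task keeps running because cancellation does not interrupt it. A sketch (`TimeoutDemo` is a made-up class, not Resilience4j API):

```java
import java.util.concurrent.*;

// Bounds a slow call with a deadline; the caller gets a fallback on timeout,
// but note the timed-out task keeps running until it finishes or is interrupted.
class TimeoutDemo {
    static String callWithTimeout(ExecutorService pool, long taskMillis, long timeoutMillis) {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(taskMillis);     // stand-in for a slow remote call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "OK";
        }, pool);
        try {
            return future.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS).join();
        } catch (CompletionException e) {
            return "TIMED_OUT";               // fallback path: deadline exceeded or call failed
        }
    }
}
```

A fast task completes normally; a slow one trips the deadline and the caller takes the fallback path instead of blocking.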
Rate Limiter Pattern
The Problem It Solves
Protects downstream services from being overwhelmed:
Bug causes infinite loop calling payment gateway
↓
Thousands of requests per second
↓
Gateway overloaded, crashes
↓
All customers affected
Configuration
application.yml:
resilience4j:
  ratelimiter:
    configs:
      default:
        limit-for-period: 100      # Max 100 calls
        limit-refresh-period: 1s   # Per 1 second window
        timeout-duration: 0s       # Don't wait if limit exceeded (fail immediately)
    instances:
      paymentGateway:
        limit-for-period: 50       # Limit to 50 calls/second for gateway
What each setting means:
- limit-for-period: maximum calls allowed in the time window
- limit-refresh-period: length of the window (the counter resets after this duration)
- timeout-duration: how long to wait for a permit if the limit is reached (0 = fail immediately)
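The accounting behind these settings can be sketched without the library. `FixedWindowRateLimiter` below is invented for illustration; Resilience4j's RateLimiter is more precise and can also make callers wait for a permit:

```java
// Minimal fixed-window limiter: at most limitForPeriod permits per window,
// with the counter reset when a new window begins.
class FixedWindowRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long refreshPeriodNanos, long nowNanos) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
        this.windowStart = nowNanos;
    }

    synchronized boolean tryAcquire(long nowNanos) {
        if (nowNanos - windowStart >= refreshPeriodNanos) {
            windowStart = nowNanos;           // new window: reset the counter
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;                      // permit granted
        }
        return false;                         // over the limit: fail immediately
    }
}
```

With a limit of 2 per second, the third call in a window is rejected and the counter resets once the next window opens.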
Usage
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @RateLimiter(name = "paymentGateway", fallbackMethod = "rateLimitFallback")
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "circuitBreakerFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        // Protected by the rate limiter: max 50 calls/second
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    private PaymentResult rateLimitFallback(PaymentRequest request, RequestNotPermitted ex) {
        // Called when the rate limit is exceeded
        log.warn("Rate limit exceeded for payment gateway");
        throw new RateLimitExceededException(
                "Too many requests to payment gateway. Please try again later.");
    }

    private PaymentResult circuitBreakerFallback(PaymentRequest request, Exception ex) {
        // Called when the circuit breaker is open or the call fails
        log.error("Payment gateway unavailable: {}", ex.getMessage());
        return new PaymentResult(null, PaymentStatus.PENDING, "Payment queued");
    }
}
Multiple fallbacks: Different exceptions trigger different fallbacks. Rate limit gets specific error message, circuit breaker queues the payment.
Bulkhead Pattern
The Problem It Solves
One slow dependency exhausts all threads, blocking unrelated operations:
Payment gateway slow (taking 10s per request)
↓
All 100 threads waiting on payment gateway
↓
Customer profile API (fast, 50ms) can't get threads
↓
Entire service appears down
How Bulkheads Work
Isolate thread pools for different dependencies:
Thread Pool (100 threads total)
├─ Payment Gateway Bulkhead (10 threads)
├─ Account Service Bulkhead (10 threads)
└─ Other Operations (80 threads)
Payment gateway slow? Only 10 threads blocked.
Other operations still have 90 threads available.
Configuration
application.yml:
resilience4j:
  bulkhead:
    configs:
      default:
        max-concurrent-calls: 25   # Max 25 concurrent calls
        max-wait-duration: 0       # Don't wait if max reached
    instances:
      paymentGateway:
        max-concurrent-calls: 10   # Limit payment gateway to 10 concurrent
Semaphore Bulkhead (Simple)
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final PaymentQueueService queueService;

    @Bulkhead(name = "paymentGateway", type = Bulkhead.Type.SEMAPHORE,
              fallbackMethod = "bulkheadFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        // Only 10 concurrent calls allowed; the 11th is rejected immediately
        return callGateway(request); // the HTTP call shown earlier (omitted here)
    }

    private PaymentResult bulkheadFallback(PaymentRequest request, BulkheadFullException ex) {
        log.warn("Bulkhead full - too many concurrent payment gateway calls");
        // Option: queue for later processing
        queueService.enqueuePayment(request);
        return new PaymentResult(null, PaymentStatus.PENDING,
                "Payment queued due to high load");
    }
}
Semaphore vs Thread Pool: Semaphore is simpler (just counts concurrent calls). Thread pool actually isolates execution but requires async code.
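The semaphore variant boils down to a counting semaphore with a non-blocking acquire. A library-free sketch (`SemaphoreBulkhead` and its fallback supplier are invented for this example):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore bulkhead in miniature: at most maxConcurrent callers run the
// protected operation; excess callers are rejected immediately instead of
// queuing up and tying down threads.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> operation, Supplier<T> rejectedFallback) {
        if (!permits.tryAcquire()) {
            return rejectedFallback.get();    // bulkhead full: fail fast
        }
        try {
            return operation.get();
        } finally {
            permits.release();                // always return the permit
        }
    }
}
```

With one permit, a call made while the permit is held (simulated below with a nested call) is rejected, and the permit is released afterwards.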
Thread Pool Bulkhead (Stronger Isolation)
@Service
@RequiredArgsConstructor
public class PaymentGatewayClient {

    @Bulkhead(name = "paymentGateway", type = Bulkhead.Type.THREADPOOL,
              fallbackMethod = "bulkheadFallback")
    public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest request) {
        // Runs on a dedicated thread pool, isolated from the main request threads
        return CompletableFuture.supplyAsync(() -> callGateway(request));
    }

    private CompletableFuture<PaymentResult> bulkheadFallback(
            PaymentRequest request, Exception ex) {
        return CompletableFuture.completedFuture(
                new PaymentResult(null, PaymentStatus.PENDING, "Service busy"));
    }
}
Thread pool config (application.yml):
resilience4j:
  thread-pool-bulkhead:
    configs:
      default:
        core-thread-pool-size: 5   # Minimum threads
        max-thread-pool-size: 10   # Maximum threads
        queue-capacity: 20         # Queue size for waiting tasks
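The dedicated pool behaves like a bounded JDK executor: when the threads and the queue are both full, new work is rejected rather than spilling into the rest of the application. A stdlib sketch of that saturation behavior (`ThreadPoolBulkheadDemo` is a made-up name, not Resilience4j API):

```java
import java.util.concurrent.*;

// Thread-pool bulkhead in miniature: a bounded executor per dependency.
// When the pool and its queue are full, submissions fail fast with
// RejectedExecutionException instead of consuming the caller's threads.
class ThreadPoolBulkheadDemo {
    static ExecutorService boundedPool(int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded queue
                new ThreadPoolExecutor.AbortPolicy());     // reject when saturated
    }
}
```

With one thread and a queue of one, the first task occupies the thread, the second fills the queue, and the third submission is rejected immediately.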
Combining Patterns
Real production code uses multiple patterns together:
@Service
@RequiredArgsConstructor
@Slf4j
public class ResilientPaymentGatewayClient {

    private final RestClient restClient;

    @TimeLimiter(name = "paymentGateway")      // Timeout after 10s
    @Bulkhead(name = "paymentGateway",         // Limit to 10 concurrent
              type = Bulkhead.Type.THREADPOOL)
    @RateLimiter(name = "paymentGateway")      // Max 50 calls/second
    @Retry(name = "paymentGateway")            // Retry transient failures
    @CircuitBreaker(name = "paymentGateway",   // Open circuit after too many failures
                    fallbackMethod = "fallback")
    public CompletableFuture<PaymentResult> processPayment(PaymentRequest request) {
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment with full resilience stack");
            return restClient.post()
                    .uri("/api/payments")
                    .body(request)
                    .retrieve()
                    .body(PaymentResult.class);
        });
    }

    private CompletableFuture<PaymentResult> fallback(
            PaymentRequest request, Exception ex) {
        log.error("Payment processing failed after all resilience attempts", ex);
        // Determine the appropriate fallback based on exception type
        if (ex instanceof BulkheadFullException) {
            // Too many concurrent requests
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.PENDING, "System busy - queued"));
        } else if (ex instanceof RequestNotPermitted || ex instanceof CallNotPermittedException) {
            // Rate limit hit (RequestNotPermitted) or circuit open (CallNotPermittedException)
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.FAILED, "Service temporarily unavailable"));
        } else {
            // Generic failure
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.FAILED, "Payment failed"));
        }
    }
}
Execution order: Resilience4j's default aspect order nests the annotations as Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( method ) ) ) ) ), regardless of the order the annotations appear in:
- Bulkhead (innermost): runs the call on the dedicated thread pool, max 10 concurrent
- TimeLimiter: enforces the 10s timeout on each attempt
- RateLimiter: checks the 50 calls/second budget
- CircuitBreaker: records each attempt's outcome and opens after too many failures
- Retry (outermost): re-invokes the whole stack on transient failures
Why this order: Each pattern protects against a different failure mode. Together they provide defense in depth. The nesting can be adjusted via the aspect-order properties if your failure semantics require it.
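The nesting can be illustrated with plain function composition: each "aspect" wraps the supplier and records when it runs, so the outermost wrapper executes first on the way in. `AspectOrderDemo` is a teaching sketch, not Resilience4j internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Each wrapper notes its name before delegating inward, so the recorded
// trace shows the nesting order of the decorated call.
class AspectOrderDemo {
    static Supplier<String> named(String name, Supplier<String> inner, List<String> trace) {
        return () -> {
            trace.add(name);                  // entering this layer
            return inner.get();
        };
    }

    static List<String> run() {
        List<String> trace = new ArrayList<>();
        Supplier<String> call = () -> { trace.add("remoteCall"); return "OK"; };
        // Build inside-out: Bulkhead is innermost, Retry is outermost
        Supplier<String> decorated =
                named("Retry",
                named("CircuitBreaker",
                named("RateLimiter",
                named("TimeLimiter",
                named("Bulkhead", call, trace), trace), trace), trace), trace);
        decorated.get();
        return trace;
    }
}
```

Running it produces the trace Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead → remoteCall, mirroring the default nesting described above.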
Graceful Degradation
When primary service fails, provide reduced functionality:
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentService {

    private final PaymentGatewayClient primaryGateway;
    private final PaymentGatewayClient secondaryGateway;
    private final PaymentQueueService queueService;
    private final PaymentCache cache;

    public PaymentResult processPayment(PaymentRequest request) {
        try {
            // Try the primary gateway
            return primaryGateway.processPayment(request).get();
        } catch (Exception ex) {
            log.warn("Primary gateway failed, trying secondary: {}", ex.getMessage());
            try {
                // Fall back to the secondary gateway
                return secondaryGateway.processPayment(request).get();
            } catch (Exception ex2) {
                log.error("Both gateways failed, queuing for later", ex2);
                // Last resort: queue for background processing
                queueService.enqueuePayment(request);
                return new PaymentResult(
                        null,
                        PaymentStatus.PENDING,
                        "Payment queued for processing");
            }
        }
    }

    public PaymentResult getPaymentStatus(String paymentId) {
        try {
            // Try a live lookup first
            return primaryGateway.getStatus(paymentId).get();
        } catch (Exception ex) {
            log.warn("Gateway unavailable, returning cached status");
            // Fall back to the cache (may be stale, but better than nothing)
            return cache.getPaymentStatus(paymentId)
                    .orElseThrow(() -> new PaymentNotFoundException(paymentId));
        }
    }
}
Degradation levels:
- Primary gateway (best)
- Secondary gateway (backup)
- Queue for later (deferred)
- Cached data (stale but available)
- Error (last resort)
Summary
Resilience Patterns:
- Circuit Breaker: Stop calling failing services, fail fast
- Retry: Automatically retry transient failures with exponential backoff
- Timeout: Don't wait forever, cancel slow operations
- Rate Limiter: Protect downstream services from overload
- Bulkhead: Isolate thread pools to prevent cascade failures
When to Use Each:
- Circuit Breaker: External service calls (API, database, cache)
- Retry: Network errors, temporary failures
- Timeout: Any external call (prevent indefinite waits)
- Rate Limiter: Calls to rate-limited APIs, protecting shared resources
- Bulkhead: Critical operations that must not block other operations
Best Practices:
- Combine patterns for defense in depth
- Always provide fallback methods
- Monitor circuit breaker state transitions
- Use exponential backoff for retries
- Set realistic timeouts (not too short, not too long)
- Test resilience with chaos engineering (simulate failures)
Cross-References:
- See Spring Boot Observability for monitoring and alerting on resilience patterns
- See Performance Optimization for optimizing system performance
- See Performance Testing for load testing resilience patterns
- See Spring Boot General for application setup and configuration
- See Microservices Architecture for distributed system patterns
- See Event-Driven Architecture for async communication patterns