Spring Boot Resilience Patterns

Building fault-tolerant Spring Boot applications that gracefully handle failures in external dependencies.

Overview

Distributed systems fail. Networks are unreliable, services go down, databases become overloaded. Resilience patterns help your application survive these failures without cascading outages or poor user experience.

This guide covers practical resilience patterns using Resilience4j: circuit breakers, retries, timeouts, rate limiting, and bulkheads. Each pattern solves a specific failure scenario.


Core Principles

  • Fail fast: Don't wait for timeouts, detect failures quickly
  • Isolate failures: One failing dependency shouldn't bring down the whole system
  • Graceful degradation: Provide reduced functionality instead of complete failure
  • Self-healing: Automatically recover when dependencies come back online
  • Avoid retry storms: Back off and give failing services time to recover

Understanding the Problem

Without resilience patterns, a single slow/failing dependency can bring down your entire application:

Your Service → Slow Database

Thread pool exhausted waiting for DB responses

All requests blocked (even ones not using the DB)

Complete service outage

Resilience patterns prevent this cascade by limiting blast radius and enabling recovery.


Setup

build.gradle:

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-aop' // Required
    implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-circuitbreaker:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-retry:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-timelimiter:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-ratelimiter:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-bulkhead:2.3.0'
}

Why AOP is required: Resilience4j uses Spring AOP to intercept method calls and wrap them with resilience logic. Without it, the annotations won't work.


Circuit Breaker Pattern

The Problem It Solves

When a dependency is down, you keep calling it and waiting for timeouts. This wastes resources and delays failure detection:

Payment Service calls Account Service (down)
→ Wait 5 seconds for timeout
→ Retry 3 times = 15 seconds wasted per request
→ Hundreds of threads blocked waiting
→ System grinds to halt

How Circuit Breakers Work

A circuit breaker tracks failures and "opens" after a threshold, immediately failing requests without calling the dependency:

States explained:

  • Closed: Everything working, calls pass through
  • Open: Too many failures, calls fail immediately (no waiting)
  • Half-Open: Testing if dependency recovered (allow limited calls)
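The loop between these states can be sketched as a plain-Java toy (an illustration of the mechanics only, not the Resilience4j implementation, which adds time-based sliding windows, slow-call tracking, and thread safety):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy circuit breaker: counts failures over a fixed-size window of recent
// calls and opens once the failure rate crosses a threshold.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // e.g. 0.5 = 50%
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;

    ToyCircuitBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    State state() { return state; }

    boolean callPermitted() { return state != State.OPEN; }

    void recordResult(boolean failed) {
        window.addLast(failed);
        if (window.size() > windowSize) window.removeFirst();
        long failures = window.stream().filter(f -> f).count();
        if (window.size() == windowSize
                && (double) failures / windowSize >= failureRateThreshold) {
            state = State.OPEN; // too many failures: stop calling the dependency
        }
    }

    // Called once wait-duration-in-open-state has elapsed
    void tryHalfOpen() { if (state == State.OPEN) state = State.HALF_OPEN; }

    // The result of a trial call while half-open decides the next state
    void recordTrialResult(boolean failed) {
        if (state == State.HALF_OPEN) {
            state = failed ? State.OPEN : State.CLOSED;
            if (!failed) window.clear();
        }
    }
}
```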

Configuration

application.yml:

resilience4j:
  circuitbreaker:
    configs:
      default:
        sliding-window-size: 10                          # Track last 10 calls
        failure-rate-threshold: 50                       # Open if ≥50% fail
        wait-duration-in-open-state: 30s                 # Wait 30s before testing recovery
        permitted-number-of-calls-in-half-open-state: 3  # Test with 3 calls
        slow-call-duration-threshold: 2s                 # Calls >2s counted as slow
        slow-call-rate-threshold: 50                     # Open if ≥50% slow
    instances:
      paymentGateway:
        base-config: default
        failure-rate-threshold: 60                       # Override: tolerate more failures for this service

What each setting means:

  • sliding-window-size: How many recent calls to track (rolling window)
  • failure-rate-threshold: Percent failures needed to open circuit (50% = half)
  • wait-duration-in-open-state: How long to wait before testing recovery
  • permitted-number-of-calls-in-half-open-state: Limited calls to test if service recovered
  • slow-call-duration-threshold: Calls taking longer than this are counted as slow, not outright failures
  • slow-call-rate-threshold: Too many slow calls also opens circuit

Basic Usage

@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "processPaymentFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        log.info("Calling payment gateway: amount={}", request.amount());

        // This call is protected by the circuit breaker.
        // While the circuit is open, this method body is never reached.
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    // Fallback method - called when the circuit is open or the call fails
    private PaymentResult processPaymentFallback(PaymentRequest request, Exception ex) {
        log.error("Payment gateway unavailable, using fallback: {}", ex.getMessage());

        // Option 1: Return a cached/default response
        return new PaymentResult(null, PaymentStatus.PENDING, "Payment queued");

        // Option 2: Throw a business exception instead
        // throw new PaymentGatewayUnavailableException("Gateway down", ex);
    }
}

How this works:

  1. First few calls to gateway fail
  2. After failure threshold (50%), circuit opens
  3. Next calls don't reach gateway - immediately return fallback
  4. After 30 seconds, circuit goes half-open
  5. If test calls succeed, circuit closes (back to normal)
  6. If test calls fail, circuit stays open for another 30 seconds

Fallback method signature: Must match the original method's parameters plus an exception parameter at the end. Declaring a more specific exception type (e.g. CallNotPermittedException) makes that fallback handle only that failure.

Monitoring Circuit State

@Component
@RequiredArgsConstructor
@Slf4j
public class CircuitBreakerMonitor {

    private final CircuitBreakerRegistry circuitBreakerRegistry;

    @PostConstruct
    public void registerEventListeners() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");

        cb.getEventPublisher()
                .onStateTransition(event -> {
                    // Circuit state changed (CLOSED → OPEN, etc.)
                    log.warn("Circuit breaker state transition: {} -> {}",
                            event.getStateTransition().getFromState(),
                            event.getStateTransition().getToState());

                    // Alert the operations team (alertOps is an app-specific hook)
                    if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
                        alertOps("Payment gateway circuit opened - service may be down");
                    }
                })
                .onFailureRateExceeded(event ->
                        log.error("Failure rate exceeded: {}%", event.getFailureRate()))
                .onSlowCallRateExceeded(event ->
                        log.warn("Slow call rate exceeded: {}%", event.getSlowCallRate()));
    }
}

Why monitor state transitions: Circuit opening means a dependency is failing. This is an operational alert - someone needs to investigate.


Retry Pattern

The Problem It Solves

Network glitches cause transient failures - temporary errors that resolve themselves:

Call fails due to network blip

Without retry: User sees error

With retry: Second call succeeds, user doesn't notice

How Retries Work

Automatically retry failed calls with exponential backoff to avoid overwhelming the failing service:

First attempt:  Fails
Wait 1 second
Second attempt: Fails
Wait 2 seconds (exponential backoff)
Third attempt: Succeeds! ✓
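The waits above follow wait(n) = initialWait × multiplier^(n−1). A small helper (hypothetical, just to make the arithmetic concrete) reproduces the schedule:

```java
import java.time.Duration;

// Computes the wait before retry attempt n (1-based) under
// exponential backoff: initial * multiplier^(n-1).
final class BackoffSchedule {
    static Duration waitBeforeAttempt(int attempt, Duration initial, double multiplier) {
        double factor = Math.pow(multiplier, attempt - 1);
        return Duration.ofMillis(Math.round(initial.toMillis() * factor));
    }
}
```

With an initial wait of 1s and multiplier 2, the waits before attempts 1, 2, 3 are 1s, 2s, 4s.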

Configuration

application.yml:

resilience4j:
  retry:
    configs:
      default:
        max-attempts: 3                     # Try 3 times total
        wait-duration: 1s                   # Initial wait between retries
        enable-exponential-backoff: true    # Increase wait each retry
        exponential-backoff-multiplier: 2   # Double the wait each retry
        retry-exceptions:                   # Only retry these exceptions
          - java.net.SocketTimeoutException
          - java.io.IOException
          - org.springframework.web.client.ResourceAccessException
        ignore-exceptions:                  # Never retry these
          - com.bank.payments.exception.ValidationException
          - java.lang.IllegalArgumentException
    instances:
      paymentGateway:
        base-config: default
        max-attempts: 5                     # More retries for the payment gateway
        wait-duration: 2s                   # Longer initial wait

Why exponential backoff: Prevents retry storms. If service is overloaded, constant retries make it worse. Exponential backoff gives it time to recover.

Why ignore-exceptions: Some errors are permanent (bad request, validation failure). Retrying them wastes time and resources.

Usage with Circuit Breaker

@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @Retry(name = "paymentGateway")         // Outer decorator: retries transient failures
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "fallback") // Inner: records each attempt
    public PaymentResult processPayment(PaymentRequest request) {
        log.info("Attempting payment gateway call (will retry on failure)");

        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    private PaymentResult fallback(PaymentRequest request, Exception ex) {
        log.error("All retries exhausted, circuit may open: {}", ex.getMessage());
        return new PaymentResult(null, PaymentStatus.FAILED, "Service unavailable");
    }
}

Order matters:

  1. With Resilience4j's default aspect order, @Retry is the outermost decorator and @CircuitBreaker sits inside it
  2. Each retry attempt therefore passes through the circuit breaker and is recorded individually
  3. After enough failed attempts, the failure rate crosses the threshold and the circuit opens
  4. While the circuit is open, attempts fail immediately with CallNotPermittedException - configure the retry to ignore that exception so it fails fast instead of retrying an open circuit
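When the circuit is open, Resilience4j rejects calls with CallNotPermittedException; adding that exception to the retry's ignore list keeps retries from hammering an open circuit. A config sketch (instance name as above):

```yaml
resilience4j:
  retry:
    instances:
      paymentGateway:
        base-config: default
        ignore-exceptions:
          # Don't retry when the circuit breaker is open - fail fast instead
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
```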

Custom Retry Logic

Sometimes you need different retry strategies:

@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentService {

    private final RetryRegistry retryRegistry;

    public PaymentResult processWithCustomRetry(PaymentRequest request) {
        // Create (or look up) a retry with custom settings. Use a fixed name:
        // per-request names would grow the registry without bound.
        Retry retry = retryRegistry.retry("payment-custom",
                RetryConfig.custom()
                        .maxAttempts(5)
                        .waitDuration(Duration.ofMillis(500))
                        // Custom predicate: only retry on timeouts, not business errors
                        .retryOnException(ex -> ex instanceof SocketTimeoutException)
                        .build());

        // Event listeners are registered on the retry instance, not the config
        retry.getEventPublisher().onRetry(event ->
                log.warn("Retry attempt {}: {}",
                        event.getNumberOfRetryAttempts(),
                        event.getLastThrowable().getMessage()));

        // Wrap the call in retry logic
        return retry.executeSupplier(() -> callPaymentGateway(request));
    }
}

Timeout Pattern

The Problem It Solves

Calls to external services can hang indefinitely, blocking threads:

Database query stuck

Thread waits forever

Thread pool exhausted

New requests can't get threads

Service appears down

Configuration

application.yml:

resilience4j:
  timelimiter:
    configs:
      default:
        timeout-duration: 5s          # Max time to wait
        cancel-running-future: true   # Cancel execution on timeout
    instances:
      paymentGateway:
        timeout-duration: 10s         # Longer timeout for payments

Usage

@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @TimeLimiter(name = "paymentGateway")
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "fallback")
    public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest request) {
        // TimeLimiter requires a CompletableFuture return type
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment with timeout protection");

            return restClient.post()
                    .uri("/api/payments")
                    .body(request)
                    .retrieve()
                    .body(PaymentResult.class);
        });
    }

    private CompletableFuture<PaymentResult> fallback(PaymentRequest request, Exception ex) {
        log.error("Payment timed out or failed: {}", ex.getMessage());

        return CompletableFuture.completedFuture(
                new PaymentResult(null, PaymentStatus.FAILED, "Request timed out"));
    }
}

Why CompletableFuture: TimeLimiter needs async execution to enforce timeouts. It can't timeout synchronous blocking calls.

What cancel-running-future does: When the timeout fires, the TimeLimiter calls cancel() on the running future. Note that a plain CompletableFuture does not interrupt its worker thread when cancelled, so a genuinely stuck task may keep running in the background even though the caller has already received the timeout.
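The underlying idea is the same as the JDK's own CompletableFuture.orTimeout: complete the future exceptionally once the deadline passes, then map the failure to a fallback. A library-free sketch:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// JDK-only illustration of the timeout idea: orTimeout completes the
// future exceptionally with a TimeoutException once the deadline passes,
// and exceptionally() maps that failure to a fallback value.
final class TimeoutSketch {
    static String withTimeout(CompletableFuture<String> call, long millis, String fallback) {
        return call.orTimeout(millis, TimeUnit.MILLISECONDS)
                   .exceptionally(ex -> fallback)
                   .join();
    }
}
```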


Rate Limiter Pattern

The Problem It Solves

Protects downstream services from being overwhelmed:

Bug causes infinite loop calling payment gateway

Thousands of requests per second

Gateway overloaded, crashes

All customers affected

Configuration

application.yml:

resilience4j:
  ratelimiter:
    configs:
      default:
        limit-for-period: 100      # Max 100 calls
        limit-refresh-period: 1s   # Per 1-second window
        timeout-duration: 0s       # Don't wait if limit exceeded (fail immediately)
    instances:
      paymentGateway:
        limit-for-period: 50       # Limit to 50 calls/second for the gateway

What each setting means:

  • limit-for-period: Maximum calls allowed in the time window
  • limit-refresh-period: Time window (resets counter after this duration)
  • timeout-duration: How long to wait for permission if limit reached (0 = fail immediately)
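These three settings amount to a counter that resets every refresh period. A toy fixed-window version (illustration only; Resilience4j's real limiter is more refined and can make callers wait up to timeout-duration for a permit):

```java
// Toy fixed-window rate limiter mirroring limit-for-period /
// limit-refresh-period semantics.
final class ToyRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodMillis;
    private long windowStart;
    private int usedInWindow;

    ToyRateLimiter(int limitForPeriod, long refreshPeriodMillis, long nowMillis) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMillis = refreshPeriodMillis;
        this.windowStart = nowMillis;
    }

    // timeout-duration: 0s behaviour - permit or reject immediately
    boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= refreshPeriodMillis) {
            windowStart = nowMillis;  // new window: counter resets
            usedInWindow = 0;
        }
        if (usedInWindow < limitForPeriod) {
            usedInWindow++;
            return true;
        }
        return false; // limit reached in this window
    }
}
```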

Usage

@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @RateLimiter(name = "paymentGateway", fallbackMethod = "rateLimitFallback")
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "circuitBreakerFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        // Protected by the rate limiter: max 50 calls/second
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    private PaymentResult rateLimitFallback(PaymentRequest request, RequestNotPermitted ex) {
        // Called only when the rate limit is exceeded
        log.warn("Rate limit exceeded for payment gateway");

        throw new RateLimitExceededException(
                "Too many requests to payment gateway. Please try again later.");
    }

    private PaymentResult circuitBreakerFallback(PaymentRequest request, Exception ex) {
        // Called when the circuit breaker is open or the call fails
        log.error("Payment gateway unavailable: {}", ex.getMessage());

        return new PaymentResult(null, PaymentStatus.PENDING, "Payment queued");
    }
}

Multiple fallbacks: Different exceptions trigger different fallbacks. Rate limit gets specific error message, circuit breaker queues the payment.


Bulkhead Pattern

The Problem It Solves

One slow dependency exhausts all threads, blocking unrelated operations:

Payment gateway slow (taking 10s per request)

All 100 threads waiting on payment gateway

Customer profile API (fast, 50ms) can't get threads

Entire service appears down

How Bulkheads Work

Isolate thread pools for different dependencies:

Thread Pool (100 threads total)
├─ Payment Gateway Bulkhead (10 threads)
├─ Account Service Bulkhead (10 threads)
└─ Other Operations (80 threads)

Payment gateway slow? Only 10 threads blocked.
Other operations still have 90 threads available.
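The semaphore flavour of this isolation is just a pool of permits. A stripped-down sketch using the JDK's Semaphore (the same idea behind Resilience4j's semaphore bulkhead):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Minimal semaphore bulkhead: at most maxConcurrentCalls callers may be
// inside execute() at once; extra callers are rejected immediately
// (the max-wait-duration: 0 behaviour).
final class ToyBulkhead {
    private final Semaphore permits;

    ToyBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T execute(Supplier<T> call, Supplier<T> rejectedFallback) {
        if (!permits.tryAcquire()) {
            return rejectedFallback.get(); // bulkhead full: fail fast
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() { return permits.availablePermits(); }
}
```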

Configuration

application.yml:

resilience4j:
  bulkhead:
    configs:
      default:
        max-concurrent-calls: 25   # Max 25 concurrent calls
        max-wait-duration: 0       # Don't wait if the limit is reached
    instances:
      paymentGateway:
        max-concurrent-calls: 10   # Limit the payment gateway to 10 concurrent calls

Semaphore Bulkhead (Simple)

@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final PaymentQueueService queueService;

    @Bulkhead(name = "paymentGateway", type = Bulkhead.Type.SEMAPHORE,
              fallbackMethod = "bulkheadFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        // Only 10 concurrent calls allowed; the 11th is rejected immediately
        return callGateway(request);
    }

    private PaymentResult bulkheadFallback(PaymentRequest request, BulkheadFullException ex) {
        log.warn("Bulkhead full - too many concurrent payment gateway calls");

        // Option: queue for later processing
        queueService.enqueuePayment(request);

        return new PaymentResult(null, PaymentStatus.PENDING,
                "Payment queued due to high load");
    }
}

Semaphore vs Thread Pool: Semaphore is simpler (just counts concurrent calls). Thread pool actually isolates execution but requires async code.

Thread Pool Bulkhead (Stronger Isolation)

@Service
@RequiredArgsConstructor
public class PaymentGatewayClient {

    @Bulkhead(name = "paymentGateway", type = Bulkhead.Type.THREADPOOL,
              fallbackMethod = "bulkheadFallback")
    public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest request) {
        // The thread-pool bulkhead runs this method on its own dedicated pool,
        // so the blocking call can be made directly and wrapped in a completed future
        return CompletableFuture.completedFuture(callGateway(request));
    }

    private CompletableFuture<PaymentResult> bulkheadFallback(
            PaymentRequest request, Exception ex) {

        return CompletableFuture.completedFuture(
                new PaymentResult(null, PaymentStatus.PENDING, "Service busy"));
    }
}

Thread pool config (application.yml):

resilience4j:
  thread-pool-bulkhead:
    configs:
      default:
        core-thread-pool-size: 5   # Minimum threads
        max-thread-pool-size: 10   # Maximum threads
        queue-capacity: 20         # Queue size for waiting tasks

Combining Patterns

Real production code uses multiple patterns together:

@Service
@RequiredArgsConstructor
@Slf4j
public class ResilientPaymentGatewayClient {

    private final RestClient restClient;

    @TimeLimiter(name = "paymentGateway")      // Timeout after 10s
    @Bulkhead(name = "paymentGateway",         // Limit to 10 concurrent on a dedicated pool
            type = Bulkhead.Type.THREADPOOL)
    @RateLimiter(name = "paymentGateway")      // Max 50 calls/second
    @Retry(name = "paymentGateway")            // Retry transient failures
    @CircuitBreaker(name = "paymentGateway",   // Open circuit after too many failures
            fallbackMethod = "fallback")
    public CompletableFuture<PaymentResult> processPayment(PaymentRequest request) {
        log.info("Processing payment with full resilience stack");

        // The thread-pool bulkhead executes this body on its dedicated pool,
        // so the blocking call can be made directly
        PaymentResult result = restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);

        return CompletableFuture.completedFuture(result);
    }

    private CompletableFuture<PaymentResult> fallback(
            PaymentRequest request, Exception ex) {

        log.error("Payment processing failed after all resilience attempts", ex);

        // Determine the appropriate fallback based on exception type
        if (ex instanceof BulkheadFullException) {
            // Too many concurrent requests
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.PENDING, "System busy - queued"));
        } else if (ex instanceof RequestNotPermitted
                || ex instanceof CallNotPermittedException) {
            // Rate limit hit (RequestNotPermitted) or circuit open (CallNotPermittedException)
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.FAILED, "Service temporarily unavailable"));
        } else {
            // Generic failure
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.FAILED, "Payment failed"));
        }
    }
}

Execution order (outermost first):

With Resilience4j's default aspect order, a call passes through the decorators like this:

  1. Retry - re-invokes everything inside it on transient failures
  2. CircuitBreaker - records each attempt, opens after too many failures
  3. RateLimiter - rejects the call if over 50 calls/second
  4. TimeLimiter - enforces the 10s timeout on the returned future
  5. Bulkhead - runs the call on the dedicated thread pool (max 10 concurrent)

Why this order: Each pattern protects against a different failure mode. Together they provide defense in depth.


Graceful Degradation

When primary service fails, provide reduced functionality:

@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentService {

    private final PaymentGatewayClient primaryGateway;
    private final PaymentGatewayClient secondaryGateway;
    private final PaymentQueueService queueService;
    private final PaymentCache cache;

    public PaymentResult processPayment(PaymentRequest request) {
        try {
            // Try the primary gateway
            return primaryGateway.processPayment(request).get();

        } catch (Exception ex) {
            log.warn("Primary gateway failed, trying secondary: {}", ex.getMessage());

            try {
                // Fall back to the secondary gateway
                return secondaryGateway.processPayment(request).get();

            } catch (Exception ex2) {
                log.error("Both gateways failed, queuing for later", ex2);

                // Last resort: queue for background processing
                queueService.enqueuePayment(request);

                return new PaymentResult(
                        null,
                        PaymentStatus.PENDING,
                        "Payment queued for processing");
            }
        }
    }

    public PaymentResult getPaymentStatus(String paymentId) {
        try {
            // Try a live lookup first
            return primaryGateway.getStatus(paymentId).get();

        } catch (Exception ex) {
            log.warn("Gateway unavailable, returning cached status");

            // Fall back to the cache (may be stale, but better than nothing)
            return cache.getPaymentStatus(paymentId)
                    .orElseThrow(() -> new PaymentNotFoundException(paymentId));
        }
    }
}

Degradation levels:

  1. Primary gateway (best)
  2. Secondary gateway (backup)
  3. Queue for later (deferred)
  4. Cached data (stale but available)
  5. Error (last resort)

Summary

Resilience Patterns:

  1. Circuit Breaker: Stop calling failing services, fail fast
  2. Retry: Automatically retry transient failures with exponential backoff
  3. Timeout: Don't wait forever, cancel slow operations
  4. Rate Limiter: Protect downstream services from overload
  5. Bulkhead: Isolate thread pools to prevent cascade failures

When to Use Each:

  • Circuit Breaker: External service calls (API, database, cache)
  • Retry: Network errors, temporary failures
  • Timeout: Any external call (prevent indefinite waits)
  • Rate Limiter: Calls to rate-limited APIs, protecting shared resources
  • Bulkhead: Critical operations that must not block other operations

Best Practices:

  • Combine patterns for defense in depth
  • Always provide fallback methods
  • Monitor circuit breaker state transitions
  • Use exponential backoff for retries
  • Set realistic timeouts (not too short, not too long)
  • Test resilience with chaos engineering (simulate failures)

Cross-References: