Spring Boot Resilience Patterns
Building fault-tolerant Spring Boot applications that gracefully handle failures in external dependencies.
Overview
Distributed systems fail. Networks are unreliable, services go down, databases become overloaded. Resilience patterns help your application survive these failures without cascading outages or poor user experience.
This guide covers practical resilience patterns using Resilience4j: circuit breakers, retries, timeouts, rate limiting, and bulkheads. Each pattern solves a specific failure scenario.
Core Principles
- Fail fast: Don't wait for timeouts, detect failures quickly
- Isolate failures: One failing dependency shouldn't bring down the whole system
- Graceful degradation: Provide reduced functionality instead of complete failure
- Self-healing: Automatically recover when dependencies come back online
- Avoid retry storms: Back off and give failing services time to recover
Understanding the Problem
Without resilience patterns, a single slow/failing dependency can bring down your entire application:
Your Service → Slow Database
↓
Thread pool exhausted waiting for DB responses
↓
All requests blocked (even ones not using the DB)
↓
Complete service outage
Resilience patterns prevent this cascade by limiting blast radius and enabling recovery.
Setup
build.gradle:
dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-aop' // Required
    implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-circuitbreaker:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-retry:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-timelimiter:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-ratelimiter:2.3.0'
    implementation 'io.github.resilience4j:resilience4j-bulkhead:2.3.0'
}
Why AOP is required: Resilience4j uses Spring AOP to intercept method calls and wrap them with resilience logic. Without it, the annotations won't work. Note that Spring AOP only intercepts calls that go through the bean's proxy: self-invocation (a method calling an annotated method on the same bean) bypasses the annotations entirely.
Circuit Breaker Pattern
The Problem It Solves
When a dependency is down, you keep calling it and waiting for timeouts. This wastes resources and delays failure detection:
Payment Service calls Account Service (down)
→ Wait 5 seconds for timeout
→ Retry 3 times = 15 seconds wasted per request
→ Hundreds of threads blocked waiting
→ System grinds to halt
How Circuit Breakers Work
A circuit breaker tracks failures and "opens" after a threshold, immediately failing requests without calling the dependency:
States explained:
- Closed: Everything working, calls pass through
- Open: Too many failures, calls fail immediately (no waiting)
- Half-Open: Testing if dependency recovered (allow limited calls)
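The three states can be sketched as a tiny, library-free state machine. This is an illustration only: the class name and the consecutive-failure counter are invented for the example, whereas Resilience4j tracks a sliding window of call outcomes and failure rates.

```java
import java.time.Duration;
import java.time.Instant;

// Toy circuit breaker: opens after N consecutive failures,
// half-opens once the cool-down period has elapsed.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    ToyCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    synchronized boolean allowCall(Instant now) {
        if (state == State.OPEN && now.isAfter(openedAt.plus(openDuration))) {
            state = State.HALF_OPEN;          // cool-down elapsed: probe the dependency
        }
        return state != State.OPEN;           // open circuit rejects calls immediately
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;                 // recovery confirmed
    }

    synchronized void recordFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;               // stop calling the dependency
            openedAt = now;
        }
    }

    synchronized State state() { return state; }
}
```

Driving it through the full lifecycle (fail, open, cool down, probe, recover) shows the transitions the bullet list describes.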
Configuration
application.yml:
resilience4j:
  circuitbreaker:
    configs:
      default:
        sliding-window-size: 10                          # Track last 10 calls
        failure-rate-threshold: 50                       # Open if ≥50% fail
        wait-duration-in-open-state: 30s                 # Wait 30s before testing recovery
        permitted-number-of-calls-in-half-open-state: 3  # Test with 3 calls
        slow-call-duration-threshold: 2s                 # Calls >2s counted as slow
        slow-call-rate-threshold: 50                     # Open if ≥50% slow
    instances:
      paymentGateway:
        base-config: default
        failure-rate-threshold: 60                       # Override: tolerate more failures for this service
What each setting means:
- sliding-window-size: how many recent calls to track (rolling window)
- failure-rate-threshold: percentage of failures needed to open the circuit (50 = half)
- wait-duration-in-open-state: how long to stay open before testing recovery
- permitted-number-of-calls-in-half-open-state: number of trial calls allowed while half-open
- slow-call-duration-threshold: calls slower than this (2s) count as slow
- slow-call-rate-threshold: the circuit also opens if too many calls are slow
Basic Usage
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "processPaymentFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        log.info("Calling payment gateway: amount={}", request.amount());
        // This call is protected by the circuit breaker.
        // If the gateway is down, the circuit opens and this method isn't invoked.
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    // Fallback method - called when the circuit is open or the call fails
    private PaymentResult processPaymentFallback(PaymentRequest request, Exception ex) {
        log.error("Payment gateway unavailable, using fallback: {}", ex.getMessage());
        // Option 1: Return a cached/default response
        return new PaymentResult(null, PaymentStatus.PENDING, "Payment queued");
        // Option 2: Throw a business exception
        // throw new PaymentGatewayUnavailableException("Gateway down", ex);
    }
}
How this works:
- First few calls to gateway fail
- After failure threshold (50%), circuit opens
- Next calls don't reach gateway - immediately return fallback
- After 30 seconds, circuit goes half-open
- If test calls succeed, circuit closes (back to normal)
- If test calls fail, circuit stays open for another 30 seconds
Fallback method signature: Must match original method parameters + Exception at end.
Monitoring Circuit State
@Component
@RequiredArgsConstructor
@Slf4j
public class CircuitBreakerMonitor {

    private final CircuitBreakerRegistry circuitBreakerRegistry;

    @PostConstruct
    public void registerEventListeners() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.getEventPublisher()
            .onStateTransition(event -> {
                // Circuit state changed (CLOSED → OPEN, etc.)
                log.warn("Circuit breaker state transition: {} -> {}",
                        event.getStateTransition().getFromState(),
                        event.getStateTransition().getToState());
                // Alert the operations team (alertOps is an application-specific notifier, not shown)
                if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
                    alertOps("Payment gateway circuit opened - service may be down");
                }
            })
            .onFailureRateExceeded(event ->
                log.error("Failure rate exceeded: {}%", event.getFailureRate()))
            .onSlowCallRateExceeded(event ->
                log.warn("Slow call rate exceeded: {}%", event.getSlowCallRate()));
    }
}
Why monitor state transitions: Circuit opening means a dependency is failing. This is an operational alert - someone needs to investigate.
Retry Pattern
The Problem It Solves
Network glitches cause transient failures - temporary errors that resolve themselves:
Call fails due to network blip
↓
Without retry: User sees error
↓
With retry: Second call succeeds, user doesn't notice
How Retries Work
Automatically retry failed calls with exponential backoff to avoid overwhelming the failing service:
First attempt: Fails
Wait 1 second
Second attempt: Fails
Wait 2 seconds (exponential backoff)
Third attempt: Succeeds! ✓
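The delay sequence above can be computed directly. This small helper (`BackoffSchedule` is a made-up name for illustration, not Resilience4j API) mirrors the effect of an initial wait and an exponential multiplier:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Computes the wait before each retry: initialWait * multiplier^(retry - 1).
class BackoffSchedule {
    static List<Duration> delays(Duration initialWait, double multiplier, int maxAttempts) {
        List<Duration> delays = new ArrayList<>();
        double waitMillis = initialWait.toMillis();
        // maxAttempts includes the first call, so there are maxAttempts - 1 retries
        for (int retry = 1; retry < maxAttempts; retry++) {
            delays.add(Duration.ofMillis((long) waitMillis));
            waitMillis *= multiplier;         // exponential growth per retry
        }
        return delays;
    }
}
```

With an initial wait of 1s, multiplier 2, and 3 attempts, the waits are 1s then 2s, matching the diagram.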
Configuration
application.yml:
resilience4j:
  retry:
    configs:
      default:
        max-attempts: 3                      # Try 3 times total
        wait-duration: 1s                    # Initial wait between retries
        enable-exponential-backoff: true     # Increase wait each retry
        exponential-backoff-multiplier: 2    # Double wait time each retry
        retry-exceptions:                    # Only retry these exceptions
          - java.net.SocketTimeoutException
          - java.io.IOException
          - org.springframework.web.client.ResourceAccessException
        ignore-exceptions:                   # Never retry these
          - com.bank.payments.exception.ValidationException
          - java.lang.IllegalArgumentException
    instances:
      paymentGateway:
        base-config: default
        max-attempts: 5                      # More retries for payment gateway
        wait-duration: 2s                    # Longer initial wait
Why exponential backoff: Prevents retry storms. If service is overloaded, constant retries make it worse. Exponential backoff gives it time to recover.
Why ignore-exceptions: Some errors are permanent (bad request, validation failure). Retrying them wastes time and resources.
Usage with Circuit Breaker
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @Retry(name = "paymentGateway")          // Outermost aspect: retries wrap the circuit breaker
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "fallback")
    public PaymentResult processPayment(PaymentRequest request) {
        log.info("Attempting payment gateway call (will retry if it fails)");
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    private PaymentResult fallback(PaymentRequest request, Exception ex) {
        log.error("All retries exhausted: {}", ex.getMessage());
        return new PaymentResult(null, PaymentStatus.FAILED, "Service unavailable");
    }
}
Order matters: with Resilience4j's default aspect order, @Retry is the outermost aspect and wraps @CircuitBreaker:
- @Retry re-invokes the call on transient failures
- Each attempt passes through @CircuitBreaker, so every failed attempt is recorded in its sliding window
- After enough recorded failures, the circuit opens
- While the circuit is open, attempts fail immediately with CallNotPermittedException; since that exception is not in retry-exceptions, remaining retries fail fast without reaching the gateway
Custom Retry Logic
Sometimes you need different retry strategies:
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentService {

    private final RetryRegistry retryRegistry;

    public PaymentResult processWithCustomRetry(PaymentRequest request) {
        // Create a custom retry for this specific call.
        // Note: this registers one entry per customer id in the registry;
        // prefer a shared name unless you really need per-customer state.
        Retry retry = retryRegistry.retry("payment-" + request.customerId(),
                RetryConfig.custom()
                        .maxAttempts(5)
                        .waitDuration(Duration.ofMillis(500))
                        // Custom logic: only retry on timeout, not business errors
                        .retryOnException(ex -> ex instanceof SocketTimeoutException)
                        .build());

        // The config builder has no onRetry hook; attempt listeners are
        // registered on the event publisher instead
        retry.getEventPublisher().onRetry(event ->
                log.warn("Retry attempt {}: {}",
                        event.getNumberOfRetryAttempts(),
                        event.getLastThrowable().getMessage()));

        // Wrap the call in retry logic (executeSupplier avoids the checked
        // exception that executeCallable declares)
        return retry.executeSupplier(() -> callPaymentGateway(request));
    }
}
Timeout Pattern
The Problem It Solves
Calls to external services can hang indefinitely, blocking threads:
Database query stuck
↓
Thread waits forever
↓
Thread pool exhausted
↓
New requests can't get threads
↓
Service appears down
Configuration
application.yml:
resilience4j:
  timelimiter:
    configs:
      default:
        timeout-duration: 5s           # Max time to wait
        cancel-running-future: true    # Cancel the future on timeout
    instances:
      paymentGateway:
        timeout-duration: 10s          # Longer timeout for payments
Usage
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @TimeLimiter(name = "paymentGateway")
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "fallback")
    public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest request) {
        // TimeLimiter requires a CompletableFuture return type
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment with timeout protection");
            return restClient.post()
                    .uri("/api/payments")
                    .body(request)
                    .retrieve()
                    .body(PaymentResult.class);
        });
    }

    private CompletableFuture<PaymentResult> fallback(PaymentRequest request, Exception ex) {
        log.error("Payment timed out or failed: {}", ex.getMessage());
        return CompletableFuture.completedFuture(
                new PaymentResult(null, PaymentStatus.FAILED, "Request timed out"));
    }
}
Why CompletableFuture: TimeLimiter needs async execution to enforce timeouts. It can't timeout synchronous blocking calls.
What cancel-running-future does: when the timeout fires, the TimeLimiter calls cancel on the returned future. Be aware that CompletableFuture.cancel does not interrupt the thread actually running the task, so the underlying work may keep going in the background; interruption only takes effect for future types that propagate it (for example, futures returned by an ExecutorService submission).
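This timeout behavior can be seen with the JDK alone: CompletableFuture.orTimeout (Java 9+) fails the future after the deadline, but the underlying task keeps running because cancellation does not interrupt it. A sketch (`TimeoutDemo` is a made-up class, not Resilience4j API):

```java
import java.util.concurrent.*;

// Bounds a slow call with a deadline; the caller gets a fallback on timeout,
// but note the timed-out task keeps running until it finishes or is interrupted.
class TimeoutDemo {
    static String callWithTimeout(ExecutorService pool, long taskMillis, long timeoutMillis) {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(taskMillis);     // stand-in for a slow remote call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "OK";
        }, pool);
        try {
            return future.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS).join();
        } catch (CompletionException e) {
            return "TIMED_OUT";               // fallback path: deadline exceeded or call failed
        }
    }
}
```

A fast task completes normally; a slow one trips the deadline and the caller takes the fallback path instead of blocking.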
Rate Limiter Pattern
The Problem It Solves
Protects downstream services from being overwhelmed:
Bug causes infinite loop calling payment gateway
↓
Thousands of requests per second
↓
Gateway overloaded, crashes
↓
All customers affected
Configuration
application.yml:
resilience4j:
  ratelimiter:
    configs:
      default:
        limit-for-period: 100      # Max 100 calls
        limit-refresh-period: 1s   # Per 1 second window
        timeout-duration: 0s       # Don't wait if limit exceeded (fail immediately)
    instances:
      paymentGateway:
        limit-for-period: 50       # Limit to 50 calls/second for gateway
What each setting means:
- limit-for-period: maximum calls allowed in the time window
- limit-refresh-period: length of the window (the counter resets after this duration)
- timeout-duration: how long to wait for a permit if the limit is reached (0 = fail immediately)
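The accounting behind these settings can be sketched without the library. `FixedWindowRateLimiter` below is invented for illustration; Resilience4j's RateLimiter is more precise and can also make callers wait for a permit:

```java
// Minimal fixed-window limiter: at most limitForPeriod permits per window,
// with the counter reset when a new window begins.
class FixedWindowRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long refreshPeriodNanos, long nowNanos) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
        this.windowStart = nowNanos;
    }

    synchronized boolean tryAcquire(long nowNanos) {
        if (nowNanos - windowStart >= refreshPeriodNanos) {
            windowStart = nowNanos;           // new window: reset the counter
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;                      // permit granted
        }
        return false;                         // over the limit: fail immediately
    }
}
```

With a limit of 2 per second, the third call in a window is rejected and the counter resets once the next window opens.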
Usage
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final RestClient restClient;

    @RateLimiter(name = "paymentGateway", fallbackMethod = "rateLimitFallback")
    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "circuitBreakerFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        // Protected by the rate limiter: max 50 calls/second
        return restClient.post()
                .uri("/api/payments")
                .body(request)
                .retrieve()
                .body(PaymentResult.class);
    }

    private PaymentResult rateLimitFallback(PaymentRequest request, RequestNotPermitted ex) {
        // Called when the rate limit is exceeded
        log.warn("Rate limit exceeded for payment gateway");
        throw new RateLimitExceededException(
                "Too many requests to payment gateway. Please try again later.");
    }

    private PaymentResult circuitBreakerFallback(PaymentRequest request, Exception ex) {
        // Called when the circuit breaker is open or the call fails
        log.error("Payment gateway unavailable: {}", ex.getMessage());
        return new PaymentResult(null, PaymentStatus.PENDING, "Payment queued");
    }
}
Multiple fallbacks: Different exceptions trigger different fallbacks. Rate limit gets specific error message, circuit breaker queues the payment.
Bulkhead Pattern
The Problem It Solves
One slow dependency exhausts all threads, blocking unrelated operations:
Payment gateway slow (taking 10s per request)
↓
All 100 threads waiting on payment gateway
↓
Customer profile API (fast, 50ms) can't get threads
↓
Entire service appears down
How Bulkheads Work
Isolate thread pools for different dependencies:
Thread Pool (100 threads total)
├─ Payment Gateway Bulkhead (10 threads)
├─ Account Service Bulkhead (10 threads)
└─ Other Operations (80 threads)
Payment gateway slow? Only 10 threads blocked.
Other operations still have 90 threads available.
Configuration
application.yml:
resilience4j:
  bulkhead:
    configs:
      default:
        max-concurrent-calls: 25   # Max 25 concurrent calls
        max-wait-duration: 0       # Don't wait if max reached
    instances:
      paymentGateway:
        max-concurrent-calls: 10   # Limit payment gateway to 10 concurrent
Semaphore Bulkhead (Simple)
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentGatewayClient {

    private final PaymentQueueService queueService;

    @Bulkhead(name = "paymentGateway", type = Bulkhead.Type.SEMAPHORE,
              fallbackMethod = "bulkheadFallback")
    public PaymentResult processPayment(PaymentRequest request) {
        // Only 10 concurrent calls allowed; the 11th is rejected immediately
        return callGateway(request); // the HTTP call shown earlier (omitted here)
    }

    private PaymentResult bulkheadFallback(PaymentRequest request, BulkheadFullException ex) {
        log.warn("Bulkhead full - too many concurrent payment gateway calls");
        // Option: queue for later processing
        queueService.enqueuePayment(request);
        return new PaymentResult(null, PaymentStatus.PENDING,
                "Payment queued due to high load");
    }
}
Semaphore vs Thread Pool: Semaphore is simpler (just counts concurrent calls). Thread pool actually isolates execution but requires async code.
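The semaphore variant boils down to a counting semaphore with a non-blocking acquire. A library-free sketch (`SemaphoreBulkhead` and its fallback supplier are invented for this example):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore bulkhead in miniature: at most maxConcurrent callers run the
// protected operation; excess callers are rejected immediately instead of
// queuing up and tying down threads.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> operation, Supplier<T> rejectedFallback) {
        if (!permits.tryAcquire()) {
            return rejectedFallback.get();    // bulkhead full: fail fast
        }
        try {
            return operation.get();
        } finally {
            permits.release();                // always return the permit
        }
    }
}
```

With one permit, a call made while the permit is held (simulated below with a nested call) is rejected, and the permit is released afterwards.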
Thread Pool Bulkhead (Stronger Isolation)
@Service
@RequiredArgsConstructor
public class PaymentGatewayClient {

    @Bulkhead(name = "paymentGateway", type = Bulkhead.Type.THREADPOOL,
              fallbackMethod = "bulkheadFallback")
    public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest request) {
        // Runs on a dedicated thread pool, isolated from the main request threads
        return CompletableFuture.supplyAsync(() -> callGateway(request));
    }

    private CompletableFuture<PaymentResult> bulkheadFallback(
            PaymentRequest request, Exception ex) {
        return CompletableFuture.completedFuture(
                new PaymentResult(null, PaymentStatus.PENDING, "Service busy"));
    }
}
Thread pool config (application.yml):
resilience4j:
  thread-pool-bulkhead:
    configs:
      default:
        core-thread-pool-size: 5   # Minimum threads
        max-thread-pool-size: 10   # Maximum threads
        queue-capacity: 20         # Queue size for waiting tasks
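The dedicated pool behaves like a bounded JDK executor: when the threads and the queue are both full, new work is rejected rather than spilling into the rest of the application. A stdlib sketch of that saturation behavior (`ThreadPoolBulkheadDemo` is a made-up name, not Resilience4j API):

```java
import java.util.concurrent.*;

// Thread-pool bulkhead in miniature: a bounded executor per dependency.
// When the pool and its queue are full, submissions fail fast with
// RejectedExecutionException instead of consuming the caller's threads.
class ThreadPoolBulkheadDemo {
    static ExecutorService boundedPool(int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded queue
                new ThreadPoolExecutor.AbortPolicy());     // reject when saturated
    }
}
```

With one thread and a queue of one, the first task occupies the thread, the second fills the queue, and the third submission is rejected immediately.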
Combining Patterns
Real production code uses multiple patterns together:
@Service
@RequiredArgsConstructor
@Slf4j
public class ResilientPaymentGatewayClient {

    private final RestClient restClient;

    @TimeLimiter(name = "paymentGateway")      // Timeout after 10s
    @Bulkhead(name = "paymentGateway",         // Limit to 10 concurrent
              type = Bulkhead.Type.THREADPOOL)
    @RateLimiter(name = "paymentGateway")      // Max 50 calls/second
    @Retry(name = "paymentGateway")            // Retry transient failures
    @CircuitBreaker(name = "paymentGateway",   // Open circuit after too many failures
                    fallbackMethod = "fallback")
    public CompletableFuture<PaymentResult> processPayment(PaymentRequest request) {
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment with full resilience stack");
            return restClient.post()
                    .uri("/api/payments")
                    .body(request)
                    .retrieve()
                    .body(PaymentResult.class);
        });
    }

    private CompletableFuture<PaymentResult> fallback(
            PaymentRequest request, Exception ex) {
        log.error("Payment processing failed after all resilience attempts", ex);
        // Determine the appropriate fallback based on exception type
        if (ex instanceof BulkheadFullException) {
            // Too many concurrent requests
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.PENDING, "System busy - queued"));
        } else if (ex instanceof RequestNotPermitted || ex instanceof CallNotPermittedException) {
            // Rate limit hit (RequestNotPermitted) or circuit open (CallNotPermittedException)
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.FAILED, "Service temporarily unavailable"));
        } else {
            // Generic failure
            return CompletableFuture.completedFuture(
                    new PaymentResult(null, PaymentStatus.FAILED, "Payment failed"));
        }
    }
}
Execution order: Resilience4j's default aspect order nests the annotations as Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( method ) ) ) ) ), regardless of the order the annotations appear in:
- Bulkhead (innermost): runs the call on the dedicated thread pool, max 10 concurrent
- TimeLimiter: enforces the 10s timeout on each attempt
- RateLimiter: checks the 50 calls/second budget
- CircuitBreaker: records each attempt's outcome and opens after too many failures
- Retry (outermost): re-invokes the whole stack on transient failures
Why this order: Each pattern protects against a different failure mode. Together they provide defense in depth. The nesting can be adjusted via the aspect-order properties if your failure semantics require it.
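The nesting can be illustrated with plain function composition: each "aspect" wraps the supplier and records when it runs, so the outermost wrapper executes first on the way in. `AspectOrderDemo` is a teaching sketch, not Resilience4j internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Each wrapper notes its name before delegating inward, so the recorded
// trace shows the nesting order of the decorated call.
class AspectOrderDemo {
    static Supplier<String> named(String name, Supplier<String> inner, List<String> trace) {
        return () -> {
            trace.add(name);                  // entering this layer
            return inner.get();
        };
    }

    static List<String> run() {
        List<String> trace = new ArrayList<>();
        Supplier<String> call = () -> { trace.add("remoteCall"); return "OK"; };
        // Build inside-out: Bulkhead is innermost, Retry is outermost
        Supplier<String> decorated =
                named("Retry",
                named("CircuitBreaker",
                named("RateLimiter",
                named("TimeLimiter",
                named("Bulkhead", call, trace), trace), trace), trace), trace);
        decorated.get();
        return trace;
    }
}
```

Running it produces the trace Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead → remoteCall, mirroring the default nesting described above.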
Graceful Degradation
When primary service fails, provide reduced functionality:
@Service
@RequiredArgsConstructor
@Slf4j
public class PaymentService {

    private final PaymentGatewayClient primaryGateway;
    private final PaymentGatewayClient secondaryGateway;
    private final PaymentQueueService queueService;
    private final PaymentCache cache;

    public PaymentResult processPayment(PaymentRequest request) {
        try {
            // Try the primary gateway
            return primaryGateway.processPayment(request).get();
        } catch (Exception ex) {
            log.warn("Primary gateway failed, trying secondary: {}", ex.getMessage());
            try {
                // Fall back to the secondary gateway
                return secondaryGateway.processPayment(request).get();
            } catch (Exception ex2) {
                log.error("Both gateways failed, queuing for later", ex2);
                // Last resort: queue for background processing
                queueService.enqueuePayment(request);
                return new PaymentResult(
                        null,
                        PaymentStatus.PENDING,
                        "Payment queued for processing");
            }
        }
    }

    public PaymentResult getPaymentStatus(String paymentId) {
        try {
            // Try a live lookup first
            return primaryGateway.getStatus(paymentId).get();
        } catch (Exception ex) {
            log.warn("Gateway unavailable, returning cached status");
            // Fall back to the cache (may be stale, but better than nothing)
            return cache.getPaymentStatus(paymentId)
                    .orElseThrow(() -> new PaymentNotFoundException(paymentId));
        }
    }
}
Degradation levels:
- Primary gateway (best)
- Secondary gateway (backup)
- Queue for later (deferred)
- Cached data (stale but available)
- Error (last resort)
Summary
Resilience Patterns:
- Circuit Breaker: Stop calling failing services, fail fast
- Retry: Automatically retry transient failures with exponential backoff
- Timeout: Don't wait forever, cancel slow operations
- Rate Limiter: Protect downstream services from overload
- Bulkhead: Isolate thread pools to prevent cascade failures
When to Use Each:
- Circuit Breaker: External service calls (API, database, cache)
- Retry: Network errors, temporary failures
- Timeout: Any external call (prevent indefinite waits)
- Rate Limiter: Calls to rate-limited APIs, protecting shared resources
- Bulkhead: Critical operations that must not block other operations
Best Practices:
- Combine patterns for defense in depth
- Always provide fallback methods
- Monitor circuit breaker state transitions
- Use exponential backoff for retries
- Set realistic timeouts (not too short, not too long)
- Test resilience with chaos engineering (simulate failures)
Cross-References:
- See Spring Boot Observability for monitoring and alerting on resilience patterns
- See Performance Optimization for optimizing system performance
- See Performance Testing for load testing resilience patterns
- See Spring Boot General for application setup and configuration
- See Microservices Architecture for distributed system patterns
- See Event-Driven Architecture for async communication patterns