Rate Limiting and Throttling
Rate limiting controls how frequently a client can make requests to your services within a specified time window. This protects your infrastructure from overload, prevents abuse, ensures fair resource allocation among users, and mitigates denial-of-service attacks. Limits that are too restrictive frustrate legitimate users; limits that are too lenient fail to protect your system.
This guide covers rate limiting algorithms, implementation strategies at different architectural layers, and how to communicate limits to clients through standardized headers.
Core Concepts
Understanding the distinction between related concepts helps you choose the right approach for your requirements:
Rate Limiting vs Throttling: Rate limiting rejects requests that exceed the allowed rate with error responses (typically HTTP 429 Too Many Requests). Throttling slows down request processing by introducing delays but still processes all requests. Rate limiting is more common because it provides clear feedback and prevents resource exhaustion.
Hard vs Soft Limits: Hard limits immediately reject requests once the threshold is exceeded. Soft limits allow temporary bursts above the limit but apply penalties (slower processing, reduced priority, warnings). Most systems use hard limits for simplicity and predictability.
Global vs Per-Resource Limits: Global limits apply to all operations from a client (e.g., 1000 requests/hour total). Per-resource limits apply to specific operations (e.g., 10 login attempts/minute, 100 search queries/minute). Combining both provides granular control - global limits prevent overall abuse while per-resource limits protect expensive operations.
Burst Allowances: Many rate limiting algorithms allow short bursts above the average rate to accommodate legitimate traffic spikes. For example, a limit of 100 requests/minute might allow 20 requests in a single second, as long as the average over the minute stays below 100.
Rate Limiting Algorithms
Different algorithms provide different trade-offs between accuracy, memory usage, burst handling, and implementation complexity. Selecting the appropriate algorithm depends on your specific requirements for precision, resource availability, and desired burst behavior.
Token Bucket
The token bucket algorithm maintains a bucket that holds tokens. Tokens are added to the bucket at a constant rate up to a maximum capacity. Each request consumes one or more tokens. If sufficient tokens are available, the request is allowed and tokens are removed. If insufficient tokens exist, the request is rejected.
This algorithm naturally allows bursts up to the bucket capacity while maintaining an average rate over time. Token bucket is the most widely used rate limiting algorithm due to its simplicity and burst-handling characteristics.
// Token bucket implementation in Java
import java.util.concurrent.atomic.AtomicLong;

public class TokenBucket {
    private final long capacity;         // Maximum tokens
    private final long refillRate;       // Tokens added per second
    private final AtomicLong tokens;     // Current token count
    private final AtomicLong lastRefill; // Last refill timestamp (nanos)

    public TokenBucket(long capacity, long refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = new AtomicLong(capacity);
        this.lastRefill = new AtomicLong(System.nanoTime());
    }

    public boolean tryConsume(long tokensToConsume) {
        refill();
        // Retry the compare-and-swap so a concurrent update by another
        // thread doesn't spuriously reject a request while tokens remain
        while (true) {
            long currentTokens = tokens.get();
            if (currentTokens < tokensToConsume) {
                return false; // Insufficient tokens
            }
            if (tokens.compareAndSet(currentTokens, currentTokens - tokensToConsume)) {
                return true;
            }
        }
    }

    private void refill() {
        long now = System.nanoTime();
        long lastRefillTime = lastRefill.get();
        // Calculate tokens to add based on elapsed time
        long elapsedNanos = now - lastRefillTime;
        long tokensToAdd = (elapsedNanos * refillRate) / 1_000_000_000L;
        if (tokensToAdd > 0) {
            long currentTokens = tokens.get();
            long newTokens = Math.min(capacity, currentTokens + tokensToAdd);
            if (tokens.compareAndSet(currentTokens, newTokens)) {
                lastRefill.set(now);
            }
        }
    }

    public long availableTokens() {
        refill();
        return tokens.get();
    }
}
// Usage in a service
@Service
public class RateLimitedApiService {
    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private static final long RATE_LIMIT = 100; // 100 requests
    private static final long TIME_WINDOW = 60; // per 60 seconds

    public boolean allowRequest(String userId) {
        // Note: integer division truncates (100 / 60 = 1 token/second,
        // i.e. 60/minute); use a fractional refill rate for exact limits
        TokenBucket bucket = buckets.computeIfAbsent(userId,
            key -> new TokenBucket(RATE_LIMIT, RATE_LIMIT / TIME_WINDOW)
        );
        return bucket.tryConsume(1);
    }
}
Token Bucket Characteristics:
- Allows bursts: Clients can consume all available tokens immediately for burst traffic
- Smooth average rate: Refill rate ensures long-term average doesn't exceed limit
- Memory efficient: Only stores current token count and last refill time per key
- Simple implementation: Straightforward logic that's easy to understand and debug
- Most common: Used by AWS API Gateway, Stripe API, many other services
Token bucket is ideal when you want to allow legitimate bursts (e.g., page load making multiple API calls) while preventing sustained abuse.
Leaky Bucket
The leaky bucket algorithm processes requests at a constant rate regardless of arrival pattern. Requests are added to a queue (the bucket), and processed at a fixed rate (the leak). If the queue is full, new requests are rejected. This smooths out bursts by enforcing a consistent processing rate.
// Leaky bucket implementation in TypeScript
class LeakyBucket {
  private queue: Array<() => Promise<void>> = [];
  private processing = false;

  constructor(
    private readonly capacity: number, // Maximum queue size
    private readonly leakRate: number  // Requests processed per second
  ) {}

  async addRequest(request: () => Promise<void>): Promise<boolean> {
    if (this.queue.length >= this.capacity) {
      return false; // Bucket is full, reject request
    }
    this.queue.push(request);
    void this.processQueue(); // Fire and forget; drains in the background
    return true;
  }

  private async processQueue(): Promise<void> {
    if (this.processing || this.queue.length === 0) {
      return;
    }
    this.processing = true;
    while (this.queue.length > 0) {
      const request = this.queue.shift()!;
      try {
        await request();
      } catch (error) {
        console.error('Request processing failed:', error);
      }
      // Wait for leak rate interval before processing next request
      const intervalMs = 1000 / this.leakRate;
      await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
    this.processing = false;
  }

  getQueueSize(): number {
    return this.queue.length;
  }
}
// Usage example
const bucket = new LeakyBucket(50, 10); // 50 capacity, 10 req/sec

async function handleApiRequest(userId: string, request: () => Promise<void>) {
  const allowed = await bucket.addRequest(request);
  if (!allowed) {
    // TooManyRequestsError is an application-defined error mapped to HTTP 429
    throw new TooManyRequestsError('Rate limit exceeded, queue full');
  }
}
Leaky Bucket Characteristics:
- Constant output rate: Processes requests at fixed rate regardless of input
- Smooths bursts: Queues burst traffic and processes at steady rate
- No immediate bursts: Unlike token bucket, can't immediately process multiple requests
- Queue overhead: Requires queue memory proportional to capacity
- Fairness: FIFO processing ensures fair ordering
Leaky bucket is appropriate when you need consistent, predictable load on downstream services and can tolerate queueing delay.
Fixed Window
Fixed window divides time into fixed intervals (windows) and counts requests per window. Each window has an independent counter that resets at window boundaries. This is the simplest rate limiting algorithm but has edge case issues.
// Fixed window counter using Redis
@Service
public class FixedWindowRateLimiter {
    private final RedisTemplate<String, String> redisTemplate;
    private static final long WINDOW_SIZE_SECONDS = 60;
    private static final long MAX_REQUESTS = 100;

    public FixedWindowRateLimiter(RedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public boolean allowRequest(String userId) {
        long currentWindow = System.currentTimeMillis() / 1000 / WINDOW_SIZE_SECONDS;
        String key = String.format("rate_limit:%s:%d", userId, currentWindow);
        // Increment counter for current window
        Long requests = redisTemplate.opsForValue().increment(key);
        if (requests == 1) {
            // First request in this window, set expiration
            redisTemplate.expire(key, Duration.ofSeconds(WINDOW_SIZE_SECONDS * 2));
        }
        return requests <= MAX_REQUESTS;
    }

    public long getRemainingRequests(String userId) {
        long currentWindow = System.currentTimeMillis() / 1000 / WINDOW_SIZE_SECONDS;
        String key = String.format("rate_limit:%s:%d", userId, currentWindow);
        // Read once to avoid racing between the null check and the parse
        String value = redisTemplate.opsForValue().get(key);
        long requests = value != null ? Long.parseLong(value) : 0L;
        return Math.max(0, MAX_REQUESTS - requests);
    }
}
Fixed Window Problem: Users can make 200 requests in 2 seconds by making 100 requests at the end of window 1 (11:59:59) and 100 requests at the start of window 2 (12:00:00). This violates the intended rate limit of 100 requests per minute.
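The boundary problem can be reproduced with a toy fixed-window counter (a sketch; time is passed in explicitly as seconds so the scenario is deterministic):

```typescript
// Toy fixed-window counter: window index = floor(time / windowSize)
class FixedWindowCounter {
  private counts = new Map<number, number>();

  constructor(
    private readonly limit: number,
    private readonly windowSeconds: number
  ) {}

  allow(nowSeconds: number): boolean {
    const window = Math.floor(nowSeconds / this.windowSeconds);
    const used = this.counts.get(window) ?? 0;
    if (used >= this.limit) return false;
    this.counts.set(window, used + 1);
    return true;
  }
}

// 100 requests at t=59 (end of window 0) and 100 at t=60 (start of
// window 1) all succeed: 200 requests in 2 seconds despite a
// 100-per-minute limit
const counter = new FixedWindowCounter(100, 60);
let allowed = 0;
for (let i = 0; i < 100; i++) if (counter.allow(59)) allowed++;
for (let i = 0; i < 100; i++) if (counter.allow(60)) allowed++;
// allowed is now 200
```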
Fixed Window Characteristics:
- Simple implementation: Single counter per window, minimal memory
- Low computational cost: Just increment and compare
- Edge case issues: Double rate at window boundaries
- Acceptable for coarse limits: Works well when precision isn't critical
Fixed window is suitable for coarse-grained rate limiting where the boundary edge case is acceptable, or when simplicity is more important than precision.
Sliding Window Log
Sliding window log maintains a timestamp log of all requests within the time window. For each new request, it removes timestamps outside the current window and checks if the remaining count exceeds the limit. This provides precise rate limiting without edge cases but requires significant memory.
// Sliding window log implementation
class SlidingWindowLog {
  private requestLogs: Map<string, number[]> = new Map();

  constructor(
    private readonly maxRequests: number,
    private readonly windowMs: number
  ) {}

  allowRequest(userId: string): boolean {
    const now = Date.now();
    const windowStart = now - this.windowMs;
    // Get existing log or create new one
    let log = this.requestLogs.get(userId) || [];
    // Remove timestamps outside current window
    log = log.filter(timestamp => timestamp > windowStart);
    if (log.length >= this.maxRequests) {
      this.requestLogs.set(userId, log);
      return false; // Rate limit exceeded
    }
    // Add current request timestamp
    log.push(now);
    this.requestLogs.set(userId, log);
    return true;
  }

  getRemainingRequests(userId: string): number {
    const now = Date.now();
    const windowStart = now - this.windowMs;
    let log = this.requestLogs.get(userId) || [];
    log = log.filter(timestamp => timestamp > windowStart);
    return Math.max(0, this.maxRequests - log.length);
  }

  // Cleanup old entries periodically
  cleanup(): void {
    const now = Date.now();
    const windowStart = now - this.windowMs;
    for (const [userId, log] of this.requestLogs.entries()) {
      const filtered = log.filter(timestamp => timestamp > windowStart);
      if (filtered.length === 0) {
        this.requestLogs.delete(userId);
      } else {
        this.requestLogs.set(userId, filtered);
      }
    }
  }
}
// Redis implementation for distributed scenarios
class RedisSlidingWindowLog {
  constructor(
    private readonly redis: Redis,
    private readonly maxRequests: number,
    private readonly windowMs: number
  ) {}

  async allowRequest(userId: string): Promise<boolean> {
    const key = `rate_limit:log:${userId}`;
    const now = Date.now();
    const windowStart = now - this.windowMs;
    // Note: these three commands are not atomic; under heavy concurrency,
    // wrap them in a MULTI/EXEC transaction or a Lua script
    await this.redis.zremrangebyscore(key, '-inf', windowStart);
    const count = await this.redis.zcard(key);
    if (count >= this.maxRequests) {
      return false;
    }
    // Add current timestamp (score and member are the same)
    await this.redis.zadd(key, now, `${now}`);
    await this.redis.pexpire(key, this.windowMs);
    return true;
  }
}
Sliding Window Log Characteristics:
- Perfect precision: No edge case issues, accurate to the millisecond
- High memory usage: Stores timestamp for every request in the window
- Scales poorly: Memory usage grows with request volume
- Distributed complexity: Requires synchronized log storage (Redis sorted sets)
Sliding window log is appropriate for critical rate limits where precision is essential and request volumes are manageable (login attempts, password resets, sensitive operations).
Sliding Window Counter
Sliding window counter approximates sliding window log with much lower memory usage. It uses two fixed windows (current and previous) and interpolates between them based on the current position within the window.
The formula estimates requests in the sliding window as:
requests = (previous_window_count * overlap_percentage) + current_window_count
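A quick worked example of the interpolation (hypothetical numbers): with a 60-second window, 84 requests in the previous window, 36 so far in the current window, and the current time 30% of the way into the current window, the overlap percentage is 0.7:

```typescript
// Interpolated estimate for the sliding window counter
function estimateSlidingCount(
  previousCount: number,
  currentCount: number,
  windowPosition: number // fraction of the current window elapsed, 0.0-1.0
): number {
  // (1 - windowPosition) of the previous window still overlaps the
  // sliding window
  return previousCount * (1 - windowPosition) + currentCount;
}

// 84 * 0.7 + 36 = 94.8 -> under a limit of 100, the request is allowed
const estimate = estimateSlidingCount(84, 36, 0.3);
```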
// Sliding window counter implementation
@Service
public class SlidingWindowCounter {
    private final RedisTemplate<String, String> redisTemplate;
    private static final long WINDOW_SIZE_SECONDS = 60;
    private static final long MAX_REQUESTS = 100;

    public SlidingWindowCounter(RedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public boolean allowRequest(String userId) {
        long now = System.currentTimeMillis() / 1000;
        long currentWindow = now / WINDOW_SIZE_SECONDS;
        long previousWindow = currentWindow - 1;
        String currentKey = String.format("rate_limit:%s:%d", userId, currentWindow);
        String previousKey = String.format("rate_limit:%s:%d", userId, previousWindow);
        // Get counts from both windows (note: rejected requests still
        // increment the counter, which makes the estimate conservative)
        long currentCount = increment(currentKey);
        long previousCount = getCount(previousKey);
        // Calculate position within current window (0.0 to 1.0)
        double windowPosition = (double) (now % WINDOW_SIZE_SECONDS) / WINDOW_SIZE_SECONDS;
        // Estimate total requests in sliding window
        double estimatedCount = (previousCount * (1 - windowPosition)) + currentCount;
        return estimatedCount <= MAX_REQUESTS;
    }

    private long increment(String key) {
        Long count = redisTemplate.opsForValue().increment(key);
        if (count == 1) {
            redisTemplate.expire(key, Duration.ofSeconds(WINDOW_SIZE_SECONDS * 2));
        }
        return count;
    }

    private long getCount(String key) {
        String value = redisTemplate.opsForValue().get(key);
        return value != null ? Long.parseLong(value) : 0L;
    }

    public RateLimitInfo getRateLimitInfo(String userId) {
        long now = System.currentTimeMillis() / 1000;
        long currentWindow = now / WINDOW_SIZE_SECONDS;
        long previousWindow = currentWindow - 1;
        String currentKey = String.format("rate_limit:%s:%d", userId, currentWindow);
        String previousKey = String.format("rate_limit:%s:%d", userId, previousWindow);
        long currentCount = getCount(currentKey);
        long previousCount = getCount(previousKey);
        double windowPosition = (double) (now % WINDOW_SIZE_SECONDS) / WINDOW_SIZE_SECONDS;
        double estimatedCount = (previousCount * (1 - windowPosition)) + currentCount;
        long remaining = Math.max(0, MAX_REQUESTS - (long) Math.ceil(estimatedCount));
        long resetTime = (currentWindow + 1) * WINDOW_SIZE_SECONDS;
        return new RateLimitInfo(MAX_REQUESTS, remaining, resetTime);
    }
}

record RateLimitInfo(long limit, long remaining, long reset) {}
Sliding Window Counter Characteristics:
- Good precision: Much better than fixed window, slight approximation vs log
- Memory efficient: Only two counters per user
- No edge cases: Smooth behavior at window boundaries
- Widely used: Good balance of accuracy and efficiency
Sliding window counter is the recommended algorithm for most use cases - it provides excellent precision with minimal overhead.
Implementation Strategies
Rate limiting can be implemented at different layers of your architecture. The choice depends on your requirements for centralization, latency tolerance, and infrastructure complexity.
API Gateway Rate Limiting
Implementing rate limiting at the API gateway provides centralized control and prevents rate-limited requests from reaching application servers. This is the most efficient approach for protecting infrastructure.
# Spring Cloud Gateway rate limiting configuration
spring:
  cloud:
    gateway:
      routes:
        - id: api_route
          uri: lb://backend-service
          predicates:
            - Path=/api/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 10   # Tokens per second
                redis-rate-limiter.burstCapacity: 20   # Maximum burst
                redis-rate-limiter.requestedTokens: 1  # Tokens per request
                key-resolver: "#{@userKeyResolver}"    # Extract user ID
// Custom key resolver for extracting the rate limit key
@Component
public class UserKeyResolver implements KeyResolver {
    @Override
    public Mono<String> resolve(ServerWebExchange exchange) {
        // Extract user ID from JWT token or API key
        return exchange.getPrincipal()
            .map(Principal::getName)
            .defaultIfEmpty("anonymous");
    }
}
Gateway Rate Limiting Advantages:
- Early rejection: Blocks requests before they consume application resources
- Centralized configuration: Single place to manage all rate limits
- Consistent enforcement: Same limits across all backend instances
- Infrastructure protection: Prevents DDoS from reaching applications
Gateway Rate Limiting Considerations:
- Single point of failure: Gateway outage affects all traffic
- Limited context: May not have business logic context for nuanced limits
- Coordination overhead: Distributed gateways need shared state (Redis)
Application-Level Rate Limiting
Implementing rate limiting within the application provides access to business context and allows fine-grained per-operation limits. This complements gateway-level limits for additional protection.
// Spring Boot application-level rate limiting
@RestController
@RequestMapping("/api")
public class UserController {
    private final RateLimiter rateLimiter;
    private final UserService userService;

    public UserController(RateLimiter rateLimiter, UserService userService) {
        this.rateLimiter = rateLimiter;
        this.userService = userService;
    }

    @GetMapping("/users/{id}")
    public ResponseEntity<User> getUser(@PathVariable String id, HttpServletRequest request) {
        String userId = extractUserId(request);
        // Check rate limit
        if (!rateLimiter.allowRequest(userId, "get_user")) {
            RateLimitInfo info = rateLimiter.getRateLimitInfo(userId, "get_user");
            return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS)
                .header("X-RateLimit-Limit", String.valueOf(info.limit()))
                .header("X-RateLimit-Remaining", "0")
                .header("X-RateLimit-Reset", String.valueOf(info.reset()))
                .header("Retry-After", String.valueOf(info.reset() - System.currentTimeMillis() / 1000))
                .build();
        }
        User user = userService.getUser(id);
        RateLimitInfo info = rateLimiter.getRateLimitInfo(userId, "get_user");
        return ResponseEntity.ok()
            .header("X-RateLimit-Limit", String.valueOf(info.limit()))
            .header("X-RateLimit-Remaining", String.valueOf(info.remaining()))
            .header("X-RateLimit-Reset", String.valueOf(info.reset()))
            .body(user);
    }

    // More restrictive limit for expensive operations
    @PostMapping("/users/{id}/reports")
    public ResponseEntity<Report> generateReport(@PathVariable String id, HttpServletRequest request) {
        String userId = extractUserId(request);
        // Different limit for expensive operation
        if (!rateLimiter.allowRequest(userId, "generate_report")) {
            throw new RateLimitExceededException("Report generation limit exceeded");
        }
        Report report = userService.generateReport(id);
        return ResponseEntity.ok(report);
    }
}

// Exception handler for rate limit violations
@ControllerAdvice
public class RateLimitExceptionHandler {
    @ExceptionHandler(RateLimitExceededException.class)
    public ResponseEntity<ErrorResponse> handleRateLimit(RateLimitExceededException ex) {
        return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS)
            .body(new ErrorResponse(
                "RATE_LIMIT_EXCEEDED",
                ex.getMessage(),
                "Please wait before making additional requests"
            ));
    }
}
Application-Level Advantages:
- Business context: Access to user roles, subscription tiers, operation types
- Granular control: Different limits per endpoint, user type, or operation
- Flexible logic: Custom rate limit rules based on complex criteria
- Graceful degradation: Can queue or defer rather than reject
Application-Level Considerations:
- Resource consumption: Rate-limited requests still consume network and parsing resources
- Consistency: Must coordinate across multiple instances (requires distributed state)
- Implementation overhead: More code to maintain vs gateway configuration
Distributed Rate Limiting
When running multiple application instances, rate limiting requires shared state to enforce consistent limits across all instances. Redis is the most common solution for distributed rate limiting.
// Distributed rate limiter using Redis Lua scripts for atomicity
import Redis from 'ioredis';

class DistributedRateLimiter {
  private redis: Redis;

  // Lua script for atomic sliding window counter
  private slidingWindowScript = `
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])

    local current_window = math.floor(now / window)
    local previous_window = current_window - 1
    local current_key = key .. ":" .. current_window
    local previous_key = key .. ":" .. previous_window

    local current_count = tonumber(redis.call("GET", current_key) or "0")
    local previous_count = tonumber(redis.call("GET", previous_key) or "0")

    local window_position = (now % window) / window
    local estimated_count = math.floor((previous_count * (1 - window_position)) + current_count)

    if estimated_count >= limit then
      return {0, limit, 0, current_window + 1}
    end

    redis.call("INCR", current_key)
    redis.call("EXPIRE", current_key, window * 2)
    local remaining = limit - (current_count + 1)
    return {1, limit, remaining, current_window + 1}
  `;

  constructor() {
    this.redis = new Redis({
      host: process.env.REDIS_HOST,
      port: parseInt(process.env.REDIS_PORT || '6379', 10),
      maxRetriesPerRequest: 3
    });
    this.redis.defineCommand('slidingWindowLimit', {
      numberOfKeys: 1,
      lua: this.slidingWindowScript
    });
  }

  async checkLimit(
    userId: string,
    limit: number,
    windowSeconds: number
  ): Promise<RateLimitResult> {
    const key = `rate_limit:${userId}`;
    const now = Math.floor(Date.now() / 1000);
    try {
      // @ts-ignore - Custom command defined above
      const result = await this.redis.slidingWindowLimit(
        key,
        now,
        windowSeconds,
        limit
      );
      const [allowed, totalLimit, remaining, resetWindow] = result;
      return {
        allowed: allowed === 1,
        limit: totalLimit,
        remaining: remaining,
        reset: resetWindow * windowSeconds
      };
    } catch (error) {
      // Fail open: allow request if Redis is unavailable
      console.error('Rate limiter Redis error:', error);
      return {
        allowed: true,
        limit: limit,
        remaining: limit,
        reset: now + windowSeconds
      };
    }
  }
}

interface RateLimitResult {
  allowed: boolean;
  limit: number;
  remaining: number;
  reset: number;
}
// Express middleware using the distributed rate limiter
// (Request, Response, NextFunction are imported from 'express';
// req.user is assumed to be populated by an auth middleware)
function rateLimitMiddleware(limiter: DistributedRateLimiter) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const userId = req.user?.id || req.ip;
    const result = await limiter.checkLimit(userId, 100, 60);
    // Set rate limit headers
    res.set('X-RateLimit-Limit', result.limit.toString());
    res.set('X-RateLimit-Remaining', result.remaining.toString());
    res.set('X-RateLimit-Reset', result.reset.toString());
    if (!result.allowed) {
      const retryAfter = result.reset - Math.floor(Date.now() / 1000);
      res.set('Retry-After', retryAfter.toString());
      return res.status(429).json({
        error: 'Too Many Requests',
        message: 'Rate limit exceeded',
        retryAfter: retryAfter
      });
    }
    next();
  };
}
Distributed Rate Limiting Best Practices:
- Use Lua scripts: Ensure atomic operations in Redis (no race conditions)
- Fail open: If Redis is unavailable, allow requests rather than blocking all traffic
- Connection pooling: Reuse Redis connections to minimize overhead
- Monitoring: Track Redis latency, connection failures, and failover scenarios
- Backup strategy: Consider local rate limiting as fallback if distributed state unavailable
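The fallback idea in the last point can be sketched as a wrapper that degrades to a per-instance in-memory limiter when the shared store is unreachable (a sketch; the `Limiter` interface and class names here are hypothetical):

```typescript
interface Limiter {
  checkLimit(key: string, limit: number, windowSeconds: number): Promise<boolean>;
}

// Per-instance fixed-window fallback: only approximate across instances,
// but keeps some protection when the shared Redis state is unavailable
class LocalFallbackLimiter implements Limiter {
  private counts = new Map<string, { window: number; used: number }>();

  async checkLimit(key: string, limit: number, windowSeconds: number): Promise<boolean> {
    const window = Math.floor(Date.now() / 1000 / windowSeconds);
    const entry = this.counts.get(key);
    if (!entry || entry.window !== window) {
      this.counts.set(key, { window, used: 1 }); // new window, first request
      return true;
    }
    if (entry.used >= limit) return false;
    entry.used++;
    return true;
  }
}

// Wrapper: try the distributed limiter first, degrade to the local one
class ResilientLimiter implements Limiter {
  constructor(
    private readonly primary: Limiter,
    private readonly fallback: Limiter
  ) {}

  async checkLimit(key: string, limit: number, windowSeconds: number): Promise<boolean> {
    try {
      return await this.primary.checkLimit(key, limit, windowSeconds);
    } catch {
      return this.fallback.checkLimit(key, limit, windowSeconds);
    }
  }
}
```

Because each instance counts independently during a Redis outage, the effective global limit temporarily becomes limit-per-instance; that trade-off is usually preferable to blocking all traffic.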
Database-Level Rate Limiting
For some scenarios, rate limiting can be enforced at the database level using row locks or periodic cleanup of rate limit tables. This is less common but useful when database is already the bottleneck.
-- Rate limit tracking table
CREATE TABLE rate_limits (
    user_id VARCHAR(255),
    window_start TIMESTAMP,
    request_count INTEGER,
    PRIMARY KEY (user_id, window_start)
);

CREATE INDEX idx_rate_limits_window ON rate_limits(window_start);

-- Periodic cleanup of old windows
DELETE FROM rate_limits
WHERE window_start < NOW() - INTERVAL '1 hour';
Database-level rate limiting is generally avoided because:
- Adds latency to every request
- Increases database load (defeating the purpose of rate limiting)
- Complex distributed coordination
- Better handled at application or gateway layer
Use database rate limiting only when rate limit state must be persisted for compliance or audit purposes, or when database is already the synchronization point.
Per-User vs Per-IP vs Per-API Key Limiting
Different rate limit keys serve different purposes and protect against different attack vectors. Most systems combine multiple strategies.
Per-User Rate Limiting
Rate limiting authenticated users by user ID provides fair resource allocation and prevents abuse from individual accounts. This is the primary rate limiting strategy for authenticated APIs.
@Component
public class UserRateLimitKeyResolver implements RateLimitKeyResolver {
    // jwtParser is a configured JWT parser (e.g. from the jjwt library)
    private final JwtParser jwtParser;

    public UserRateLimitKeyResolver(JwtParser jwtParser) {
        this.jwtParser = jwtParser;
    }

    @Override
    public String resolveKey(HttpServletRequest request) {
        // Extract user ID from JWT token
        String token = request.getHeader("Authorization");
        if (token != null && token.startsWith("Bearer ")) {
            Claims claims = jwtParser.parseClaimsJws(token.substring(7)).getBody();
            return "user:" + claims.getSubject();
        }
        // Fallback to IP for unauthenticated requests
        return "ip:" + getClientIp(request);
    }
}
Per-User Rate Limiting Characteristics:
- Fair allocation: Each user gets their own quota
- Account-based: Limits follow the user across devices/IPs
- Subscription tiers: Can vary limits by user plan (free, premium, enterprise)
- Doesn't prevent account creation abuse: Attackers can create many accounts
Per-IP Rate Limiting
Rate limiting by IP address protects against DDoS attacks and brute force attempts from specific sources. This is essential for public endpoints and unauthenticated traffic.
function getClientIp(req: Request): string {
  // Check X-Forwarded-For header (set by load balancer/proxy)
  const forwarded = req.headers['x-forwarded-for'];
  if (forwarded) {
    // The header may be a string or string[]; take the first (client) IP,
    // not the proxy IPs appended after it
    const value = Array.isArray(forwarded) ? forwarded[0] : forwarded;
    return value.split(',')[0].trim();
  }
  // Check X-Real-IP header
  const realIp = req.headers['x-real-ip'];
  if (realIp) {
    return Array.isArray(realIp) ? realIp[0] : realIp;
  }
  // Fallback to the connection's remote address
  return req.socket.remoteAddress || '';
}

async function rateLimitByIp(req: Request, res: Response, next: NextFunction) {
  const ip = getClientIp(req);
  const allowed = await rateLimiter.checkLimit(`ip:${ip}`, 1000, 3600); // 1000/hour
  if (!allowed) {
    return res.status(429).json({ error: 'Too many requests from this IP' });
  }
  next();
}
Per-IP Rate Limiting Considerations:
- NAT/proxy issues: Multiple users behind corporate NAT share IP
- IPv6 challenges: Users may have many IPv6 addresses
- VPN circumvention: Attackers can rotate IPs via VPN/proxy services
- CDN/proxy detection: Must extract real client IP from headers (X-Forwarded-For)
Best Practice: Combine per-IP and per-user rate limiting - stricter limits for unauthenticated (per-IP), more generous limits for authenticated (per-user).
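This combined strategy can be sketched as layered checks (hypothetical limits and key prefixes; `check` stands in for any of the limiter implementations above):

```typescript
// check(key, limit, windowSeconds) -> whether the request is allowed
type Check = (key: string, limit: number, windowSeconds: number) => boolean;

function allowRequest(check: Check, userId: string | null, ip: string): boolean {
  // Everyone passes through a broad per-IP limit first (DDoS guard)
  if (!check(`ip:${ip}`, 1000, 3600)) return false;
  // Authenticated users then get their own, more generous per-user quota
  if (userId !== null) return check(`user:${userId}`, 5000, 3600);
  // Unauthenticated traffic gets a stricter per-IP quota
  return check(`anon:${ip}`, 100, 3600);
}
```

Keying the layers separately means an abusive anonymous client exhausts only the strict `anon:` quota, while authenticated users sharing the same NAT IP keep their own allowance.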
Per-API Key Rate Limiting
For APIs consumed by other services, rate limiting by API key provides clear quotas per integration. This is standard for public APIs offered to partners or customers.
@Component
public class ApiKeyRateLimiter {
    private final RateLimiter rateLimiter;
    private final ApiKeyRepository apiKeyRepository;

    public ApiKeyRateLimiter(RateLimiter rateLimiter, ApiKeyRepository apiKeyRepository) {
        this.rateLimiter = rateLimiter;
        this.apiKeyRepository = apiKeyRepository;
    }

    public boolean checkRateLimit(String apiKey) {
        // Look up API key details (includes tier/plan)
        ApiKeyInfo keyInfo = apiKeyRepository.findByKey(apiKey)
            .orElseThrow(() -> new UnauthorizedException("Invalid API key"));
        // Get rate limit based on subscription tier
        RateLimitConfig config = getRateLimitForTier(keyInfo.getTier());
        String limitKey = "apikey:" + apiKey;
        return rateLimiter.allowRequest(limitKey, config.limit(), config.windowSeconds());
    }

    private RateLimitConfig getRateLimitForTier(SubscriptionTier tier) {
        return switch (tier) {
            case FREE -> new RateLimitConfig(100, 3600);          // 100/hour
            case BASIC -> new RateLimitConfig(1000, 3600);        // 1,000/hour
            case PREMIUM -> new RateLimitConfig(10000, 3600);     // 10,000/hour
            case ENTERPRISE -> new RateLimitConfig(100000, 3600); // 100,000/hour
        };
    }
}

record RateLimitConfig(long limit, long windowSeconds) {}
Per-API Key Advantages:
- Clear quotas: Customers know their allocation
- Billing integration: Can tie rate limits to pricing tiers
- Granular tracking: Monitor usage per customer/integration
- Prevents abuse: Revoke keys for violators without affecting others
Burst Allowances and Gradual Backoff
Simply rejecting requests at hard limits can degrade user experience. Burst allowances and gradual backoff provide smoother behavior.
Burst Allowances
Burst allowances permit short-term spikes above average rate while maintaining long-term limits. Token bucket naturally provides this; fixed/sliding windows need explicit burst handling.
public class BurstAwareRateLimiter {
    private final long sustainedRate; // Sustained requests per second
    private final long burstRate;     // Peak requests per second
    private final long burstDuration; // How long a burst can sustain (seconds)

    public BurstAwareRateLimiter(long sustainedRate, long burstRate, long burstDuration) {
        this.sustainedRate = sustainedRate;
        this.burstRate = burstRate;
        this.burstDuration = burstDuration;
    }

    public boolean allowRequest(String userId) {
        // Short window for burst detection (1 second)
        boolean burstAllowed = checkLimit(userId, "burst", burstRate, 1);
        if (!burstAllowed) {
            return false; // Exceeds even the burst rate
        }
        // Long window for sustained rate (1 minute)
        boolean sustainedAllowed = checkLimit(userId, "sustained", sustainedRate * 60, 60);
        if (!sustainedAllowed) {
            // Over sustained rate but under burst rate:
            // allow only if burst quota is still available
            return checkBurstQuota(userId);
        }
        return true;
    }

    private boolean checkBurstQuota(String userId) {
        // Track how long the user has been bursting; deny if too long
        Long burstStart = getBurstStartTime(userId);
        if (burstStart != null) {
            long burstSeconds = (System.currentTimeMillis() / 1000) - burstStart;
            if (burstSeconds > burstDuration) {
                return false; // Burst duration exceeded
            }
        } else {
            setBurstStartTime(userId, System.currentTimeMillis() / 1000);
        }
        return true;
    }

    // checkLimit, getBurstStartTime, and setBurstStartTime delegate to the
    // underlying rate limiter and state store (implementations omitted)
}
Burst allowances are essential for good user experience when legitimate usage patterns include spikes (page loads, batch operations).
Gradual Backoff
Instead of hard rejection, gradual backoff increases delay or reduces quality as clients approach limits. This provides smoother degradation.
class GradualBackoffRateLimiter {
  // rateLimiter is any limiter exposing getRateLimitInfo(userId)
  // with { limit, remaining } fields
  constructor(private readonly rateLimiter: RateLimiter) {}

  async handleRequest(userId: string, request: () => Promise<any>): Promise<any> {
    const usage = await this.getUsagePercentage(userId);
    if (usage >= 1.0) {
      // Hard limit reached
      throw new RateLimitError('Rate limit exceeded');
    }
    if (usage >= 0.9) {
      // 90-100%: Significant delay
      await this.delay(2000);
    } else if (usage >= 0.75) {
      // 75-90%: Moderate delay
      await this.delay(1000);
    } else if (usage >= 0.5) {
      // 50-75%: Small delay
      await this.delay(500);
    }
    // Under 50%: No delay
    return request();
  }

  private async getUsagePercentage(userId: string): Promise<number> {
    const info = await this.rateLimiter.getRateLimitInfo(userId);
    return (info.limit - info.remaining) / info.limit;
  }

  private delay(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
Gradual backoff works well for scenarios where some service is better than no service, but can complicate client-side retry logic.
Rate Limit Headers
Standardized HTTP headers communicate rate limit status to clients, enabling them to self-regulate and avoid hitting limits.
Standard Rate Limit Headers
The X-RateLimit-* headers below are a widely adopted de facto convention; the IETF draft "RateLimit header fields for HTTP" standardizes equivalent headers without the X- prefix. Three headers carry the core rate limit information:
X-RateLimit-Limit: 100 # Total requests allowed in window
X-RateLimit-Remaining: 73 # Requests remaining in current window
X-RateLimit-Reset: 1640000000 # Unix timestamp when window resets
Some APIs also include additional headers:
X-RateLimit-Policy: 100;w=60 # Policy description (100 per 60 seconds)
Retry-After: 47 # Seconds until client can retry (after 429)
@Component
public class RateLimitHeaderInterceptor implements HandlerInterceptor {
    private final RateLimiter rateLimiter;

    public RateLimitHeaderInterceptor(RateLimiter rateLimiter) {
        this.rateLimiter = rateLimiter;
    }

    @Override
    public boolean preHandle(HttpServletRequest request,
                             HttpServletResponse response,
                             Object handler) throws Exception {
        String userId = extractUserId(request);
        RateLimitInfo info = rateLimiter.checkLimit(userId);
        // Always include rate limit headers (even on successful requests)
        response.setHeader("X-RateLimit-Limit", String.valueOf(info.limit()));
        response.setHeader("X-RateLimit-Remaining", String.valueOf(info.remaining()));
        response.setHeader("X-RateLimit-Reset", String.valueOf(info.reset()));
        if (!info.allowed()) {
            long retryAfter = info.reset() - (System.currentTimeMillis() / 1000);
            response.setHeader("Retry-After", String.valueOf(retryAfter));
            response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
            response.setContentType("application/json");
            response.getWriter().write(
                "{\"error\":\"Rate limit exceeded\",\"retryAfter\":" + retryAfter + "}"
            );
            return false;
        }
        return true;
    }
}
Why Rate Limit Headers Matter:
- Client self-regulation: Clients can slow down before hitting limits
- Better error handling: Clients know when to retry
- Transparency: Users understand their quota usage
- Debugging: Easier to diagnose rate limiting issues
Always include rate limit headers in responses, not just when rate limits are exceeded.
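On the client side, these headers enable proactive pacing. A sketch of the idea, using the header names above; the even-spreading heuristic is illustrative, not a standard algorithm:

```typescript
// Given rate limit header values, compute how long a client should wait
// before its next request so remaining quota is spread evenly across the
// rest of the window. Returns a delay in milliseconds.
function paceFromHeaders(
  limitHeader: string | null,
  remainingHeader: string | null,
  resetHeader: string | null, // Unix timestamp (seconds) when window resets
  nowSeconds: number
): number {
  if (!limitHeader || !remainingHeader || !resetHeader) return 0; // no info: don't pace
  const remaining = parseInt(remainingHeader, 10);
  const reset = parseInt(resetHeader, 10);
  const windowLeft = Math.max(0, reset - nowSeconds);
  if (remaining <= 0) return windowLeft * 1000; // quota exhausted: wait for reset
  // Spread remaining requests evenly over the rest of the window
  return (windowLeft / remaining) * 1000;
}
```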
HTTP 429 Too Many Requests
When rate limits are exceeded, return HTTP 429 status code with clear error messages and retry guidance.
interface RateLimitErrorResponse {
    error: string;
    message: string;
    retryAfter: number;      // Seconds until retry allowed
    limit: number;           // Total requests allowed
    window: number;          // Window size in seconds
    documentation?: string;  // Link to rate limit docs
}

function createRateLimitResponse(info: RateLimitInfo): RateLimitErrorResponse {
    return {
        error: 'RATE_LIMIT_EXCEEDED',
        message: 'You have exceeded your rate limit. Please wait before making additional requests.',
        retryAfter: info.reset - Math.floor(Date.now() / 1000),
        limit: info.limit,
        window: 60, // example: a 60-second window
        documentation: 'https://docs.example.com/api/rate-limits'
    };
}
Provide actionable error responses that help developers understand and resolve the issue.
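One detail worth handling on the consuming side: per RFC 9110, Retry-After may carry either a number of seconds or an HTTP-date, so robust clients should parse both forms. A small sketch:

```typescript
// Parse a Retry-After header into a wait time in milliseconds.
// RFC 9110 allows either delay-seconds ("47") or an HTTP-date.
function retryAfterMs(header: string, nowMs: number = Date.now()): number {
  if (/^\d+$/.test(header.trim())) {
    return parseInt(header, 10) * 1000; // delay-seconds form
  }
  const when = Date.parse(header); // HTTP-date form
  return isNaN(when) ? 0 : Math.max(0, when - nowMs);
}
```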
DDoS Protection Strategies
Rate limiting is a crucial component of DDoS (Distributed Denial of Service) protection, but a complete strategy requires multiple layers.
Multi-Layer Defense
Layer 1: CDN and DDoS Protection Services Services like Cloudflare, AWS Shield, and Akamai provide network-level DDoS protection, filtering malicious traffic before it reaches your infrastructure. They detect volumetric attacks (high bandwidth), protocol attacks (SYN floods), and application layer attacks.
Layer 2: Load Balancer Connection Limits Configure load balancers to limit concurrent connections per IP and total connections to prevent resource exhaustion.
Layer 3: API Gateway Rate Limiting Implement the rate limiting strategies discussed earlier to control request rates per user/IP/API key.
Layer 4: Web Application Firewall (WAF) WAF rules detect malicious patterns (SQL injection, XSS) and can automatically block suspicious IPs exhibiting attack behaviors.
Layer 5: Application Business Logic Implement operation-specific limits (login attempts, password resets, expensive queries) based on business context.
Detecting and Mitigating Attacks
@Service
public class DDoSDetectionService {

    private static final Logger log = LoggerFactory.getLogger(DDoSDetectionService.class);
    // Baseline per-IP requests/minute; tune this from observed traffic
    private static final long NORMAL_RATE = 600;

    private final RateLimiter rateLimiter;
    private final MetricsRegistry metrics;
    private final StringRedisTemplate redisTemplate;

    @Scheduled(fixedDelay = 60000) // Every minute
    public void detectAnomalies() {
        // Detect IPs with unusually high request rates
        Map<String, Long> ipRequestCounts = getRecentRequestsByIp();
        for (Map.Entry<String, Long> entry : ipRequestCounts.entrySet()) {
            String ip = entry.getKey();
            long requests = entry.getValue();

            // Threshold: 10x normal traffic
            if (requests > NORMAL_RATE * 10) {
                log.warn("Suspicious traffic from IP {}: {} requests/min", ip, requests);

                // Automatically block aggressive IPs
                if (requests > NORMAL_RATE * 50) {
                    blockIp(ip, Duration.ofHours(1));
                    alertSecurityTeam(ip, requests);
                }
            }
        }
    }

    private void blockIp(String ip, Duration duration) {
        // Add to Redis blocklist with expiration
        redisTemplate.opsForValue().set(
            "blocked:ip:" + ip,
            "auto-blocked for suspicious traffic",
            duration
        );
        metrics.counter("ddos.ips.blocked").increment();
    }

    public boolean isBlocked(String ip) {
        return redisTemplate.hasKey("blocked:ip:" + ip);
    }
}
// Middleware to check IP blocklist
@Component
public class IpBlocklistFilter implements Filter {

    private final DDoSDetectionService ddosDetection;

    public IpBlocklistFilter(DDoSDetectionService ddosDetection) {
        this.ddosDetection = ddosDetection;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        String ip = getClientIp((HttpServletRequest) request);
        if (ddosDetection.isBlocked(ip)) {
            HttpServletResponse httpResponse = (HttpServletResponse) response;
            httpResponse.setStatus(HttpStatus.FORBIDDEN.value());
            httpResponse.getWriter().write("Access denied");
            return;
        }
        chain.doFilter(request, response);
    }
}
DDoS Detection Indicators:
- Sudden spike in traffic from specific IPs or regions
- High percentage of requests to expensive endpoints
- Unusual request patterns (sequential IDs, parameter fuzzing)
- Many requests returning errors (404, 401)
- Requests with suspicious user agents or missing headers
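The "10x normal traffic" heuristic from the detection service reduces to a small pure function over a per-IP request count snapshot. Thresholds here mirror the Java sketch and are illustrative only:

```typescript
type Verdict = "ok" | "suspicious" | "block";

// Classify per-IP request counts against a baseline rate, mirroring the
// thresholds in the detection service above (>10x: flag, >50x: block).
function classifyIps(
  counts: Map<string, number>,
  normalRate: number
): Map<string, Verdict> {
  const verdicts = new Map<string, Verdict>();
  for (const [ip, requests] of counts) {
    if (requests > normalRate * 50) verdicts.set(ip, "block");
    else if (requests > normalRate * 10) verdicts.set(ip, "suspicious");
    else verdicts.set(ip, "ok");
  }
  return verdicts;
}
```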
CAPTCHA and Challenge-Response
For public endpoints vulnerable to abuse (login, registration, password reset), implement CAPTCHA or other challenge-response mechanisms after rate limit thresholds.
@Service
public class LoginService {

    private final RateLimiter rateLimiter;
    private final CaptchaService captchaService;

    public LoginService(RateLimiter rateLimiter, CaptchaService captchaService) {
        this.rateLimiter = rateLimiter;
        this.captchaService = captchaService;
    }

    public LoginResponse login(LoginRequest request, String ip) {
        String key = "login:" + ip;

        // Check failed login attempts
        long failedAttempts = getFailedAttempts(key);

        // Require CAPTCHA after 3 failed attempts
        if (failedAttempts >= 3) {
            if (request.getCaptchaToken() == null) {
                return LoginResponse.requireCaptcha();
            }
            if (!captchaService.verify(request.getCaptchaToken())) {
                return LoginResponse.invalidCaptcha();
            }
        }

        // Check rate limit (10 attempts per 15 minutes)
        if (!rateLimiter.allowRequest(key, 10, 900)) {
            return LoginResponse.rateLimitExceeded();
        }

        // Attempt authentication
        User user = authenticateUser(request);
        if (user == null) {
            incrementFailedAttempts(key);
            return LoginResponse.authenticationFailed();
        }

        clearFailedAttempts(key);
        return LoginResponse.success(user);
    }
}
CAPTCHAs add friction but significantly reduce automated attacks. Use them judiciously - only for sensitive operations and after initial rate limit violations.
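The login flow above is effectively a tiered escalation policy. Isolated as a pure function (thresholds copied from the example), the policy is trivial to unit-test and tune:

```typescript
type LoginGate = "allow" | "require_captcha" | "rate_limited";

// Decide the gate for a login attempt: CAPTCHA after 3 failures,
// hard rate limit at 10 attempts in the window (as in the example above).
function loginGate(failedAttempts: number, attemptsInWindow: number): LoginGate {
  if (attemptsInWindow >= 10) return "rate_limited";
  if (failedAttempts >= 3) return "require_captcha";
  return "allow";
}
```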
GraphQL Rate Limiting
GraphQL presents unique rate limiting challenges because clients construct arbitrary queries with variable complexity. Simple request counting is insufficient since a single query might be cheap or extremely expensive.
Query Cost Analysis
Implement query cost analysis where each field has an assigned cost, and total query cost must stay within limits.
// GraphQL query cost calculator
interface FieldCost {
    [fieldName: string]: number;
}

const fieldCosts: FieldCost = {
    'User.id': 0,
    'User.name': 1,
    'User.email': 1,
    'User.posts': 5,        // Expensive: requires join
    'Post.comments': 10,    // Very expensive: nested join
    'Search.results': 20    // Expensive: full-text search
};

function calculateQueryCost(query: DocumentNode): number {
    let totalCost = 0;
    visit(query, {
        Field(node) {
            const parentType = getParentType(node);
            const fieldName = `${parentType}.${node.name.value}`;
            const cost = fieldCosts[fieldName] || 1;

            // Multiply by list size if a `first` argument is present.
            // Argument values are AST nodes, so extract the integer literal.
            const firstArg = node.arguments?.find(arg => arg.name.value === 'first');
            const listSize = firstArg && firstArg.value.kind === 'IntValue'
                ? parseInt(firstArg.value.value, 10)
                : 1;
            totalCost += cost * listSize;
        }
    });
    return totalCost;
}
// GraphQL middleware for cost-based rate limiting
const costRateLimitPlugin: Plugin = {
    async requestDidStart() {
        return {
            async didResolveOperation(requestContext) {
                const query = requestContext.document;
                const cost = calculateQueryCost(query);
                const userId = requestContext.context.userId;

                const allowed = await rateLimiter.checkLimit(
                    `graphql:${userId}`,
                    1000, // 1000 cost points per hour
                    3600
                );

                if (!allowed) {
                    throw new GraphQLError('GraphQL rate limit exceeded', {
                        extensions: {
                            code: 'RATE_LIMIT_EXCEEDED',
                            cost: cost,
                            limit: 1000
                        }
                    });
                }

                // Deduct cost from quota
                await rateLimiter.consumePoints(`graphql:${userId}`, cost);
            }
        };
    }
};
Query cost analysis ensures expensive nested queries consume more quota than simple queries, providing fair resource allocation.
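To make the arithmetic concrete, here is a simplified, standalone version of the same idea that walks a plain selection tree instead of a GraphQL AST. The tree shape and `FieldSelection` type are invented for illustration; the field costs are copied from the table above.

```typescript
interface FieldSelection {
  field: string;              // e.g. "User.posts"
  first?: number;             // list size argument, if any
  children?: FieldSelection[];
}

const costs: Record<string, number> = {
  "User.id": 0, "User.name": 1, "User.email": 1,
  "User.posts": 5, "Post.comments": 10, "Search.results": 20,
};

// Cost of a selection = list size * (own cost + children's cost), so nested
// lists multiply -- the usual reason deeply nested queries get expensive.
function cost(sel: FieldSelection): number {
  const size = sel.first ?? 1;
  const own = costs[sel.field] ?? 1;
  const children = (sel.children ?? []).reduce((sum, c) => sum + cost(c), 0);
  return size * (own + children);
}
```

Under this scheme, fetching 10 posts with 5 comments each costs 10 * (5 + 5 * 10) = 550 points, more than two orders of magnitude above a simple `User.name` lookup.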
Query Depth and Complexity Limits
In addition to cost analysis, limit query depth and complexity to prevent malicious queries that could cause excessive database load.
import { ValidationRule } from 'graphql';

// Limit query depth
function depthLimitRule(maxDepth: number): ValidationRule {
    return (context) => ({
        Field(node) {
            const depth = getDepth(node);
            if (depth > maxDepth) {
                context.reportError(
                    new GraphQLError(`Query exceeds maximum depth of ${maxDepth}`)
                );
            }
        }
    });
}

// Limit query complexity
function complexityLimitRule(maxComplexity: number): ValidationRule {
    return (context) => {
        let complexity = 0;
        return {
            Field(node) {
                complexity += calculateFieldComplexity(node);
                if (complexity > maxComplexity) {
                    context.reportError(
                        new GraphQLError(`Query exceeds maximum complexity of ${maxComplexity}`)
                    );
                }
            }
        };
    };
}

// Apply validation rules
const server = new ApolloServer({
    typeDefs,
    resolvers,
    validationRules: [
        depthLimitRule(10),
        complexityLimitRule(1000)
    ]
});
Combining cost analysis, depth limits, and complexity limits provides comprehensive protection against GraphQL query abuse.
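Since `getDepth` above is left abstract, a self-contained depth check over a simplified selection tree (the `SelNode` shape is invented for illustration) might look like:

```typescript
interface SelNode {
  name: string;
  children?: SelNode[];
}

// Depth of a query: the longest chain of nested selections.
function depth(node: SelNode): number {
  const kids = node.children ?? [];
  return 1 + (kids.length ? Math.max(...kids.map(depth)) : 0);
}

// Validation helper: reject queries nested deeper than maxDepth.
function exceedsDepth(root: SelNode, maxDepth: number): boolean {
  return depth(root) > maxDepth;
}
```

A query selecting `user { posts { comments } }` has depth 3, so it passes a limit of 10 but fails a limit of 2.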
Monitoring and Alerting
Effective rate limiting requires continuous monitoring to understand traffic patterns, detect abuse, and tune limits appropriately.
Key Metrics
@Component
public class RateLimitMetrics {

    private final MeterRegistry registry;

    public RateLimitMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordRateLimitCheck(String userId, boolean allowed, String endpoint) {
        // Count allowed vs rejected requests.
        // Note: per-user tags can explode metric cardinality; consider
        // bucketing users (e.g., by plan tier) in high-traffic systems.
        registry.counter("rate_limit.requests",
            "user", userId,
            "allowed", String.valueOf(allowed),
            "endpoint", endpoint
        ).increment();

        if (!allowed) {
            // Track rate limit violations separately
            registry.counter("rate_limit.violations",
                "user", userId,
                "endpoint", endpoint
            ).increment();
        }
    }

    public void recordRateLimitLatency(Duration latency) {
        // Track overhead of rate limit checking
        registry.timer("rate_limit.check.duration").record(latency);
    }

    @Scheduled(fixedDelay = 60000)
    public void recordAggregateMetrics() {
        // Calculate rejection rate
        double rejectionRate = calculateRejectionRate();
        registry.gauge("rate_limit.rejection.rate", rejectionRate);

        // Track users hitting limits
        long usersHittingLimits = countUsersHittingLimits();
        registry.gauge("rate_limit.users.limited", usersHittingLimits);
    }
}
Essential Metrics:
- Rejection rate: Percentage of requests rejected (high rate may indicate limits too strict)
- Per-endpoint violations: Which endpoints are rate limited most often
- User distribution: How many users hit limits (widespread vs few abusers)
- Rate limit check latency: Overhead added by rate limiting
- Redis connection failures: Availability of distributed rate limit state
Alerts
# Prometheus alert rules for rate limiting (metric names as rendered by the
# Micrometer Prometheus registry: counters gain a _total suffix, timers a
# _seconds base unit)
groups:
  - name: rate_limit_alerts
    rules:
      - alert: HighRateLimitRejectionRate
        expr: rate(rate_limit_requests_total{allowed="false"}[5m]) / rate(rate_limit_requests_total[5m]) > 0.1
        for: 10m
        annotations:
          summary: "More than 10% of requests are rate limited"
          description: "Consider investigating if limits are too strict or if there's an attack"

      - alert: RateLimitCheckSlow
        expr: histogram_quantile(0.95, rate(rate_limit_check_duration_seconds_bucket[5m])) > 0.05
        for: 5m
        annotations:
          summary: "Rate limit checks taking too long"
          description: "95th percentile latency is {{ $value }}s, check Redis performance"

      - alert: ManyUsersHittingLimits
        expr: rate_limit_users_limited > 100
        for: 15m
        annotations:
          summary: "Many users hitting rate limits"
          description: "{{ $value }} users are hitting rate limits, possible DDoS or limit too strict"
Set up alerts for anomalous rate limiting behavior to quickly detect attacks or configuration issues.
Related Topics
Rate limiting integrates closely with other system design concerns:
- Caching Strategies: Effective caching reduces load and prevents hitting rate limits; cache stampedes can trigger rate limits
- API Design: Design APIs with rate limiting in mind - expose headers, document limits, provide bulk endpoints to reduce request counts
- Security Best Practices: Rate limiting is one layer of defense-in-depth security strategy
- Observability: Monitor rate limit metrics alongside application metrics for comprehensive visibility
- Spring Boot: Spring Cloud Gateway and Spring Boot provide rate limiting integrations
- Performance Optimization: Rate limiting protects performance under load but must be tuned to avoid limiting legitimate traffic