Rate Limiting and Throttling
Rate limiting controls how frequently a client can make requests to your services within a specified time window. This protects your infrastructure from overload, prevents abuse, ensures fair resource allocation among users, and mitigates denial-of-service attacks. Limits that are too restrictive frustrate legitimate users; limits that are too lenient fail to protect your system.
This guide covers rate limiting algorithms, implementation strategies at different architectural layers, and how to communicate limits to clients through standardized headers.
Core Concepts
Understanding the distinction between related concepts helps you choose the right approach for your requirements:
Rate Limiting vs Throttling: Rate limiting rejects requests that exceed the allowed rate with error responses (typically HTTP 429 Too Many Requests). Throttling slows down request processing by introducing delays but still processes all requests. Rate limiting is more common because it provides clear feedback and prevents resource exhaustion.
Hard vs Soft Limits: Hard limits immediately reject requests once the threshold is exceeded. Soft limits allow temporary bursts above the limit but apply penalties (slower processing, reduced priority, warnings). Most systems use hard limits for simplicity and predictability.
Global vs Per-Resource Limits: Global limits apply to all operations from a client (e.g., 1000 requests/hour total). Per-resource limits apply to specific operations (e.g., 10 login attempts/minute, 100 search queries/minute). Combining both provides granular control - global limits prevent overall abuse while per-resource limits protect expensive operations.
Burst Allowances: Many rate limiting algorithms allow short bursts above the average rate to accommodate legitimate traffic spikes. For example, a limit of 100 requests/minute might allow 20 requests in a single second, as long as the average over the minute stays below 100.
Rate Limiting Algorithms
Different algorithms provide different trade-offs between accuracy, memory usage, burst handling, and implementation complexity. Selecting the appropriate algorithm depends on your specific requirements for precision, resource availability, and desired burst behavior.
Token Bucket
The token bucket algorithm maintains a bucket that holds tokens. Tokens are added to the bucket at a constant rate up to a maximum capacity. Each request consumes one or more tokens. If sufficient tokens are available, the request is allowed and tokens are removed. If insufficient tokens exist, the request is rejected.
This algorithm naturally allows bursts up to the bucket capacity while maintaining an average rate over time. Token bucket is the most widely used rate limiting algorithm due to its simplicity and burst-handling characteristics.
// Token bucket implementation in Java
import java.util.concurrent.atomic.AtomicLong;

public class TokenBucket {
    private final long capacity;         // Maximum tokens
    private final long refillRate;       // Tokens added per second
    private final AtomicLong tokens;     // Current token count
    private final AtomicLong lastRefill; // Last refill timestamp (nanos)

    public TokenBucket(long capacity, long refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = new AtomicLong(capacity);
        this.lastRefill = new AtomicLong(System.nanoTime());
    }

    public boolean tryConsume(long tokensToConsume) {
        refill();
        // Retry the compare-and-swap so a concurrent update by another
        // thread doesn't spuriously reject a request while tokens remain
        while (true) {
            long currentTokens = tokens.get();
            if (currentTokens < tokensToConsume) {
                return false; // Insufficient tokens
            }
            if (tokens.compareAndSet(currentTokens, currentTokens - tokensToConsume)) {
                return true;
            }
        }
    }

    private void refill() {
        long now = System.nanoTime();
        long lastRefillTime = lastRefill.get();
        // Calculate tokens to add based on elapsed time
        long elapsedNanos = now - lastRefillTime;
        long tokensToAdd = (elapsedNanos * refillRate) / 1_000_000_000L;
        if (tokensToAdd > 0) {
            long currentTokens = tokens.get();
            long newTokens = Math.min(capacity, currentTokens + tokensToAdd);
            if (tokens.compareAndSet(currentTokens, newTokens)) {
                lastRefill.set(now);
            }
        }
    }

    public long availableTokens() {
        refill();
        return tokens.get();
    }
}
// Usage in a service
@Service
public class RateLimitedApiService {
    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private static final long RATE_LIMIT = 100; // 100 requests
    private static final long TIME_WINDOW = 60; // per 60 seconds

    public boolean allowRequest(String userId) {
        // Note: integer division truncates (100 / 60 = 1 token/second,
        // i.e. 60/minute); use a fractional refill rate for exact limits
        TokenBucket bucket = buckets.computeIfAbsent(userId,
            key -> new TokenBucket(RATE_LIMIT, RATE_LIMIT / TIME_WINDOW)
        );
        return bucket.tryConsume(1);
    }
}
Token Bucket Characteristics:
- Allows bursts: Clients can consume all available tokens immediately for burst traffic
- Smooth average rate: Refill rate ensures long-term average doesn't exceed limit
- Memory efficient: Only stores current token count and last refill time per key
- Simple implementation: Straightforward logic that's easy to understand and debug
- Most common: Used by AWS API Gateway, Stripe API, many other services
Token bucket is ideal when you want to allow legitimate bursts (e.g., page load making multiple API calls) while preventing sustained abuse.
Leaky Bucket
The leaky bucket algorithm processes requests at a constant rate regardless of arrival pattern. Requests are added to a queue (the bucket), and processed at a fixed rate (the leak). If the queue is full, new requests are rejected. This smooths out bursts by enforcing a consistent processing rate.
// Leaky bucket implementation in TypeScript
class LeakyBucket {
  private queue: Array<() => Promise<void>> = [];
  private processing = false;

  constructor(
    private readonly capacity: number, // Maximum queue size
    private readonly leakRate: number  // Requests processed per second
  ) {}

  async addRequest(request: () => Promise<void>): Promise<boolean> {
    if (this.queue.length >= this.capacity) {
      return false; // Bucket is full, reject request
    }
    this.queue.push(request);
    void this.processQueue(); // Fire and forget; drains in the background
    return true;
  }

  private async processQueue(): Promise<void> {
    if (this.processing || this.queue.length === 0) {
      return;
    }
    this.processing = true;
    while (this.queue.length > 0) {
      const request = this.queue.shift()!;
      try {
        await request();
      } catch (error) {
        console.error('Request processing failed:', error);
      }
      // Wait for leak rate interval before processing next request
      const intervalMs = 1000 / this.leakRate;
      await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
    this.processing = false;
  }

  getQueueSize(): number {
    return this.queue.length;
  }
}
// Usage example
const bucket = new LeakyBucket(50, 10); // 50 capacity, 10 req/sec

async function handleApiRequest(userId: string, request: () => Promise<void>) {
  const allowed = await bucket.addRequest(request);
  if (!allowed) {
    // TooManyRequestsError is an application-defined error mapped to HTTP 429
    throw new TooManyRequestsError('Rate limit exceeded, queue full');
  }
}
Leaky Bucket Characteristics:
- Constant output rate: Processes requests at fixed rate regardless of input
- Smooths bursts: Queues burst traffic and processes at steady rate
- No immediate bursts: Unlike token bucket, can't immediately process multiple requests
- Queue overhead: Requires queue memory proportional to capacity
- Fairness: FIFO processing ensures fair ordering
Leaky bucket is appropriate when you need consistent, predictable load on downstream services and can tolerate queueing delay.
Fixed Window
Fixed window divides time into fixed intervals (windows) and counts requests per window. Each window has an independent counter that resets at window boundaries. This is the simplest rate limiting algorithm but has edge case issues.
// Fixed window counter using Redis
@Service
public class FixedWindowRateLimiter {
    private final RedisTemplate<String, String> redisTemplate;
    private static final long WINDOW_SIZE_SECONDS = 60;
    private static final long MAX_REQUESTS = 100;

    public FixedWindowRateLimiter(RedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public boolean allowRequest(String userId) {
        long currentWindow = System.currentTimeMillis() / 1000 / WINDOW_SIZE_SECONDS;
        String key = String.format("rate_limit:%s:%d", userId, currentWindow);
        // Increment counter for current window
        Long requests = redisTemplate.opsForValue().increment(key);
        if (requests == 1) {
            // First request in this window, set expiration
            redisTemplate.expire(key, Duration.ofSeconds(WINDOW_SIZE_SECONDS * 2));
        }
        return requests <= MAX_REQUESTS;
    }

    public long getRemainingRequests(String userId) {
        long currentWindow = System.currentTimeMillis() / 1000 / WINDOW_SIZE_SECONDS;
        String key = String.format("rate_limit:%s:%d", userId, currentWindow);
        // Read once to avoid racing between the null check and the parse
        String value = redisTemplate.opsForValue().get(key);
        long requests = value != null ? Long.parseLong(value) : 0L;
        return Math.max(0, MAX_REQUESTS - requests);
    }
}
Fixed Window Problem: Users can make 200 requests in 2 seconds by making 100 requests at the end of window 1 (11:59:59) and 100 requests at the start of window 2 (12:00:00). This violates the intended rate limit of 100 requests per minute.
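The boundary problem can be reproduced with a toy fixed-window counter (a sketch; time is passed in explicitly as seconds so the scenario is deterministic):

```typescript
// Toy fixed-window counter: window index = floor(time / windowSize)
class FixedWindowCounter {
  private counts = new Map<number, number>();

  constructor(
    private readonly limit: number,
    private readonly windowSeconds: number
  ) {}

  allow(nowSeconds: number): boolean {
    const window = Math.floor(nowSeconds / this.windowSeconds);
    const used = this.counts.get(window) ?? 0;
    if (used >= this.limit) return false;
    this.counts.set(window, used + 1);
    return true;
  }
}

// 100 requests at t=59 (end of window 0) and 100 at t=60 (start of
// window 1) all succeed: 200 requests in 2 seconds despite a
// 100-per-minute limit
const counter = new FixedWindowCounter(100, 60);
let allowed = 0;
for (let i = 0; i < 100; i++) if (counter.allow(59)) allowed++;
for (let i = 0; i < 100; i++) if (counter.allow(60)) allowed++;
// allowed is now 200
```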
Fixed Window Characteristics:
- Simple implementation: Single counter per window, minimal memory
- Low computational cost: Just increment and compare
- Edge case issues: Double rate at window boundaries
- Acceptable for coarse limits: Works well when precision isn't critical
Fixed window is suitable for coarse-grained rate limiting where the boundary edge case is acceptable, or when simplicity is more important than precision.
Sliding Window Log
Sliding window log maintains a timestamp log of all requests within the time window. For each new request, it removes timestamps outside the current window and checks if the remaining count exceeds the limit. This provides precise rate limiting without edge cases but requires significant memory.
// Sliding window log implementation
class SlidingWindowLog {
  private requestLogs: Map<string, number[]> = new Map();

  constructor(
    private readonly maxRequests: number,
    private readonly windowMs: number
  ) {}

  allowRequest(userId: string): boolean {
    const now = Date.now();
    const windowStart = now - this.windowMs;
    // Get existing log or create new one
    let log = this.requestLogs.get(userId) || [];
    // Remove timestamps outside current window
    log = log.filter(timestamp => timestamp > windowStart);
    if (log.length >= this.maxRequests) {
      this.requestLogs.set(userId, log);
      return false; // Rate limit exceeded
    }
    // Add current request timestamp
    log.push(now);
    this.requestLogs.set(userId, log);
    return true;
  }

  getRemainingRequests(userId: string): number {
    const now = Date.now();
    const windowStart = now - this.windowMs;
    let log = this.requestLogs.get(userId) || [];
    log = log.filter(timestamp => timestamp > windowStart);
    return Math.max(0, this.maxRequests - log.length);
  }

  // Cleanup old entries periodically
  cleanup(): void {
    const now = Date.now();
    const windowStart = now - this.windowMs;
    for (const [userId, log] of this.requestLogs.entries()) {
      const filtered = log.filter(timestamp => timestamp > windowStart);
      if (filtered.length === 0) {
        this.requestLogs.delete(userId);
      } else {
        this.requestLogs.set(userId, filtered);
      }
    }
  }
}
// Redis implementation for distributed scenarios
class RedisSlidingWindowLog {
  constructor(
    private readonly redis: Redis,
    private readonly maxRequests: number,
    private readonly windowMs: number
  ) {}

  async allowRequest(userId: string): Promise<boolean> {
    const key = `rate_limit:log:${userId}`;
    const now = Date.now();
    const windowStart = now - this.windowMs;
    // Note: these three commands are not atomic; under heavy concurrency,
    // wrap them in a MULTI/EXEC transaction or a Lua script
    await this.redis.zremrangebyscore(key, '-inf', windowStart);
    const count = await this.redis.zcard(key);
    if (count >= this.maxRequests) {
      return false;
    }
    // Add current timestamp (score and member are the same)
    await this.redis.zadd(key, now, `${now}`);
    await this.redis.pexpire(key, this.windowMs);
    return true;
  }
}
Sliding Window Log Characteristics:
- Perfect precision: No edge case issues, accurate to the millisecond
- High memory usage: Stores timestamp for every request in the window
- Scales poorly: Memory usage grows with request volume
- Distributed complexity: Requires synchronized log storage (Redis sorted sets)
Sliding window log is appropriate for critical rate limits where precision is essential and request volumes are manageable (login attempts, password resets, sensitive operations).
Sliding Window Counter
Sliding window counter approximates sliding window log with much lower memory usage. It uses two fixed windows (current and previous) and interpolates between them based on the current position within the window.
The formula estimates requests in the sliding window as:
requests = (previous_window_count * overlap_percentage) + current_window_count
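A quick worked example of the interpolation (hypothetical numbers): with a 60-second window, 84 requests in the previous window, 36 so far in the current window, and the current time 30% of the way into the current window, the overlap percentage is 0.7:

```typescript
// Interpolated estimate for the sliding window counter
function estimateSlidingCount(
  previousCount: number,
  currentCount: number,
  windowPosition: number // fraction of the current window elapsed, 0.0-1.0
): number {
  // (1 - windowPosition) of the previous window still overlaps the
  // sliding window
  return previousCount * (1 - windowPosition) + currentCount;
}

// 84 * 0.7 + 36 = 94.8 -> under a limit of 100, the request is allowed
const estimate = estimateSlidingCount(84, 36, 0.3);
```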
// Sliding window counter implementation
@Service
public class SlidingWindowCounter {
    private final RedisTemplate<String, String> redisTemplate;
    private static final long WINDOW_SIZE_SECONDS = 60;
    private static final long MAX_REQUESTS = 100;

    public SlidingWindowCounter(RedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public boolean allowRequest(String userId) {
        long now = System.currentTimeMillis() / 1000;
        long currentWindow = now / WINDOW_SIZE_SECONDS;
        long previousWindow = currentWindow - 1;
        String currentKey = String.format("rate_limit:%s:%d", userId, currentWindow);
        String previousKey = String.format("rate_limit:%s:%d", userId, previousWindow);
        // Get counts from both windows (note: rejected requests still
        // increment the counter, which makes the estimate conservative)
        long currentCount = increment(currentKey);
        long previousCount = getCount(previousKey);
        // Calculate position within current window (0.0 to 1.0)
        double windowPosition = (double) (now % WINDOW_SIZE_SECONDS) / WINDOW_SIZE_SECONDS;
        // Estimate total requests in sliding window
        double estimatedCount = (previousCount * (1 - windowPosition)) + currentCount;
        return estimatedCount <= MAX_REQUESTS;
    }

    private long increment(String key) {
        Long count = redisTemplate.opsForValue().increment(key);
        if (count == 1) {
            redisTemplate.expire(key, Duration.ofSeconds(WINDOW_SIZE_SECONDS * 2));
        }
        return count;
    }

    private long getCount(String key) {
        String value = redisTemplate.opsForValue().get(key);
        return value != null ? Long.parseLong(value) : 0L;
    }

    public RateLimitInfo getRateLimitInfo(String userId) {
        long now = System.currentTimeMillis() / 1000;
        long currentWindow = now / WINDOW_SIZE_SECONDS;
        long previousWindow = currentWindow - 1;
        String currentKey = String.format("rate_limit:%s:%d", userId, currentWindow);
        String previousKey = String.format("rate_limit:%s:%d", userId, previousWindow);
        long currentCount = getCount(currentKey);
        long previousCount = getCount(previousKey);
        double windowPosition = (double) (now % WINDOW_SIZE_SECONDS) / WINDOW_SIZE_SECONDS;
        double estimatedCount = (previousCount * (1 - windowPosition)) + currentCount;
        long remaining = Math.max(0, MAX_REQUESTS - (long) Math.ceil(estimatedCount));
        long resetTime = (currentWindow + 1) * WINDOW_SIZE_SECONDS;
        return new RateLimitInfo(MAX_REQUESTS, remaining, resetTime);
    }
}

record RateLimitInfo(long limit, long remaining, long reset) {}
Sliding Window Counter Characteristics:
- Good precision: Much better than fixed window, slight approximation vs log
- Memory efficient: Only two counters per user
- No edge cases: Smooth behavior at window boundaries
- Widely used: Good balance of accuracy and efficiency
Sliding window counter is the recommended algorithm for most use cases - it provides excellent precision with minimal overhead.
Implementation Strategies
Rate limiting can be implemented at different layers of your architecture. The choice depends on your requirements for centralization, latency tolerance, and infrastructure complexity.
API Gateway Rate Limiting
Implementing rate limiting at the API gateway provides centralized control and prevents rate-limited requests from reaching application servers. This is the most efficient approach for protecting infrastructure.
# Spring Cloud Gateway rate limiting configuration
spring:
  cloud:
    gateway:
      routes:
        - id: api_route
          uri: lb://backend-service
          predicates:
            - Path=/api/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 10   # Tokens per second
                redis-rate-limiter.burstCapacity: 20   # Maximum burst
                redis-rate-limiter.requestedTokens: 1  # Tokens per request
                key-resolver: "#{@userKeyResolver}"    # Extract user ID
// Custom key resolver for extracting the rate limit key
@Component
public class UserKeyResolver implements KeyResolver {
    @Override
    public Mono<String> resolve(ServerWebExchange exchange) {
        // Extract user ID from JWT token or API key
        return exchange.getPrincipal()
            .map(Principal::getName)
            .defaultIfEmpty("anonymous");
    }
}
Gateway Rate Limiting Advantages:
- Early rejection: Blocks requests before they consume application resources
- Centralized configuration: Single place to manage all rate limits
- Consistent enforcement: Same limits across all backend instances
- Infrastructure protection: Prevents DDoS from reaching applications
Gateway Rate Limiting Considerations:
- Single point of failure: Gateway outage affects all traffic
- Limited context: May not have business logic context for nuanced limits
- Coordination overhead: Distributed gateways need shared state (Redis)
Application-Level Rate Limiting
Implementing rate limiting within the application provides access to business context and allows fine-grained per-operation limits. This complements gateway-level limits for additional protection.
// Spring Boot application-level rate limiting
@RestController
@RequestMapping("/api")
public class UserController {
    private final RateLimiter rateLimiter;
    private final UserService userService;

    public UserController(RateLimiter rateLimiter, UserService userService) {
        this.rateLimiter = rateLimiter;
        this.userService = userService;
    }

    @GetMapping("/users/{id}")
    public ResponseEntity<User> getUser(@PathVariable String id, HttpServletRequest request) {
        String userId = extractUserId(request);
        // Check rate limit
        if (!rateLimiter.allowRequest(userId, "get_user")) {
            RateLimitInfo info = rateLimiter.getRateLimitInfo(userId, "get_user");
            return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS)
                .header("X-RateLimit-Limit", String.valueOf(info.limit()))
                .header("X-RateLimit-Remaining", "0")
                .header("X-RateLimit-Reset", String.valueOf(info.reset()))
                .header("Retry-After", String.valueOf(info.reset() - System.currentTimeMillis() / 1000))
                .build();
        }
        User user = userService.getUser(id);
        RateLimitInfo info = rateLimiter.getRateLimitInfo(userId, "get_user");
        return ResponseEntity.ok()
            .header("X-RateLimit-Limit", String.valueOf(info.limit()))
            .header("X-RateLimit-Remaining", String.valueOf(info.remaining()))
            .header("X-RateLimit-Reset", String.valueOf(info.reset()))
            .body(user);
    }

    // More restrictive limit for expensive operations
    @PostMapping("/users/{id}/reports")
    public ResponseEntity<Report> generateReport(@PathVariable String id, HttpServletRequest request) {
        String userId = extractUserId(request);
        // Different limit for expensive operation
        if (!rateLimiter.allowRequest(userId, "generate_report")) {
            throw new RateLimitExceededException("Report generation limit exceeded");
        }
        Report report = userService.generateReport(id);
        return ResponseEntity.ok(report);
    }
}

// Exception handler for rate limit violations
@ControllerAdvice
public class RateLimitExceptionHandler {
    @ExceptionHandler(RateLimitExceededException.class)
    public ResponseEntity<ErrorResponse> handleRateLimit(RateLimitExceededException ex) {
        return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS)
            .body(new ErrorResponse(
                "RATE_LIMIT_EXCEEDED",
                ex.getMessage(),
                "Please wait before making additional requests"
            ));
    }
}
Application-Level Advantages:
- Business context: Access to user roles, subscription tiers, operation types
- Granular control: Different limits per endpoint, user type, or operation
- Flexible logic: Custom rate limit rules based on complex criteria
- Graceful degradation: Can queue or defer rather than reject
Application-Level Considerations:
- Resource consumption: Rate-limited requests still consume network and parsing resources
- Consistency: Must coordinate across multiple instances (requires distributed state)
- Implementation overhead: More code to maintain vs gateway configuration
Distributed Rate Limiting
When running multiple application instances, rate limiting requires shared state to enforce consistent limits across all instances. Redis is the most common solution for distributed rate limiting.
// Distributed rate limiter using Redis Lua scripts for atomicity
import Redis from 'ioredis';

class DistributedRateLimiter {
  private redis: Redis;

  // Lua script for atomic sliding window counter
  private slidingWindowScript = `
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])

    local current_window = math.floor(now / window)
    local previous_window = current_window - 1
    local current_key = key .. ":" .. current_window
    local previous_key = key .. ":" .. previous_window

    local current_count = tonumber(redis.call("GET", current_key) or "0")
    local previous_count = tonumber(redis.call("GET", previous_key) or "0")

    local window_position = (now % window) / window
    local estimated_count = math.floor((previous_count * (1 - window_position)) + current_count)

    if estimated_count >= limit then
      return {0, limit, 0, current_window + 1}
    end

    redis.call("INCR", current_key)
    redis.call("EXPIRE", current_key, window * 2)
    local remaining = limit - (current_count + 1)
    return {1, limit, remaining, current_window + 1}
  `;

  constructor() {
    this.redis = new Redis({
      host: process.env.REDIS_HOST,
      port: parseInt(process.env.REDIS_PORT || '6379', 10),
      maxRetriesPerRequest: 3
    });
    this.redis.defineCommand('slidingWindowLimit', {
      numberOfKeys: 1,
      lua: this.slidingWindowScript
    });
  }

  async checkLimit(
    userId: string,
    limit: number,
    windowSeconds: number
  ): Promise<RateLimitResult> {
    const key = `rate_limit:${userId}`;
    const now = Math.floor(Date.now() / 1000);
    try {
      // @ts-ignore - Custom command defined above
      const result = await this.redis.slidingWindowLimit(
        key,
        now,
        windowSeconds,
        limit
      );
      const [allowed, totalLimit, remaining, resetWindow] = result;
      return {
        allowed: allowed === 1,
        limit: totalLimit,
        remaining: remaining,
        reset: resetWindow * windowSeconds
      };
    } catch (error) {
      // Fail open: allow request if Redis is unavailable
      console.error('Rate limiter Redis error:', error);
      return {
        allowed: true,
        limit: limit,
        remaining: limit,
        reset: now + windowSeconds
      };
    }
  }
}

interface RateLimitResult {
  allowed: boolean;
  limit: number;
  remaining: number;
  reset: number;
}
// Express middleware using the distributed rate limiter
// (Request, Response, NextFunction are imported from 'express';
// req.user is assumed to be populated by an auth middleware)
function rateLimitMiddleware(limiter: DistributedRateLimiter) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const userId = req.user?.id || req.ip;
    const result = await limiter.checkLimit(userId, 100, 60);
    // Set rate limit headers
    res.set('X-RateLimit-Limit', result.limit.toString());
    res.set('X-RateLimit-Remaining', result.remaining.toString());
    res.set('X-RateLimit-Reset', result.reset.toString());
    if (!result.allowed) {
      const retryAfter = result.reset - Math.floor(Date.now() / 1000);
      res.set('Retry-After', retryAfter.toString());
      return res.status(429).json({
        error: 'Too Many Requests',
        message: 'Rate limit exceeded',
        retryAfter: retryAfter
      });
    }
    next();
  };
}
Distributed Rate Limiting Best Practices:
- Use Lua scripts: Ensure atomic operations in Redis (no race conditions)
- Fail open: If Redis is unavailable, allow requests rather than blocking all traffic
- Connection pooling: Reuse Redis connections to minimize overhead
- Monitoring: Track Redis latency, connection failures, and failover scenarios
- Backup strategy: Consider local rate limiting as fallback if distributed state unavailable
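The fallback idea in the last point can be sketched as a wrapper that degrades to a per-instance in-memory limiter when the shared store is unreachable (a sketch; the `Limiter` interface and class names here are hypothetical):

```typescript
interface Limiter {
  checkLimit(key: string, limit: number, windowSeconds: number): Promise<boolean>;
}

// Per-instance fixed-window fallback: only approximate across instances,
// but keeps some protection when the shared Redis state is unavailable
class LocalFallbackLimiter implements Limiter {
  private counts = new Map<string, { window: number; used: number }>();

  async checkLimit(key: string, limit: number, windowSeconds: number): Promise<boolean> {
    const window = Math.floor(Date.now() / 1000 / windowSeconds);
    const entry = this.counts.get(key);
    if (!entry || entry.window !== window) {
      this.counts.set(key, { window, used: 1 }); // new window, first request
      return true;
    }
    if (entry.used >= limit) return false;
    entry.used++;
    return true;
  }
}

// Wrapper: try the distributed limiter first, degrade to the local one
class ResilientLimiter implements Limiter {
  constructor(
    private readonly primary: Limiter,
    private readonly fallback: Limiter
  ) {}

  async checkLimit(key: string, limit: number, windowSeconds: number): Promise<boolean> {
    try {
      return await this.primary.checkLimit(key, limit, windowSeconds);
    } catch {
      return this.fallback.checkLimit(key, limit, windowSeconds);
    }
  }
}
```

Because each instance counts independently during a Redis outage, the effective global limit temporarily becomes limit-per-instance; that trade-off is usually preferable to blocking all traffic.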
Database-Level Rate Limiting
For some scenarios, rate limiting can be enforced at the database level using row locks or periodic cleanup of rate limit tables. This is less common but useful when database is already the bottleneck.
-- Rate limit tracking table
CREATE TABLE rate_limits (
    user_id VARCHAR(255),
    window_start TIMESTAMP,
    request_count INTEGER,
    PRIMARY KEY (user_id, window_start)
);

CREATE INDEX idx_rate_limits_window ON rate_limits(window_start);

-- Periodic cleanup of old windows
DELETE FROM rate_limits
WHERE window_start < NOW() - INTERVAL '1 hour';
Database-level rate limiting is generally avoided because:
- Adds latency to every request
- Increases database load (defeating the purpose of rate limiting)
- Complex distributed coordination
- Better handled at application or gateway layer
Use database rate limiting only when rate limit state must be persisted for compliance or audit purposes, or when database is already the synchronization point.
Per-User vs Per-IP vs Per-API Key Limiting
Different rate limit keys serve different purposes and protect against different attack vectors. Most systems combine multiple strategies.
Per-User Rate Limiting
Rate limiting authenticated users by user ID provides fair resource allocation and prevents abuse from individual accounts. This is the primary rate limiting strategy for authenticated APIs.
@Component
public class UserRateLimitKeyResolver implements RateLimitKeyResolver {
    // jwtParser is a configured JWT parser (e.g. from the jjwt library)
    private final JwtParser jwtParser;

    public UserRateLimitKeyResolver(JwtParser jwtParser) {
        this.jwtParser = jwtParser;
    }

    @Override
    public String resolveKey(HttpServletRequest request) {
        // Extract user ID from JWT token
        String token = request.getHeader("Authorization");
        if (token != null && token.startsWith("Bearer ")) {
            Claims claims = jwtParser.parseClaimsJws(token.substring(7)).getBody();
            return "user:" + claims.getSubject();
        }
        // Fallback to IP for unauthenticated requests
        return "ip:" + getClientIp(request);
    }
}
Per-User Rate Limiting Characteristics:
- Fair allocation: Each user gets their own quota
- Account-based: Limits follow the user across devices/IPs
- Subscription tiers: Can vary limits by user plan (free, premium, enterprise)
- Doesn't prevent account creation abuse: Attackers can create many accounts
Per-IP Rate Limiting
Rate limiting by IP address protects against DDoS attacks and brute force attempts from specific sources. This is essential for public endpoints and unauthenticated traffic.
function getClientIp(req: Request): string {
  // Check X-Forwarded-For header (set by load balancer/proxy)
  const forwarded = req.headers['x-forwarded-for'];
  if (forwarded) {
    // The header may be a string or string[]; take the first (client) IP,
    // not the proxy IPs appended after it
    const value = Array.isArray(forwarded) ? forwarded[0] : forwarded;
    return value.split(',')[0].trim();
  }
  // Check X-Real-IP header
  const realIp = req.headers['x-real-ip'];
  if (realIp) {
    return Array.isArray(realIp) ? realIp[0] : realIp;
  }
  // Fallback to the connection's remote address
  return req.socket.remoteAddress || '';
}

async function rateLimitByIp(req: Request, res: Response, next: NextFunction) {
  const ip = getClientIp(req);
  const allowed = await rateLimiter.checkLimit(`ip:${ip}`, 1000, 3600); // 1000/hour
  if (!allowed) {
    return res.status(429).json({ error: 'Too many requests from this IP' });
  }
  next();
}
Per-IP Rate Limiting Considerations:
- NAT/proxy issues: Multiple users behind corporate NAT share IP
- IPv6 challenges: Users may have many IPv6 addresses
- VPN circumvention: Attackers can rotate IPs via VPN/proxy services
- CDN/proxy detection: Must extract real client IP from headers (X-Forwarded-For)
Best Practice: Combine per-IP and per-user rate limiting - stricter limits for unauthenticated (per-IP), more generous limits for authenticated (per-user).
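This combined strategy can be sketched as layered checks (hypothetical limits and key prefixes; `check` stands in for any of the limiter implementations above):

```typescript
// check(key, limit, windowSeconds) -> whether the request is allowed
type Check = (key: string, limit: number, windowSeconds: number) => boolean;

function allowRequest(check: Check, userId: string | null, ip: string): boolean {
  // Everyone passes through a broad per-IP limit first (DDoS guard)
  if (!check(`ip:${ip}`, 1000, 3600)) return false;
  // Authenticated users then get their own, more generous per-user quota
  if (userId !== null) return check(`user:${userId}`, 5000, 3600);
  // Unauthenticated traffic gets a stricter per-IP quota
  return check(`anon:${ip}`, 100, 3600);
}
```

Keying the layers separately means an abusive anonymous client exhausts only the strict `anon:` quota, while authenticated users sharing the same NAT IP keep their own allowance.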
Per-API Key Rate Limiting
For APIs consumed by other services, rate limiting by API key provides clear quotas per integration. This is standard for public APIs offered to partners or customers.
@Component
public class ApiKeyRateLimiter {
    private final RateLimiter rateLimiter;
    private final ApiKeyRepository apiKeyRepository;

    public ApiKeyRateLimiter(RateLimiter rateLimiter, ApiKeyRepository apiKeyRepository) {
        this.rateLimiter = rateLimiter;
        this.apiKeyRepository = apiKeyRepository;
    }

    public boolean checkRateLimit(String apiKey) {
        // Look up API key details (includes tier/plan)
        ApiKeyInfo keyInfo = apiKeyRepository.findByKey(apiKey)
            .orElseThrow(() -> new UnauthorizedException("Invalid API key"));
        // Get rate limit based on subscription tier
        RateLimitConfig config = getRateLimitForTier(keyInfo.getTier());
        String limitKey = "apikey:" + apiKey;
        return rateLimiter.allowRequest(limitKey, config.limit(), config.windowSeconds());
    }

    private RateLimitConfig getRateLimitForTier(SubscriptionTier tier) {
        return switch (tier) {
            case FREE -> new RateLimitConfig(100, 3600);          // 100/hour
            case BASIC -> new RateLimitConfig(1000, 3600);        // 1,000/hour
            case PREMIUM -> new RateLimitConfig(10000, 3600);     // 10,000/hour
            case ENTERPRISE -> new RateLimitConfig(100000, 3600); // 100,000/hour
        };
    }
}

record RateLimitConfig(long limit, long windowSeconds) {}
Per-API Key Advantages:
- Clear quotas: Customers know their allocation
- Billing integration: Can tie rate limits to pricing tiers
- Granular tracking: Monitor usage per customer/integration
- Prevents abuse: Revoke keys for violators without affecting others
Burst Allowances and Gradual Backoff
Simply rejecting requests at hard limits can degrade user experience. Burst allowances and gradual backoff provide smoother behavior.
Burst Allowances
Burst allowances permit short-term spikes above average rate while maintaining long-term limits. Token bucket naturally provides this; fixed/sliding windows need explicit burst handling.
public class BurstAwareRateLimiter {
    private final long sustainedRate; // Sustained requests per second
    private final long burstRate;     // Peak requests per second
    private final long burstDuration; // How long a burst can sustain (seconds)

    public BurstAwareRateLimiter(long sustainedRate, long burstRate, long burstDuration) {
        this.sustainedRate = sustainedRate;
        this.burstRate = burstRate;
        this.burstDuration = burstDuration;
    }

    public boolean allowRequest(String userId) {
        // Short window for burst detection (1 second)
        boolean burstAllowed = checkLimit(userId, "burst", burstRate, 1);
        if (!burstAllowed) {
            return false; // Exceeds even the burst rate
        }
        // Long window for sustained rate (1 minute)
        boolean sustainedAllowed = checkLimit(userId, "sustained", sustainedRate * 60, 60);
        if (!sustainedAllowed) {
            // Over sustained rate but under burst rate:
            // allow only if burst quota is still available
            return checkBurstQuota(userId);
        }
        return true;
    }

    private boolean checkBurstQuota(String userId) {
        // Track how long the user has been bursting; deny if too long
        Long burstStart = getBurstStartTime(userId);
        if (burstStart != null) {
            long burstSeconds = (System.currentTimeMillis() / 1000) - burstStart;
            if (burstSeconds > burstDuration) {
                return false; // Burst duration exceeded
            }
        } else {
            setBurstStartTime(userId, System.currentTimeMillis() / 1000);
        }
        return true;
    }

    // checkLimit, getBurstStartTime, and setBurstStartTime delegate to the
    // underlying rate limiter and state store (implementations omitted)
}
Burst allowances are essential for good user experience when legitimate usage patterns include spikes (page loads, batch operations).
Gradual Backoff
Instead of hard rejection, gradual backoff increases delay or reduces quality as clients approach limits. This provides smoother degradation.
class GradualBackoffRateLimiter {
  // rateLimiter is any limiter exposing getRateLimitInfo(userId)
  // with { limit, remaining } fields
  constructor(private readonly rateLimiter: RateLimiter) {}

  async handleRequest(userId: string, request: () => Promise<any>): Promise<any> {
    const usage = await this.getUsagePercentage(userId);
    if (usage >= 1.0) {
      // Hard limit reached
      throw new RateLimitError('Rate limit exceeded');
    }
    if (usage >= 0.9) {
      // 90-100%: Significant delay
      await this.delay(2000);
    } else if (usage >= 0.75) {
      // 75-90%: Moderate delay
      await this.delay(1000);
    } else if (usage >= 0.5) {
      // 50-75%: Small delay
      await this.delay(500);
    }
    // Under 50%: No delay
    return request();
  }

  private async getUsagePercentage(userId: string): Promise<number> {
    const info = await this.rateLimiter.getRateLimitInfo(userId);
    return (info.limit - info.remaining) / info.limit;
  }

  private delay(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
Gradual backoff works well for scenarios where some service is better than no service, but can complicate client-side retry logic.
Rate Limit Headers
Standardized HTTP headers communicate rate limit status to clients, enabling them to self-regulate and avoid hitting limits.
Standard Rate Limit Headers
The X-RateLimit-* headers below are a widely adopted de facto convention; the IETF draft "RateLimit header fields for HTTP" standardizes equivalent headers without the X- prefix. Three headers carry the core rate limit information:
X-RateLimit-Limit: 100 # Total requests allowed in window
X-RateLimit-Remaining: 73 # Requests remaining in current window
X-RateLimit-Reset: 1640000000 # Unix timestamp when window resets
Some APIs also include additional headers:
X-RateLimit-Policy: 100;w=60 # Policy description (100 per 60 seconds)
Retry-After: 47 # Seconds until client can retry (after 429)
@Component
public class RateLimitHeaderInterceptor implements HandlerInterceptor {
    private final RateLimiter rateLimiter;

    public RateLimitHeaderInterceptor(RateLimiter rateLimiter) {
        this.rateLimiter = rateLimiter;
    }

    @Override
    public boolean preHandle(HttpServletRequest request,
                             HttpServletResponse response,
                             Object handler) throws Exception {
        String userId = extractUserId(request);
        RateLimitInfo info = rateLimiter.checkLimit(userId);
        // Always include rate limit headers (even on successful requests)
        response.setHeader("X-RateLimit-Limit", String.valueOf(info.limit()));
        response.setHeader("X-RateLimit-Remaining", String.valueOf(info.remaining()));
        response.setHeader("X-RateLimit-Reset", String.valueOf(info.reset()));
        if (!info.allowed()) {
            long retryAfter = info.reset() - (System.currentTimeMillis() / 1000);
            response.setHeader("Retry-After", String.valueOf(retryAfter));
            response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
            response.setContentType("application/json");
            response.getWriter().write(
                "{\"error\":\"Rate limit exceeded\",\"retryAfter\":" + retryAfter + "}"
            );
            return false;
        }
        return true;
    }
}
Why Rate Limit Headers Matter:
- Client self-regulation: Clients can slow down before hitting limits
- Better error handling: Clients know when to retry
- Transparency: Users understand their quota usage
- Debugging: Easier to diagnose rate limiting issues
Always include rate limit headers in responses, not just when rate limits are exceeded.
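On the client side, these headers enable proactive pacing. A sketch of the idea, using the header names above; the even-spreading heuristic is illustrative, not a standard algorithm:

```typescript
// Given rate limit header values, compute how long a client should wait
// before its next request so remaining quota is spread evenly across the
// rest of the window. Returns a delay in milliseconds.
function paceFromHeaders(
  limitHeader: string | null,
  remainingHeader: string | null,
  resetHeader: string | null, // Unix timestamp (seconds) when window resets
  nowSeconds: number
): number {
  if (!limitHeader || !remainingHeader || !resetHeader) return 0; // no info: don't pace
  const remaining = parseInt(remainingHeader, 10);
  const reset = parseInt(resetHeader, 10);
  const windowLeft = Math.max(0, reset - nowSeconds);
  if (remaining <= 0) return windowLeft * 1000; // quota exhausted: wait for reset
  // Spread remaining requests evenly over the rest of the window
  return (windowLeft / remaining) * 1000;
}
```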
HTTP 429 Too Many Requests
When rate limits are exceeded, return HTTP 429 status code with clear error messages and retry guidance.
interface RateLimitErrorResponse {
    error: string;
    message: string;
    retryAfter: number;      // Seconds until retry allowed
    limit: number;           // Total requests allowed
    window: number;          // Window size in seconds
    documentation?: string;  // Link to rate limit docs
}

function createRateLimitResponse(info: RateLimitInfo): RateLimitErrorResponse {
    return {
        error: 'RATE_LIMIT_EXCEEDED',
        message: 'You have exceeded your rate limit. Please wait before making additional requests.',
        retryAfter: info.reset - Math.floor(Date.now() / 1000),
        limit: info.limit,
        window: 60, // example: a 60-second window
        documentation: 'https://docs.example.com/api/rate-limits'
    };
}
Provide actionable error responses that help developers understand and resolve the issue.
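One detail worth handling on the consuming side: per RFC 9110, Retry-After may carry either a number of seconds or an HTTP-date, so robust clients should parse both forms. A small sketch:

```typescript
// Parse a Retry-After header into a wait time in milliseconds.
// RFC 9110 allows either delay-seconds ("47") or an HTTP-date.
function retryAfterMs(header: string, nowMs: number = Date.now()): number {
  if (/^\d+$/.test(header.trim())) {
    return parseInt(header, 10) * 1000; // delay-seconds form
  }
  const when = Date.parse(header); // HTTP-date form
  return isNaN(when) ? 0 : Math.max(0, when - nowMs);
}
```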
DDoS Protection Strategies
Rate limiting is a crucial component of DDoS (Distributed Denial of Service) protection, but a complete strategy requires multiple layers.
Multi-Layer Defense
Layer 1: CDN and DDoS Protection Services Services like Cloudflare, AWS Shield, and Akamai provide network-level DDoS protection, filtering malicious traffic before it reaches your infrastructure. They detect volumetric attacks (high bandwidth), protocol attacks (SYN floods), and application layer attacks.
Layer 2: Load Balancer Connection Limits Configure load balancers to limit concurrent connections per IP and total connections to prevent resource exhaustion.
Layer 3: API Gateway Rate Limiting Implement the rate limiting strategies discussed earlier to control request rates per user/IP/API key.
Layer 4: Web Application Firewall (WAF) WAF rules detect malicious patterns (SQL injection, XSS) and can automatically block suspicious IPs exhibiting attack behaviors.
Layer 5: Application Business Logic Implement operation-specific limits (login attempts, password resets, expensive queries) based on business context.
Detecting and Mitigating Attacks
@Service
public class DDoSDetectionService {

    private static final Logger log = LoggerFactory.getLogger(DDoSDetectionService.class);
    // Baseline per-IP requests/minute; tune this from observed traffic
    private static final long NORMAL_RATE = 600;

    private final RateLimiter rateLimiter;
    private final MetricsRegistry metrics;
    private final StringRedisTemplate redisTemplate;

    @Scheduled(fixedDelay = 60000) // Every minute
    public void detectAnomalies() {
        // Detect IPs with unusually high request rates
        Map<String, Long> ipRequestCounts = getRecentRequestsByIp();
        for (Map.Entry<String, Long> entry : ipRequestCounts.entrySet()) {
            String ip = entry.getKey();
            long requests = entry.getValue();

            // Threshold: 10x normal traffic
            if (requests > NORMAL_RATE * 10) {
                log.warn("Suspicious traffic from IP {}: {} requests/min", ip, requests);

                // Automatically block aggressive IPs
                if (requests > NORMAL_RATE * 50) {
                    blockIp(ip, Duration.ofHours(1));
                    alertSecurityTeam(ip, requests);
                }
            }
        }
    }

    private void blockIp(String ip, Duration duration) {
        // Add to Redis blocklist with expiration
        redisTemplate.opsForValue().set(
            "blocked:ip:" + ip,
            "auto-blocked for suspicious traffic",
            duration
        );
        metrics.counter("ddos.ips.blocked").increment();
    }

    public boolean isBlocked(String ip) {
        return redisTemplate.hasKey("blocked:ip:" + ip);
    }
}
// Middleware to check IP blocklist
@Component
public class IpBlocklistFilter implements Filter {

    private final DDoSDetectionService ddosDetection;

    public IpBlocklistFilter(DDoSDetectionService ddosDetection) {
        this.ddosDetection = ddosDetection;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        String ip = getClientIp((HttpServletRequest) request);
        if (ddosDetection.isBlocked(ip)) {
            HttpServletResponse httpResponse = (HttpServletResponse) response;
            httpResponse.setStatus(HttpStatus.FORBIDDEN.value());
            httpResponse.getWriter().write("Access denied");
            return;
        }
        chain.doFilter(request, response);
    }
}
DDoS Detection Indicators:
- Sudden spike in traffic from specific IPs or regions
- High percentage of requests to expensive endpoints
- Unusual request patterns (sequential IDs, parameter fuzzing)
- Many requests returning errors (404, 401)
- Requests with suspicious user agents or missing headers
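The "10x normal traffic" heuristic from the detection service reduces to a small pure function over a per-IP request count snapshot. Thresholds here mirror the Java sketch and are illustrative only:

```typescript
type Verdict = "ok" | "suspicious" | "block";

// Classify per-IP request counts against a baseline rate, mirroring the
// thresholds in the detection service above (>10x: flag, >50x: block).
function classifyIps(
  counts: Map<string, number>,
  normalRate: number
): Map<string, Verdict> {
  const verdicts = new Map<string, Verdict>();
  for (const [ip, requests] of counts) {
    if (requests > normalRate * 50) verdicts.set(ip, "block");
    else if (requests > normalRate * 10) verdicts.set(ip, "suspicious");
    else verdicts.set(ip, "ok");
  }
  return verdicts;
}
```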
CAPTCHA and Challenge-Response
For public endpoints vulnerable to abuse (login, registration, password reset), implement CAPTCHA or other challenge-response mechanisms after rate limit thresholds.
@Service
public class LoginService {

    private final RateLimiter rateLimiter;
    private final CaptchaService captchaService;

    public LoginService(RateLimiter rateLimiter, CaptchaService captchaService) {
        this.rateLimiter = rateLimiter;
        this.captchaService = captchaService;
    }

    public LoginResponse login(LoginRequest request, String ip) {
        String key = "login:" + ip;

        // Check failed login attempts
        long failedAttempts = getFailedAttempts(key);

        // Require CAPTCHA after 3 failed attempts
        if (failedAttempts >= 3) {
            if (request.getCaptchaToken() == null) {
                return LoginResponse.requireCaptcha();
            }
            if (!captchaService.verify(request.getCaptchaToken())) {
                return LoginResponse.invalidCaptcha();
            }
        }

        // Check rate limit (10 attempts per 15 minutes)
        if (!rateLimiter.allowRequest(key, 10, 900)) {
            return LoginResponse.rateLimitExceeded();
        }

        // Attempt authentication
        User user = authenticateUser(request);
        if (user == null) {
            incrementFailedAttempts(key);
            return LoginResponse.authenticationFailed();
        }

        clearFailedAttempts(key);
        return LoginResponse.success(user);
    }
}
CAPTCHAs add friction but significantly reduce automated attacks. Use them judiciously - only for sensitive operations and after initial rate limit violations.
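The login flow above is effectively a tiered escalation policy. Isolated as a pure function (thresholds copied from the example), the policy is trivial to unit-test and tune:

```typescript
type LoginGate = "allow" | "require_captcha" | "rate_limited";

// Decide the gate for a login attempt: CAPTCHA after 3 failures,
// hard rate limit at 10 attempts in the window (as in the example above).
function loginGate(failedAttempts: number, attemptsInWindow: number): LoginGate {
  if (attemptsInWindow >= 10) return "rate_limited";
  if (failedAttempts >= 3) return "require_captcha";
  return "allow";
}
```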
GraphQL Rate Limiting
GraphQL presents unique rate limiting challenges because clients construct arbitrary queries with variable complexity. Simple request counting is insufficient since a single query might be cheap or extremely expensive.
Query Cost Analysis
Implement query cost analysis where each field has an assigned cost, and total query cost must stay within limits.
// GraphQL query cost calculator
interface FieldCost {
    [fieldName: string]: number;
}

const fieldCosts: FieldCost = {
    'User.id': 0,
    'User.name': 1,
    'User.email': 1,
    'User.posts': 5,        // Expensive: requires join
    'Post.comments': 10,    // Very expensive: nested join
    'Search.results': 20    // Expensive: full-text search
};

function calculateQueryCost(query: DocumentNode): number {
    let totalCost = 0;
    visit(query, {
        Field(node) {
            const parentType = getParentType(node);
            const fieldName = `${parentType}.${node.name.value}`;
            const cost = fieldCosts[fieldName] || 1;

            // Multiply by list size if a `first` argument is present.
            // Argument values are AST nodes, so extract the integer literal.
            const firstArg = node.arguments?.find(arg => arg.name.value === 'first');
            const listSize = firstArg && firstArg.value.kind === 'IntValue'
                ? parseInt(firstArg.value.value, 10)
                : 1;
            totalCost += cost * listSize;
        }
    });
    return totalCost;
}
// GraphQL middleware for cost-based rate limiting
const costRateLimitPlugin: Plugin = {
    async requestDidStart() {
        return {
            async didResolveOperation(requestContext) {
                const query = requestContext.document;
                const cost = calculateQueryCost(query);
                const userId = requestContext.context.userId;

                const allowed = await rateLimiter.checkLimit(
                    `graphql:${userId}`,
                    1000, // 1000 cost points per hour
                    3600
                );

                if (!allowed) {
                    throw new GraphQLError('GraphQL rate limit exceeded', {
                        extensions: {
                            code: 'RATE_LIMIT_EXCEEDED',
                            cost: cost,
                            limit: 1000
                        }
                    });
                }

                // Deduct cost from quota
                await rateLimiter.consumePoints(`graphql:${userId}`, cost);
            }
        };
    }
};
Query cost analysis ensures expensive nested queries consume more quota than simple queries, providing fair resource allocation.
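To make the arithmetic concrete, here is a simplified, standalone version of the same idea that walks a plain selection tree instead of a GraphQL AST. The tree shape and `FieldSelection` type are invented for illustration; the field costs are copied from the table above.

```typescript
interface FieldSelection {
  field: string;              // e.g. "User.posts"
  first?: number;             // list size argument, if any
  children?: FieldSelection[];
}

const costs: Record<string, number> = {
  "User.id": 0, "User.name": 1, "User.email": 1,
  "User.posts": 5, "Post.comments": 10, "Search.results": 20,
};

// Cost of a selection = list size * (own cost + children's cost), so nested
// lists multiply -- the usual reason deeply nested queries get expensive.
function cost(sel: FieldSelection): number {
  const size = sel.first ?? 1;
  const own = costs[sel.field] ?? 1;
  const children = (sel.children ?? []).reduce((sum, c) => sum + cost(c), 0);
  return size * (own + children);
}
```

Under this scheme, fetching 10 posts with 5 comments each costs 10 * (5 + 5 * 10) = 550 points, more than two orders of magnitude above a simple `User.name` lookup.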
Query Depth and Complexity Limits
In addition to cost analysis, limit query depth and complexity to prevent malicious queries that could cause excessive database load.
import { ValidationRule } from 'graphql';

// Limit query depth
function depthLimitRule(maxDepth: number): ValidationRule {
    return (context) => ({
        Field(node) {
            const depth = getDepth(node);
            if (depth > maxDepth) {
                context.reportError(
                    new GraphQLError(`Query exceeds maximum depth of ${maxDepth}`)
                );
            }
        }
    });
}

// Limit query complexity
function complexityLimitRule(maxComplexity: number): ValidationRule {
    return (context) => {
        let complexity = 0;
        return {
            Field(node) {
                complexity += calculateFieldComplexity(node);
                if (complexity > maxComplexity) {
                    context.reportError(
                        new GraphQLError(`Query exceeds maximum complexity of ${maxComplexity}`)
                    );
                }
            }
        };
    };
}

// Apply validation rules
const server = new ApolloServer({
    typeDefs,
    resolvers,
    validationRules: [
        depthLimitRule(10),
        complexityLimitRule(1000)
    ]
});
Combining cost analysis, depth limits, and complexity limits provides comprehensive protection against GraphQL query abuse.
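Since `getDepth` above is left abstract, a self-contained depth check over a simplified selection tree (the `SelNode` shape is invented for illustration) might look like:

```typescript
interface SelNode {
  name: string;
  children?: SelNode[];
}

// Depth of a query: the longest chain of nested selections.
function depth(node: SelNode): number {
  const kids = node.children ?? [];
  return 1 + (kids.length ? Math.max(...kids.map(depth)) : 0);
}

// Validation helper: reject queries nested deeper than maxDepth.
function exceedsDepth(root: SelNode, maxDepth: number): boolean {
  return depth(root) > maxDepth;
}
```

A query selecting `user { posts { comments } }` has depth 3, so it passes a limit of 10 but fails a limit of 2.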
Monitoring and Alerting
Effective rate limiting requires continuous monitoring to understand traffic patterns, detect abuse, and tune limits appropriately.
Key Metrics
@Component
public class RateLimitMetrics {

    private final MeterRegistry registry;

    public RateLimitMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordRateLimitCheck(String userId, boolean allowed, String endpoint) {
        // Count allowed vs rejected requests.
        // Note: per-user tags can explode metric cardinality; consider
        // bucketing users (e.g., by plan tier) in high-traffic systems.
        registry.counter("rate_limit.requests",
            "user", userId,
            "allowed", String.valueOf(allowed),
            "endpoint", endpoint
        ).increment();

        if (!allowed) {
            // Track rate limit violations separately
            registry.counter("rate_limit.violations",
                "user", userId,
                "endpoint", endpoint
            ).increment();
        }
    }

    public void recordRateLimitLatency(Duration latency) {
        // Track overhead of rate limit checking
        registry.timer("rate_limit.check.duration").record(latency);
    }

    @Scheduled(fixedDelay = 60000)
    public void recordAggregateMetrics() {
        // Calculate rejection rate
        double rejectionRate = calculateRejectionRate();
        registry.gauge("rate_limit.rejection.rate", rejectionRate);

        // Track users hitting limits
        long usersHittingLimits = countUsersHittingLimits();
        registry.gauge("rate_limit.users.limited", usersHittingLimits);
    }
}
Essential Metrics:
- Rejection rate: Percentage of requests rejected (high rate may indicate limits too strict)
- Per-endpoint violations: Which endpoints are rate limited most often
- User distribution: How many users hit limits (widespread vs few abusers)
- Rate limit check latency: Overhead added by rate limiting
- Redis connection failures: Availability of distributed rate limit state
Alerts
# Prometheus alert rules for rate limiting (metric names as rendered by the
# Micrometer Prometheus registry: counters gain a _total suffix, timers a
# _seconds base unit)
groups:
  - name: rate_limit_alerts
    rules:
      - alert: HighRateLimitRejectionRate
        expr: rate(rate_limit_requests_total{allowed="false"}[5m]) / rate(rate_limit_requests_total[5m]) > 0.1
        for: 10m
        annotations:
          summary: "More than 10% of requests are rate limited"
          description: "Consider investigating if limits are too strict or if there's an attack"

      - alert: RateLimitCheckSlow
        expr: histogram_quantile(0.95, rate(rate_limit_check_duration_seconds_bucket[5m])) > 0.05
        for: 5m
        annotations:
          summary: "Rate limit checks taking too long"
          description: "95th percentile latency is {{ $value }}s, check Redis performance"

      - alert: ManyUsersHittingLimits
        expr: rate_limit_users_limited > 100
        for: 15m
        annotations:
          summary: "Many users hitting rate limits"
          description: "{{ $value }} users are hitting rate limits, possible DDoS or limit too strict"
Set up alerts for anomalous rate limiting behavior to quickly detect attacks or configuration issues.
Related Topics
Rate limiting integrates closely with other system design concerns:
- Caching Strategies: Effective caching reduces load and prevents hitting rate limits; cache stampedes can trigger rate limits
- API Design: Design APIs with rate limiting in mind - expose headers, document limits, provide bulk endpoints to reduce request counts
- Security Best Practices: Rate limiting is one layer of defense-in-depth security strategy
- Observability: Monitor rate limit metrics alongside application metrics for comprehensive visibility
- Spring Boot: Spring Cloud Gateway and Spring Boot provide rate limiting integrations
- Performance Optimization: Rate limiting protects performance under load but must be tuned to avoid limiting legitimate traffic