LLD Case Study: Extensible Rate Limiter
Design a thread-safe, extensible rate limiter supporting token bucket and sliding window log algorithms, with clean interfaces, deterministic testability, and a clear migration path from a single-process limiter to a distributed Redis-backed cluster. Covers the exact concurrency pitfalls — race conditions in read-modify-write, thundering-herd on cold buckets — and how production systems like Stripe, Cloudflare, and Kong solve them.
Why Rate Limiters Are a Core LLD Interview Problem
Rate limiters appear at every major API gateway: Stripe limits API calls per key, GitHub throttles REST and GraphQL separately, Kong and Nginx enforce request-per-second at the edge. They are a canonical LLD problem because they test algorithm selection (which algorithm for which traffic shape?), interface design (how do you swap algorithms without breaking callers?), concurrency (read-modify-write race conditions), and extensibility (local → distributed without changing the public API).
The naive answer — a counter with a reset timestamp — is the fixed window algorithm. It's simple but has a well-known burst problem: 100 req/min allows 200 requests in a 2-second window spanning a minute boundary. The interviewer's follow-up question is always: "how do you fix that?" The answer requires knowing at least one of: sliding window log, sliding window counter, or token bucket.
The candidate gap: A mid-level engineer implements fixed window and adds a mutex. A senior engineer chooses token bucket or sliding window based on the use case, isolates the algorithm behind an interface, and injects clock + storage for deterministic tests. A staff engineer handles distributed correctness (Lua scripts for atomic Redis operations), failure policy (fail-open vs. fail-closed), and tenant-specific limits via a policy resolver.
What Earns Each Level
6/10: Implements a fixed window counter. Has a bug where two threads can both read the count before either increments — over-allowing under load. Cannot explain the burst problem at window boundaries.
8/10: Implements token bucket with correct refill logic. Isolates the algorithm behind an interface. Injects a clock abstraction for deterministic testing. Adds per-key locking or an atomic CAS loop to prevent the race condition.
10/10: Knows all five algorithms and when to use each (see comparison table). Uses Redis + Lua scripts so the distributed read-modify-write is atomic. Handles the failure policy explicitly: "when Redis is unavailable, do we fail-open or fail-closed?" Implements tenant-specific limits via a PolicyResolver. Defines idempotency for retried requests. Names real systems: Stripe uses token bucket backed by Redis and Lua scripts, Cloudflare uses sliding window counters, NGINX uses leaky bucket.
SEDIE Walkthrough for Rate Limiter
Scope
Limit requests per key (API key, user ID, IP) to at most N requests per time window. Thread-safe under concurrent requests. Algorithm swappable without changing the caller. Storage swappable (in-memory → Redis) without changing the algorithm.
Extract Entities
RateLimiter (public API: allow(key) → bool). Algorithm interface (allow(key, store, clock) → Decision). Store interface (get, set, atomic_increment, atomic_cas). Clock interface (now() → float). Decision value object (allowed: bool, remaining: int, retry_after_ms: int).
Design Classes
RateLimiter owns: Algorithm, Store, Clock. Algorithm implementations: FixedWindowAlgorithm, TokenBucketAlgorithm, SlidingWindowLogAlgorithm. Store implementations: InMemoryStore (test/local), RedisStore (production). Clock implementations: SystemClock, FakeClock (test).
Implement Critical Methods
Algorithm.allow(key, store, clock): read current state, compute decision, write new state — all atomically. The read-modify-write must be a single atomic operation: use a lock in InMemoryStore, a Lua script in RedisStore. Never read state and write state in two separate non-atomic calls.
Extensibility
Add leaky bucket: new Algorithm implementation only. Add per-tenant limits: PolicyResolver maps key → (limit, window) before calling Algorithm. Add Redis backend: new Store implementation. None of these change RateLimiter, Decision, or existing Algorithm classes.
Class Architecture — Strategy + Dependency Injection
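The Algorithm is a Strategy: RateLimiter only knows the Algorithm interface, never a concrete implementation, and every collaborator (Algorithm, Store, Clock, PolicyResolver) arrives through the constructor. Swapping any of them is a composition change, not a code change. A minimal wiring sketch, using the classes defined in the implementation section below (the premium-tenant key is illustrative):
# Composition root: every dependency is injected; nothing is constructed inside RateLimiter.
store = InMemoryStore()                    # swap for RedisStore in production, same Store interface
algorithm = TokenBucketAlgorithm(store)    # any Algorithm implementation plugs in here
clock = SystemClock()                      # tests inject FakeClock instead
policies = PolicyResolver(
    default=RatePolicy(limit=100, window_ms=60_000),
    overrides={"tenant_premium": RatePolicy(limit=1_000, window_ms=60_000)},  # illustrative override
)
limiter = RateLimiter(algorithm, clock, policies, fail_open=True)

decision = limiter.allow("api_key_123")
if not decision.allowed:
    print(f"429 Too Many Requests, retry after {decision.retry_after_ms} ms")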
Algorithm Deep Dive — Five Algorithms and When to Use Each
Fixed Window Counter — simplest. Divide time into windows of size W seconds. Maintain a counter per (key, window_id) where window_id = floor(now / W). Increment on each request; allow if counter ≤ limit. Reset at window boundary.
Problem: boundary burst. With limit=100/min, a client sends 100 requests at 11:59:59 and 100 requests at 12:00:01. Both windows show count=100 (within limit), but 200 requests occurred in a 2-second span. Fixed window allows 2× the intended limit across window boundaries.
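A minimal, single-threaded sketch makes the boundary burst easy to see (illustrative only: a plain dict, no locking, and these names are not part of the implementation section below):
# Fixed window counter (sketch): one counter per (key, window_id).
# Requests at 11:59:59 and 12:00:01 fall into different window_ids, so each
# window's counter stays within the limit even though 2 × limit requests
# arrived two seconds apart.
counters: dict[tuple[str, int], int] = {}

def fixed_window_allow(key: str, limit: int, window_ms: int, now_ms: int) -> bool:
    window_id = now_ms // window_ms            # floor(now / W)
    count = counters.get((key, window_id), 0)
    if count >= limit:
        return False
    counters[(key, window_id)] = count + 1     # read-modify-write: needs a lock under concurrency
    return True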
Sliding Window Log — exact. Maintain a sorted log of request timestamps per key. On each request: (1) remove all entries older than now - window_ms from the log, (2) check len(log) < limit, (3) if allowed, append now to the log. No boundary burst — the window truly slides. Problem: memory cost. At 1,000 req/min, each key's log holds up to 1,000 timestamps. At 1M keys that is up to a billion stored timestamps: roughly 8 GB of raw 64-bit values, and tens of gigabytes with per-entry overhead in a naive implementation.
Sliding Window Counter — pragmatic compromise. Interpolates between two adjacent fixed windows: count ≈ prev_window_count × (1 - elapsed_fraction) + curr_window_count. Approximate (±0.1% error empirically) but O(1) memory per key. Cloudflare uses this for edge rate limiting.
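The interpolation itself is a one-liner. A sketch of the weighted check, assuming the current and previous fixed-window counters are already maintained per key (the function and parameter names are illustrative):
# Sliding window counter (sketch): weight the previous window's count by how much
# of it still overlaps the sliding window.
def sliding_window_counter_allow(prev_count: int, curr_count: int,
                                 limit: int, window_ms: int, now_ms: int) -> bool:
    elapsed_fraction = (now_ms % window_ms) / window_ms           # how far into the current window we are
    estimated = prev_count * (1 - elapsed_fraction) + curr_count
    return estimated < limit

# Example: limit=100/min, 30 s into the current minute, previous window had 80 requests,
# current window has 50 so far → estimated = 80 × 0.5 + 50 = 90 → allowed.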
Token Bucket — burst-friendly. A bucket holds up to capacity tokens. Tokens refill at rate tokens/second. Each request consumes 1 token. If bucket is empty: deny. This is the most flexible algorithm: capacity controls burst size, rate controls sustained throughput. A client can burst to capacity requests immediately after a quiet period. Stripe uses token bucket for API rate limits.
Leaky Bucket — smooth output. Requests queue up; they're processed at a fixed rate (the "leak" rate). This smooths bursty input into constant output. Use case: rate-limit calls to a downstream service to protect it from bursts, not just the client.
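A common implementation avoids an actual queue by treating the bucket as a meter: track a virtual water level that drains at the leak rate and reject (or delay) any request that would overflow it. A sketch of that meter variant (names are illustrative; the queue-based variant described above would instead enqueue requests and drain them on a timer):
# Leaky bucket as a meter (sketch): the level drains at leak_rate_per_ms and each
# request adds one unit; overflow means the request is rejected or delayed.
def leaky_bucket_allow(state: dict, capacity: float, leak_rate_per_ms: float, now_ms: int) -> bool:
    level = state.get("level", 0.0)
    last_ms = state.get("last_ms", now_ms)
    level = max(0.0, level - (now_ms - last_ms) * leak_rate_per_ms)   # drain since last request
    state["last_ms"] = now_ms
    if level + 1.0 > capacity:
        state["level"] = level
        return False                                                  # bucket would overflow
    state["level"] = level + 1.0
    return True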
Algorithm Comparison — When to Use Each
| Algorithm | Memory per Key | Burst Handling | Accuracy | Real System |
|---|---|---|---|---|
| Fixed Window | O(1) — one counter | Allows 2× burst at boundaries | Approximate | Simple Redis INCR + EXPIRE limiters |
| Sliding Window Log | O(requests/window) | Exact — no boundary burst | Exact | Rarely in prod (memory cost) |
| Sliding Window Counter | O(1) — two counters | Near-exact (~0.1% error) | Near-exact | Cloudflare edge limiting |
| Token Bucket | O(1) — tokens + timestamp | Configurable burst via capacity | Exact for sustained rate | Stripe, AWS API Gateway |
| Leaky Bucket | O(queue size) | Absorbs bursts, smooths output | Exact for output rate | NGINX limit_req module |
Token Bucket + Sliding Window Log Implementations
from __future__ import annotations
import threading
import time
from abc import ABC, abstractmethod
from collections import deque
from dataclasses import dataclass
from typing import Optional
# ── Value objects ──────────────────────────────────────────────────────────
@dataclass(frozen=True)
class RatePolicy:
    limit: int           # max requests
    window_ms: int       # window duration in milliseconds

@dataclass(frozen=True)
class Decision:
    allowed: bool
    remaining: int       # tokens/requests remaining (0 if denied, -1 if unknown under fail-open)
    retry_after_ms: int  # ms until a request will be allowed (0 if allowed)
# ── Clock abstraction — critical for deterministic testing ─────────────────
class Clock(ABC):
    @abstractmethod
    def now_ms(self) -> int: ...

class SystemClock(Clock):
    def now_ms(self) -> int:
        return int(time.time() * 1000)

class FakeClock(Clock):
    def __init__(self, start_ms: int = 0):
        self._now = start_ms

    def now_ms(self) -> int:
        return self._now

    def advance(self, delta_ms: int) -> None:
        self._now += delta_ms
# ── Store abstraction ──────────────────────────────────────────────────────
class Store(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def set(self, key: str, value: str, ttl_ms: int) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

class InMemoryStore(Store):
    """Thread-safe in-memory store for local/test use.

    Note: TTL expiry is checked against wall-clock time, not the injected Clock,
    so tests driven by FakeClock should not depend on entries expiring.
    """
    def __init__(self):
        self._data: dict[str, tuple[str, Optional[int]]] = {}  # key → (value, expiry_ms)
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            value, expiry_ms = entry
            if expiry_ms is not None and int(time.time() * 1000) > expiry_ms:
                del self._data[key]
                return None
            return value

    def set(self, key: str, value: str, ttl_ms: int) -> None:
        with self._lock:
            expiry = int(time.time() * 1000) + ttl_ms if ttl_ms > 0 else None
            self._data[key] = (value, expiry)

    def delete(self, key: str) -> None:
        with self._lock:
            self._data.pop(key, None)
# ── Algorithm interface ────────────────────────────────────────────────────
class Algorithm(ABC):
    @abstractmethod
    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision: ...

    @abstractmethod
    def reset(self, key: str) -> None: ...
# ── Token Bucket ───────────────────────────────────────────────────────────
class TokenBucketAlgorithm(Algorithm):
    """
    Bucket capacity = policy.limit (burst size).
    Refill rate = policy.limit tokens per policy.window_ms milliseconds.
    Each request consumes 1 token.
    State per key: (tokens: float, last_refill_ms: int) stored as "tokens:timestamp".
    """
    def __init__(self, store: Store):
        self._store = store
        self._locks: dict[str, threading.Lock] = {}
        self._meta_lock = threading.Lock()

    def _key_lock(self, key: str) -> threading.Lock:
        with self._meta_lock:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]

    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision:
        bucket_key = f"tb:{key}"
        with self._key_lock(bucket_key):
            now_ms = clock.now_ms()
            raw = self._store.get(bucket_key)
            if raw is None:
                tokens = float(policy.limit)
                last_refill = now_ms
            else:
                parts = raw.split(":")
                tokens, last_refill = float(parts[0]), int(parts[1])
            # Refill based on elapsed time
            elapsed_ms = now_ms - last_refill
            refill_rate = policy.limit / policy.window_ms  # tokens per ms
            tokens = min(float(policy.limit), tokens + elapsed_ms * refill_rate)
            last_refill = now_ms
            if tokens >= 1.0:
                tokens -= 1.0
                self._store.set(bucket_key, f"{tokens}:{last_refill}", policy.window_ms * 2)
                return Decision(allowed=True, remaining=int(tokens), retry_after_ms=0)
            else:
                self._store.set(bucket_key, f"{tokens}:{last_refill}", policy.window_ms * 2)
                # Time until 1 token is available
                wait_ms = int((1.0 - tokens) / refill_rate)
                return Decision(allowed=False, remaining=0, retry_after_ms=wait_ms)

    def reset(self, key: str) -> None:
        self._store.delete(f"tb:{key}")
# ── Sliding Window Log ─────────────────────────────────────────────────────
class SlidingWindowLogAlgorithm(Algorithm):
    """
    Exact sliding window: maintains a deque of request timestamps per key.
    Memory cost: O(limit) per key (at most `limit` timestamps in the window).
    Correct across window boundaries — no burst at window edges.
    """
    def __init__(self):
        self._logs: dict[str, deque[int]] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._meta_lock = threading.Lock()

    def _key_lock(self, key: str) -> threading.Lock:
        with self._meta_lock:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]

    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision:
        with self._key_lock(key):
            now_ms = clock.now_ms()
            window_start = now_ms - policy.window_ms
            if key not in self._logs:
                self._logs[key] = deque()
            log = self._logs[key]
            # Evict timestamps outside the sliding window
            while log and log[0] <= window_start:
                log.popleft()
            if len(log) < policy.limit:
                log.append(now_ms)
                remaining = policy.limit - len(log)
                return Decision(allowed=True, remaining=remaining, retry_after_ms=0)
            else:
                # Earliest request will exit window at: log[0] + window_ms
                retry_at = log[0] + policy.window_ms - now_ms
                return Decision(allowed=False, remaining=0, retry_after_ms=max(0, retry_at))

    def reset(self, key: str) -> None:
        with self._meta_lock:
            self._logs.pop(key, None)
            self._locks.pop(key, None)
# ── PolicyResolver ─────────────────────────────────────────────────────────
class PolicyResolver:
    """Maps a key to its rate policy. Supports per-tenant overrides."""
    def __init__(self, default: RatePolicy, overrides: Optional[dict[str, RatePolicy]] = None):
        self._default = default
        self._overrides = overrides or {}

    def resolve(self, key: str) -> RatePolicy:
        return self._overrides.get(key, self._default)
# ── RateLimiter — thin orchestrator ────────────────────────────────────────
class RateLimiter:
    def __init__(
        self,
        algorithm: Algorithm,
        clock: Clock,
        policy_resolver: PolicyResolver,
        fail_open: bool = True,  # fail-open: allow requests when limiter errors
    ):
        self._algorithm = algorithm
        self._clock = clock
        self._policy_resolver = policy_resolver
        self._fail_open = fail_open

    def allow(self, key: str) -> Decision:
        try:
            policy = self._policy_resolver.resolve(key)
            return self._algorithm.allow(key, policy, self._clock)
        except Exception:
            # Storage/clock failure: apply failure policy
            if self._fail_open:
                return Decision(allowed=True, remaining=-1, retry_after_ms=0)
            else:
                return Decision(allowed=False, remaining=0, retry_after_ms=1000)

    def reset(self, key: str) -> None:
        self._algorithm.reset(key)
Concurrency — The Read-Modify-Write Race
The most critical correctness requirement for a rate limiter is atomicity of the read-modify-write. This is where most naive implementations fail under load.
The race: Two threads A and B both want to check if api_key_123 has tokens remaining. Thread A reads tokens=1, thread B reads tokens=1 (before A writes). A decrements to 0 and writes. B decrements from 1 to 0 and writes. Both requests are allowed. Token was consumed twice — the limiter has been bypassed. At 10K concurrent requests, this can allow a multiple of the configured limit.
Fix for in-process concurrency: Per-key locking. Each key has its own threading.Lock(). The entire read-modify-write cycle runs inside with self._key_lock(key):. Only threads competing for the same key serialize; threads for different keys run fully in parallel. This is the same fine-grained locking idea used in the thread-safe LRU cache case study.
Why not a global lock? A global lock serializes all rate limit checks. At 100K requests/second, all threads queue behind one lock — the rate limiter becomes the throughput bottleneck for the entire API tier.
Fix for distributed concurrency (Redis): In-process locks don't span servers. The correct Redis approach uses a Lua script that makes the entire read-modify-write atomic on the Redis server:
-- Lua script for token bucket (runs atomically on Redis — no interleaving possible)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per ms
local now_ms = tonumber(ARGV[3])

local state = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now_ms

local elapsed = now_ms - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
last_refill = now_ms

if tokens >= 1 then
    tokens = tokens - 1
    redis.call("HMSET", key, "tokens", tokens, "last_refill", last_refill)
    redis.call("PEXPIRE", key, 86400000)
    return {1, math.floor(tokens)}  -- {allowed, remaining}
else
    redis.call("HMSET", key, "tokens", tokens, "last_refill", last_refill)
    redis.call("PEXPIRE", key, 86400000)
    return {0, 0}
end
Redis executes Lua scripts atomically — no other Redis command can interleave during script execution. This gives distributed rate limiting the same correctness guarantee as a per-key lock in a single process.
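On the application side the migration stays behind the existing interfaces: a Redis-backed algorithm hands the whole read-modify-write to the script, and nothing above it changes. A sketch using the redis-py client (register_script wraps script loading and EVALSHA; the class name and argument layout here are illustrative and match the script above):
import redis  # redis-py client, assumed available in the production environment

TOKEN_BUCKET_LUA = """..."""  # the Lua script shown above, embedded or loaded from a file

class RedisTokenBucketAlgorithm(Algorithm):
    """Token bucket whose read-modify-write runs atomically inside Redis."""
    def __init__(self, client: redis.Redis):
        self._client = client
        self._script = client.register_script(TOKEN_BUCKET_LUA)  # EVALSHA under the hood

    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision:
        refill_rate = policy.limit / policy.window_ms             # tokens per ms
        allowed, remaining = self._script(
            keys=[f"tb:{key}"],
            args=[policy.limit, refill_rate, clock.now_ms()],
        )
        if allowed == 1:
            return Decision(allowed=True, remaining=int(remaining), retry_after_ms=0)
        # Coarse estimate: time to refill one token from empty
        return Decision(allowed=False, remaining=0, retry_after_ms=int(1.0 / refill_rate))

    def reset(self, key: str) -> None:
        self._client.delete(f"tb:{key}")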
Extensibility Matrix — What Changes for Each Extension
| Extension | Classes Changed | Classes Added | Interface Change? |
|---|---|---|---|
| Add leaky bucket algorithm | None | LeakyBucketAlgorithm | No |
| Swap local store → Redis | None | RedisStore | No |
| Per-tenant limits | None | PolicyResolver with tenant map | No |
| Fail-open vs fail-closed | RateLimiter (add fail_open flag) | None | No (same allow() signature) |
| Expose a Retry-After header | None (Decision already carries retry_after_ms) | None | No (the HTTP layer maps retry_after_ms to the header) |
| Request idempotency (retry dedup) | None | IdempotencyFilter (wraps RateLimiter) | No |
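The IdempotencyFilter in the last row is a thin wrapper rather than a change to RateLimiter: a retried request carrying the same request_id gets the original Decision back without consuming another token. A sketch, with a plain dict as the short-lived idempotency cache (production would bound it and evict expired entries):
class IdempotencyFilter:
    """Wraps a RateLimiter so retries of the same request_id do not consume extra tokens."""
    def __init__(self, limiter: RateLimiter, clock: Clock, ttl_ms: int = 60_000):
        self._limiter = limiter
        self._clock = clock
        self._ttl_ms = ttl_ms
        self._seen: dict[str, tuple[int, Decision]] = {}  # request_id → (expires_at_ms, decision)
        self._lock = threading.Lock()

    def allow(self, key: str, request_id: str) -> Decision:
        now_ms = self._clock.now_ms()
        with self._lock:
            cached = self._seen.get(request_id)
            if cached is not None and cached[0] > now_ms:
                return cached[1]                           # replayed retry: same Decision, no extra token
        decision = self._limiter.allow(key)
        with self._lock:
            self._seen[request_id] = (now_ms + self._ttl_ms, decision)
        return decision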
Production Pitfalls
1. Not injecting the clock. time.time() hardcoded in the algorithm makes it impossible to test time-dependent logic without actually waiting. A FakeClock lets tests advance time programmatically — no sleep, no flakiness.
2. Not handling the failure policy. If Redis is unavailable during rate limit check, what happens? Fail-open (allow the request): appropriate for non-critical APIs where availability > rate correctness. Fail-closed (deny): appropriate for billing-critical or abuse-prevention APIs. Not deciding is deciding — the default behavior becomes whatever the exception propagates to.
3. Not accounting for clock skew in distributed systems. If server A's clock is 50ms ahead of server B's, their token bucket refill calculations diverge. Mitigation: use the Redis server's time (TIME command) as the reference clock, not the application server's clock. This centralizes timekeeping.
4. Retries double-counting tokens. A client sends a request, it's allowed (token consumed), the network times out, client retries. The second attempt consumes another token. For idempotent operations, this is fine. For billing or one-time actions, pass a request_id; the rate limiter checks if this request_id was already allowed (in a short-lived idempotency cache) and returns the same Decision without consuming another token.
5. Choosing fixed window for security-critical paths. The 2× burst at window boundaries is a known vulnerability. An attacker can send 2 × limit requests in 2 seconds by timing requests at the boundary. Token bucket or sliding window is the correct choice for security-sensitive endpoints.
Interview Delivery Summary
Lead with the algorithm choice question: "Before choosing an algorithm, I need to know the traffic pattern. Is burst traffic expected and acceptable (use token bucket — configurable burst) or should output be strictly smooth (leaky bucket)? Is exact accuracy at window boundaries required (sliding window log) or is a ~0.1% approximation acceptable (sliding window counter)?"
State the concurrency requirement early: "Rate limiting under concurrency is a read-modify-write problem. The check (are tokens available?) and the write (consume a token) must be atomic — or two concurrent requests can both pass the check and both consume the last token."
Demonstrate testability: "I inject a FakeClock and InMemoryStore. This makes every test deterministic — I advance the fake clock by exactly the refill interval and assert the token count. No real time.sleep() calls."
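As a sketch, that deterministic refill test might look like this (pytest-style; the policy numbers are arbitrary):
def test_token_bucket_refills_after_half_window():
    clock = FakeClock(start_ms=0)
    algo = TokenBucketAlgorithm(InMemoryStore())
    policy = RatePolicy(limit=2, window_ms=1_000)         # 2 tokens per second

    assert algo.allow("k", policy, clock).allowed         # token 1
    assert algo.allow("k", policy, clock).allowed         # token 2
    assert not algo.allow("k", policy, clock).allowed     # bucket empty

    clock.advance(500)                                    # half a window: exactly 1 token refilled
    assert algo.allow("k", policy, clock).allowed
    assert not algo.allow("k", policy, clock).allowed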
Staff signal: "In production with multiple servers, per-key in-process locks are insufficient — they don't span processes. The correct approach is a Redis Lua script for atomic read-modify-write. Stripe's production rate limiter uses exactly this pattern. The Algorithm.allow() interface stays the same — only the Store implementation changes from InMemoryStore to RedisStore."