LLD Case Study: Extensible Rate Limiter
Design a thread-safe, extensible rate limiter supporting token bucket and sliding window log algorithms, with clean interfaces, deterministic testability, and a clear migration path from a single-process limiter to a distributed Redis-backed cluster. Covers the exact concurrency pitfalls — race conditions in read-modify-write, thundering-herd on cold buckets — and how production systems like Stripe, Cloudflare, and Kong solve them.
Why Rate Limiters Are a Core LLD Interview Problem
Rate limiters appear at every major API gateway: Stripe limits API calls per key, GitHub throttles REST and GraphQL separately, Kong and Nginx enforce request-per-second at the edge. They are a canonical LLD problem because they test algorithm selection (which algorithm for which traffic shape?), interface design (how do you swap algorithms without breaking callers?), concurrency (read-modify-write race conditions), and extensibility (local → distributed without changing the public API).
The naive answer — a counter with a reset timestamp — is the fixed window algorithm. It's simple but has a well-known burst problem: 100 req/min allows 200 requests in a 2-second window spanning a minute boundary. The interviewer's follow-up question is always: "how do you fix that?" The answer requires knowing at least one of: sliding window log, sliding window counter, or token bucket.
The candidate gap: A mid-level engineer implements fixed window and adds a mutex. A senior engineer chooses token bucket or sliding window based on the use case, isolates the algorithm behind an interface, and injects clock + storage for deterministic tests. A staff engineer handles distributed correctness (Lua scripts for atomic Redis operations), failure policy (fail-open vs. fail-closed), and tenant-specific limits via a policy resolver.
What Earns Each Level
6/10: Implements a fixed window counter. Has a bug where two threads can both read the count before either increments — over-allowing under load. Cannot explain the burst problem at window boundaries.
8/10: Implements token bucket with correct refill logic. Isolates the algorithm behind an interface. Injects a clock abstraction for deterministic testing. Adds per-key locking or an atomic CAS loop to prevent the race condition.
10/10: Knows all five algorithms and when to use each (see comparison table). Uses Redis + Lua scripts so the distributed read-modify-write is atomic. Handles the failure policy explicitly: "when Redis is unavailable, do we fail-open or fail-closed?" Implements tenant-specific limits via a PolicyResolver. Defines idempotency for retried requests. Names real systems: Stripe uses token bucket backed by Redis and Lua scripts, Cloudflare uses sliding window counters, NGINX uses leaky bucket.
SEDIE Walkthrough for Rate Limiter
Scope
Limit requests per key (API key, user ID, IP) to at most N requests per time window. Thread-safe under concurrent requests. Algorithm swappable without changing the caller. Storage swappable (in-memory → Redis) without changing the algorithm.
Extract Entities
RateLimiter (public API: allow(key) → bool). Algorithm interface (allow(key, store, clock) → Decision). Store interface (get, set, atomic_increment, atomic_cas). Clock interface (now() → float). Decision value object (allowed: bool, remaining: int, retry_after_ms: int).
Design Classes
RateLimiter owns: Algorithm, Store, Clock. Algorithm implementations: FixedWindowAlgorithm, TokenBucketAlgorithm, SlidingWindowLogAlgorithm. Store implementations: InMemoryStore (test/local), RedisStore (production). Clock implementations: SystemClock, FakeClock (test).
Implement Critical Methods
Algorithm.allow(key, store, clock): read current state, compute decision, write new state — all atomically. The read-modify-write must be a single atomic operation: use a lock in InMemoryStore, a Lua script in RedisStore. Never read state and write state in two separate non-atomic calls.
Extensibility
Add leaky bucket: new Algorithm implementation only. Add per-tenant limits: PolicyResolver maps key → (limit, window) before calling Algorithm. Add Redis backend: new Store implementation. None of these change RateLimiter, Decision, or existing Algorithm classes.
Class Architecture — Strategy + Dependency Injection
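The Algorithm is a Strategy: RateLimiter only knows the Algorithm interface, never a concrete implementation, and every collaborator (Algorithm, Store, Clock, PolicyResolver) arrives through the constructor. Swapping any of them is a composition change, not a code change. A minimal wiring sketch, using the classes defined in the implementation section below (the premium-tenant key is illustrative):
# Composition root: every dependency is injected; nothing is constructed inside RateLimiter.
store = InMemoryStore()                    # swap for RedisStore in production, same Store interface
algorithm = TokenBucketAlgorithm(store)    # any Algorithm implementation plugs in here
clock = SystemClock()                      # tests inject FakeClock instead
policies = PolicyResolver(
    default=RatePolicy(limit=100, window_ms=60_000),
    overrides={"tenant_premium": RatePolicy(limit=1_000, window_ms=60_000)},  # illustrative override
)
limiter = RateLimiter(algorithm, clock, policies, fail_open=True)

decision = limiter.allow("api_key_123")
if not decision.allowed:
    print(f"429 Too Many Requests, retry after {decision.retry_after_ms} ms")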
Algorithm Deep Dive — Five Algorithms and When to Use Each
Fixed Window Counter — simplest. Divide time into windows of size W seconds. Maintain a counter per (key, window_id) where window_id = floor(now / W). Increment on each request; allow if counter ≤ limit. Reset at window boundary.
Problem: boundary burst. With limit=100/min, a client sends 100 requests at 11:59:59 and 100 requests at 12:00:01. Both windows show count=100 (within limit), but 200 requests occurred in a 2-second span. Fixed window allows 2× the intended limit across window boundaries.
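A minimal, single-threaded sketch makes the boundary burst easy to see (illustrative only: a plain dict, no locking, and these names are not part of the implementation section below):
# Fixed window counter (sketch): one counter per (key, window_id).
# Requests at 11:59:59 and 12:00:01 fall into different window_ids, so each
# window's counter stays within the limit even though 2 × limit requests
# arrived two seconds apart.
counters: dict[tuple[str, int], int] = {}

def fixed_window_allow(key: str, limit: int, window_ms: int, now_ms: int) -> bool:
    window_id = now_ms // window_ms            # floor(now / W)
    count = counters.get((key, window_id), 0)
    if count >= limit:
        return False
    counters[(key, window_id)] = count + 1     # read-modify-write: needs a lock under concurrency
    return True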
Sliding Window Log — exact. Maintain a sorted log of request timestamps per key. On each request: (1) remove all entries older than now - window_ms from the log, (2) check len(log) < limit, (3) if allowed, append now to the log. No boundary burst — the window truly slides. Problem: memory cost. At 1,000 req/min, each key's log holds up to 1,000 timestamps. At 1M keys that is up to a billion stored timestamps: roughly 8 GB of raw 64-bit values, and tens of gigabytes with per-entry overhead in a naive implementation.
Sliding Window Counter — pragmatic compromise. Interpolates between two adjacent fixed windows: count ≈ prev_window_count × (1 - elapsed_fraction) + curr_window_count. Approximate (±0.1% error empirically) but O(1) memory per key. Cloudflare uses this for edge rate limiting.
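The interpolation itself is a one-liner. A sketch of the weighted check, assuming the current and previous fixed-window counters are already maintained per key (the function and parameter names are illustrative):
# Sliding window counter (sketch): weight the previous window's count by how much
# of it still overlaps the sliding window.
def sliding_window_counter_allow(prev_count: int, curr_count: int,
                                 limit: int, window_ms: int, now_ms: int) -> bool:
    elapsed_fraction = (now_ms % window_ms) / window_ms           # how far into the current window we are
    estimated = prev_count * (1 - elapsed_fraction) + curr_count
    return estimated < limit

# Example: limit=100/min, 30 s into the current minute, previous window had 80 requests,
# current window has 50 so far → estimated = 80 × 0.5 + 50 = 90 → allowed.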
Token Bucket — burst-friendly. A bucket holds up to capacity tokens. Tokens refill at rate tokens/second. Each request consumes 1 token. If bucket is empty: deny. This is the most flexible algorithm: capacity controls burst size, rate controls sustained throughput. A client can burst to capacity requests immediately after a quiet period. Stripe uses token bucket for API rate limits.
Leaky Bucket — smooth output. Requests queue up; they're processed at a fixed rate (the "leak" rate). This smooths bursty input into constant output. Use case: rate-limit calls to a downstream service to protect it from bursts, not just the client.
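A common implementation avoids an actual queue by treating the bucket as a meter: track a virtual water level that drains at the leak rate and reject (or delay) any request that would overflow it. A sketch of that meter variant (names are illustrative; the queue-based variant described above would instead enqueue requests and drain them on a timer):
# Leaky bucket as a meter (sketch): the level drains at leak_rate_per_ms and each
# request adds one unit; overflow means the request is rejected or delayed.
def leaky_bucket_allow(state: dict, capacity: float, leak_rate_per_ms: float, now_ms: int) -> bool:
    level = state.get("level", 0.0)
    last_ms = state.get("last_ms", now_ms)
    level = max(0.0, level - (now_ms - last_ms) * leak_rate_per_ms)   # drain since last request
    state["last_ms"] = now_ms
    if level + 1.0 > capacity:
        state["level"] = level
        return False                                                  # bucket would overflow
    state["level"] = level + 1.0
    return True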
Algorithm Comparison — When to Use Each
| Algorithm | Memory per Key | Burst Handling | Accuracy | Real System |
|---|---|---|---|---|
| Fixed Window | O(1) — one counter | Allows 2× burst at boundaries | Approximate | Simple Redis INCR + EXPIRE limiters |
| Sliding Window Log | O(requests/window) | Exact — no boundary burst | Exact | Rarely in prod (memory cost) |
| Sliding Window Counter | O(1) — two counters | Near-exact (~0.1% error) | Near-exact | Cloudflare edge limiting |
| Token Bucket | O(1) — tokens + timestamp | Configurable burst via capacity | Exact for sustained rate | Stripe, AWS API Gateway |
| Leaky Bucket | O(queue size) | Absorbs bursts, smooths output | Exact for output rate | NGINX limit_req module |
Token Bucket + Sliding Window Log Implementations
from __future__ import annotations
import threading
import time
from abc import ABC, abstractmethod
from collections import deque
from dataclasses import dataclass
from typing import Optional
# ── Value objects ──────────────────────────────────────────────────────────
@dataclass(frozen=True)
class RatePolicy:
    limit: int           # max requests
    window_ms: int       # window duration in milliseconds

@dataclass(frozen=True)
class Decision:
    allowed: bool
    remaining: int       # tokens/requests remaining (0 if denied, -1 if unknown under fail-open)
    retry_after_ms: int  # ms until a request will be allowed (0 if allowed)
# ── Clock abstraction — critical for deterministic testing ─────────────────
class Clock(ABC):
    @abstractmethod
    def now_ms(self) -> int: ...

class SystemClock(Clock):
    def now_ms(self) -> int:
        return int(time.time() * 1000)

class FakeClock(Clock):
    def __init__(self, start_ms: int = 0):
        self._now = start_ms

    def now_ms(self) -> int:
        return self._now

    def advance(self, delta_ms: int) -> None:
        self._now += delta_ms
# ── Store abstraction ──────────────────────────────────────────────────────
class Store(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def set(self, key: str, value: str, ttl_ms: int) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

class InMemoryStore(Store):
    """Thread-safe in-memory store for local/test use.

    Note: TTL expiry is checked against wall-clock time, not the injected Clock,
    so tests driven by FakeClock should not depend on entries expiring.
    """
    def __init__(self):
        self._data: dict[str, tuple[str, Optional[int]]] = {}  # key → (value, expiry_ms)
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            value, expiry_ms = entry
            if expiry_ms is not None and int(time.time() * 1000) > expiry_ms:
                del self._data[key]
                return None
            return value

    def set(self, key: str, value: str, ttl_ms: int) -> None:
        with self._lock:
            expiry = int(time.time() * 1000) + ttl_ms if ttl_ms > 0 else None
            self._data[key] = (value, expiry)

    def delete(self, key: str) -> None:
        with self._lock:
            self._data.pop(key, None)
# ── Algorithm interface ────────────────────────────────────────────────────
class Algorithm(ABC):
    @abstractmethod
    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision: ...

    @abstractmethod
    def reset(self, key: str) -> None: ...
# ── Token Bucket ───────────────────────────────────────────────────────────
class TokenBucketAlgorithm(Algorithm):
    """
    Bucket capacity = policy.limit (burst size).
    Refill rate = policy.limit tokens per policy.window_ms milliseconds.
    Each request consumes 1 token.
    State per key: (tokens: float, last_refill_ms: int) stored as "tokens:timestamp".
    """
    def __init__(self, store: Store):
        self._store = store
        self._locks: dict[str, threading.Lock] = {}
        self._meta_lock = threading.Lock()

    def _key_lock(self, key: str) -> threading.Lock:
        with self._meta_lock:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]

    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision:
        bucket_key = f"tb:{key}"
        with self._key_lock(bucket_key):
            now_ms = clock.now_ms()
            raw = self._store.get(bucket_key)
            if raw is None:
                tokens = float(policy.limit)
                last_refill = now_ms
            else:
                parts = raw.split(":")
                tokens, last_refill = float(parts[0]), int(parts[1])
            # Refill based on elapsed time
            elapsed_ms = now_ms - last_refill
            refill_rate = policy.limit / policy.window_ms  # tokens per ms
            tokens = min(float(policy.limit), tokens + elapsed_ms * refill_rate)
            last_refill = now_ms
            if tokens >= 1.0:
                tokens -= 1.0
                self._store.set(bucket_key, f"{tokens}:{last_refill}", policy.window_ms * 2)
                return Decision(allowed=True, remaining=int(tokens), retry_after_ms=0)
            else:
                self._store.set(bucket_key, f"{tokens}:{last_refill}", policy.window_ms * 2)
                # Time until 1 token is available
                wait_ms = int((1.0 - tokens) / refill_rate)
                return Decision(allowed=False, remaining=0, retry_after_ms=wait_ms)

    def reset(self, key: str) -> None:
        self._store.delete(f"tb:{key}")
# ── Sliding Window Log ─────────────────────────────────────────────────────
class SlidingWindowLogAlgorithm(Algorithm):
    """
    Exact sliding window: maintains a deque of request timestamps per key.
    Memory cost: O(limit) per key (at most `limit` timestamps in the window).
    Correct across window boundaries — no burst at window edges.
    """
    def __init__(self):
        self._logs: dict[str, deque[int]] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._meta_lock = threading.Lock()

    def _key_lock(self, key: str) -> threading.Lock:
        with self._meta_lock:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]

    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision:
        with self._key_lock(key):
            now_ms = clock.now_ms()
            window_start = now_ms - policy.window_ms
            if key not in self._logs:
                self._logs[key] = deque()
            log = self._logs[key]
            # Evict timestamps outside the sliding window
            while log and log[0] <= window_start:
                log.popleft()
            if len(log) < policy.limit:
                log.append(now_ms)
                remaining = policy.limit - len(log)
                return Decision(allowed=True, remaining=remaining, retry_after_ms=0)
            else:
                # Earliest request will exit window at: log[0] + window_ms
                retry_at = log[0] + policy.window_ms - now_ms
                return Decision(allowed=False, remaining=0, retry_after_ms=max(0, retry_at))

    def reset(self, key: str) -> None:
        with self._meta_lock:
            self._logs.pop(key, None)
            self._locks.pop(key, None)
# ── PolicyResolver ─────────────────────────────────────────────────────────
class PolicyResolver:
    """Maps a key to its rate policy. Supports per-tenant overrides."""
    def __init__(self, default: RatePolicy, overrides: Optional[dict[str, RatePolicy]] = None):
        self._default = default
        self._overrides = overrides or {}

    def resolve(self, key: str) -> RatePolicy:
        return self._overrides.get(key, self._default)
# ── RateLimiter — thin orchestrator ────────────────────────────────────────
class RateLimiter:
    def __init__(
        self,
        algorithm: Algorithm,
        clock: Clock,
        policy_resolver: PolicyResolver,
        fail_open: bool = True,  # fail-open: allow requests when limiter errors
    ):
        self._algorithm = algorithm
        self._clock = clock
        self._policy_resolver = policy_resolver
        self._fail_open = fail_open

    def allow(self, key: str) -> Decision:
        try:
            policy = self._policy_resolver.resolve(key)
            return self._algorithm.allow(key, policy, self._clock)
        except Exception:
            # Storage/clock failure: apply failure policy
            if self._fail_open:
                return Decision(allowed=True, remaining=-1, retry_after_ms=0)
            else:
                return Decision(allowed=False, remaining=0, retry_after_ms=1000)

    def reset(self, key: str) -> None:
        self._algorithm.reset(key)
Concurrency — The Read-Modify-Write Race
The most critical correctness requirement for a rate limiter is atomicity of the read-modify-write. This is where most naive implementations fail under load.
The race: Two threads A and B both want to check if api_key_123 has tokens remaining. Thread A reads tokens=1, thread B reads tokens=1 (before A writes). A decrements to 0 and writes. B decrements from 1 to 0 and writes. Both requests are allowed. Token was consumed twice — the limiter has been bypassed. At 10K concurrent requests, this can allow a multiple of the configured limit.
Fix for in-process concurrency: Per-key locking. Each key has its own threading.Lock(). The entire read-modify-write cycle runs inside with self._key_lock(key):. Only threads competing for the same key serialize; threads for different keys run fully in parallel. This is the same fine-grained locking idea used in the thread-safe LRU cache case study.
Why not a global lock? A global lock serializes all rate limit checks. At 100K requests/second, all threads queue behind one lock — the rate limiter becomes the throughput bottleneck for the entire API tier.
Fix for distributed concurrency (Redis): In-process locks don't span servers. The correct Redis approach uses a Lua script that makes the entire read-modify-write atomic on the Redis server:
-- Lua script for token bucket (runs atomically on Redis — no interleaving possible)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per ms
local now_ms = tonumber(ARGV[3])

local state = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now_ms

local elapsed = now_ms - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
last_refill = now_ms

if tokens >= 1 then
    tokens = tokens - 1
    redis.call("HMSET", key, "tokens", tokens, "last_refill", last_refill)
    redis.call("PEXPIRE", key, 86400000)
    return {1, math.floor(tokens)}  -- {allowed, remaining}
else
    redis.call("HMSET", key, "tokens", tokens, "last_refill", last_refill)
    redis.call("PEXPIRE", key, 86400000)
    return {0, 0}
end
Redis executes Lua scripts atomically — no other Redis command can interleave during script execution. This gives distributed rate limiting the same correctness guarantee as a per-key lock in a single process.
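On the application side the migration stays behind the existing interfaces: a Redis-backed algorithm hands the whole read-modify-write to the script, and nothing above it changes. A sketch using the redis-py client (register_script wraps script loading and EVALSHA; the class name and argument layout here are illustrative and match the script above):
import redis  # redis-py client, assumed available in the production environment

TOKEN_BUCKET_LUA = """..."""  # the Lua script shown above, embedded or loaded from a file

class RedisTokenBucketAlgorithm(Algorithm):
    """Token bucket whose read-modify-write runs atomically inside Redis."""
    def __init__(self, client: redis.Redis):
        self._client = client
        self._script = client.register_script(TOKEN_BUCKET_LUA)  # EVALSHA under the hood

    def allow(self, key: str, policy: RatePolicy, clock: Clock) -> Decision:
        refill_rate = policy.limit / policy.window_ms             # tokens per ms
        allowed, remaining = self._script(
            keys=[f"tb:{key}"],
            args=[policy.limit, refill_rate, clock.now_ms()],
        )
        if allowed == 1:
            return Decision(allowed=True, remaining=int(remaining), retry_after_ms=0)
        # Coarse estimate: time to refill one token from empty
        return Decision(allowed=False, remaining=0, retry_after_ms=int(1.0 / refill_rate))

    def reset(self, key: str) -> None:
        self._client.delete(f"tb:{key}")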
Extensibility Matrix — What Changes for Each Extension
| Extension | Classes Changed | Classes Added | Interface Change? |
|---|---|---|---|
| Add leaky bucket algorithm | None | LeakyBucketAlgorithm | No |
| Swap local store → Redis | None | RedisStore | No |
| Per-tenant limits | None | PolicyResolver with tenant map | No |
| Fail-open vs fail-closed | RateLimiter (add fail_open flag) | None | No (same allow() signature) |
| Expose a Retry-After header | None (Decision already carries retry_after_ms) | None | No (the HTTP layer maps retry_after_ms to the header) |
| Request idempotency (retry dedup) | None | IdempotencyFilter (wraps RateLimiter) | No |
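The IdempotencyFilter in the last row is a thin wrapper rather than a change to RateLimiter: a retried request carrying the same request_id gets the original Decision back without consuming another token. A sketch, with a plain dict as the short-lived idempotency cache (production would bound it and evict expired entries):
class IdempotencyFilter:
    """Wraps a RateLimiter so retries of the same request_id do not consume extra tokens."""
    def __init__(self, limiter: RateLimiter, clock: Clock, ttl_ms: int = 60_000):
        self._limiter = limiter
        self._clock = clock
        self._ttl_ms = ttl_ms
        self._seen: dict[str, tuple[int, Decision]] = {}  # request_id → (expires_at_ms, decision)
        self._lock = threading.Lock()

    def allow(self, key: str, request_id: str) -> Decision:
        now_ms = self._clock.now_ms()
        with self._lock:
            cached = self._seen.get(request_id)
            if cached is not None and cached[0] > now_ms:
                return cached[1]                           # replayed retry: same Decision, no extra token
        decision = self._limiter.allow(key)
        with self._lock:
            self._seen[request_id] = (now_ms + self._ttl_ms, decision)
        return decision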
Production Pitfalls
1. Not injecting the clock. time.time() hardcoded in the algorithm makes it impossible to test time-dependent logic without actually waiting. A FakeClock lets tests advance time programmatically — no sleep, no flakiness.
2. Not handling the failure policy. If Redis is unavailable during rate limit check, what happens? Fail-open (allow the request): appropriate for non-critical APIs where availability > rate correctness. Fail-closed (deny): appropriate for billing-critical or abuse-prevention APIs. Not deciding is deciding — the default behavior becomes whatever the exception propagates to.
3. Not accounting for clock skew in distributed systems. If server A's clock is 50ms ahead of server B's, their token bucket refill calculations diverge. Mitigation: use the Redis server's time (TIME command) as the reference clock, not the application server's clock. This centralizes timekeeping.
4. Retries double-counting tokens. A client sends a request, it's allowed (token consumed), the network times out, client retries. The second attempt consumes another token. For idempotent operations, this is fine. For billing or one-time actions, pass a request_id; the rate limiter checks if this request_id was already allowed (in a short-lived idempotency cache) and returns the same Decision without consuming another token.
5. Choosing fixed window for security-critical paths. The 2× burst at window boundaries is a known vulnerability. An attacker can send 2 × limit requests in 2 seconds by timing requests at the boundary. Token bucket or sliding window is the correct choice for security-sensitive endpoints.
Interview Delivery Summary
Lead with the algorithm choice question: "Before choosing an algorithm, I need to know the traffic pattern. Is burst traffic expected and acceptable (use token bucket — configurable burst) or should output be strictly smooth (leaky bucket)? Is exact accuracy at window boundaries required (sliding window log) or is a ~0.1% approximation acceptable (sliding window counter)?"
State the concurrency requirement early: "Rate limiting under concurrency is a read-modify-write problem. The check (are tokens available?) and the write (consume a token) must be atomic — or two concurrent requests can both pass the check and both consume the last token."
Demonstrate testability: "I inject a FakeClock and InMemoryStore. This makes every test deterministic — I advance the fake clock by exactly the refill interval and assert the token count. No real time.sleep() calls."
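As a sketch, that deterministic refill test might look like this (pytest-style; the policy numbers are arbitrary):
def test_token_bucket_refills_after_half_window():
    clock = FakeClock(start_ms=0)
    algo = TokenBucketAlgorithm(InMemoryStore())
    policy = RatePolicy(limit=2, window_ms=1_000)         # 2 tokens per second

    assert algo.allow("k", policy, clock).allowed         # token 1
    assert algo.allow("k", policy, clock).allowed         # token 2
    assert not algo.allow("k", policy, clock).allowed     # bucket empty

    clock.advance(500)                                    # half a window: exactly 1 token refilled
    assert algo.allow("k", policy, clock).allowed
    assert not algo.allow("k", policy, clock).allowed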
Staff signal: "In production with multiple servers, per-key in-process locks are insufficient — they don't span processes. The correct approach is a Redis Lua script for atomic read-modify-write. Stripe's production rate limiter uses exactly this pattern. The Algorithm.allow() interface stays the same — only the Store implementation changes from InMemoryStore to RedisStore."