
Rate Limiter Design: Token Bucket, Sliding Window, and Distributed Enforcement

Design distributed rate limiters for APIs and gateways. Covers all five algorithm tradeoffs (token bucket, leaky bucket, fixed window, sliding window log, sliding window counter), Redis data structure choices, hot-key mitigation, race conditions, and multi-region consistency. One of the most frequently asked HLD fundamentals at FAANG.


Why Rate Limiting Is One of the Most-Tested HLD Primitives

Rate limiting appears in almost every system design interview because it sits at the intersection of correctness, performance, and distributed consistency — three properties that conflict in interesting ways.

The problem sounds simple: allow a user to make at most N requests per time window T. The implementation reveals depth at every layer: choosing the algorithm (token bucket vs sliding window), handling distributed state (one Redis cluster vs per-region counters), dealing with race conditions (atomic Lua scripts vs Redis WATCH/MULTI), and degrading gracefully when the rate limiter itself is unavailable.

Why the algorithm choice matters: a fixed-window counter allows 2× the rate limit in burst — a user gets N requests at the end of window 1 and another N at the start of window 2, for 2N requests in the window boundary zone. This is the boundary burst problem and it is the single most common interview follow-up. Any candidate who proposes fixed-window without addressing this gets marked down.

The distributed challenge: a single-machine rate limiter is trivial — an in-memory token bucket or counter. The hard problem is ensuring consistency when traffic is distributed across 10 API gateway nodes. If each node tracks its own counter against the full limit, a user can send N requests to each of 10 nodes — 10N in total — and no node's counter ever trips. The solution requires shared state — typically Redis — and atomic operations to avoid race conditions.

What most candidates miss: the rate limiter must fail open gracefully. If Redis is unreachable, the correct default for most production systems is to allow requests through (fail open) rather than reject all traffic (fail closed), because the rate limiter outage should not take down the API. This is a conscious tradeoff between abuse prevention and availability.

IMPORTANT

What Interviewers Evaluate

9/10 answer: Immediately identifies the boundary burst problem with fixed-window; explains why Redis INCR + EXPIRE has a race condition and how Lua scripts fix it atomically; proposes sliding window counter as the best general tradeoff; discusses fail-open vs fail-closed; mentions X-RateLimit response headers; addresses the hot-key problem for high-traffic users.

6/10 answer: Proposes token bucket or Redis INCR without discussing race conditions or the boundary burst problem; treats the rate limiter as stateless; doesn't mention what happens when Redis is down.

What impresses: knowing that Redis INCR is not atomic with TTL setting (a race window between INCR and EXPIRE); describing the Lua script that makes INCR + conditional EXPIRE atomic in one round trip; mentioning redis-cell (a Redis module implementing the generic cell rate algorithm, GCRA) or the redis-rate-limiter pattern; explaining local counters + periodic Redis sync for high-throughput scenarios (10K+ RPS per key).

RACED Framework Applied to Rate Limiter

01

Requirements

Functional: enforce per-user / per-IP / per-API-key limits; configurable window (1s, 1min, 1hr); burst allowance; return standard headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After on 429); exempt internal traffic. Non-functional: decision latency < 5ms p99 (added overhead to every request); availability 99.99% (must not become the SPOF); throughput: 10K+ rate-limit decisions per second; consistency: strong enough to prevent the 2× boundary burst; scale: millions of unique rate limit keys.

02

API & Entities

Core API: check_rate_limit(key: str, policy: Policy) → RateLimitResult. Result: {allowed: bool, remaining: int, reset_at: epoch_ms, retry_after_ms: int}. Policy: {limit: int, window_sec: int, burst_multiplier: float}. Key hierarchy: {user_id}:{api}:{window} — enables per-user, per-endpoint, per-window granularity.
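As a sketch, the entities above map directly onto plain dataclasses (field names follow the shapes given here; the helper is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Policy:
    limit: int                    # max requests per window
    window_sec: int               # window length in seconds
    burst_multiplier: float = 1.0

@dataclass
class RateLimitResult:
    allowed: bool
    remaining: int
    reset_at: int                 # epoch milliseconds when the window resets
    retry_after_ms: int           # 0 when the request is allowed

def make_key(user_id: str, api: str, window_start: int) -> str:
    # Key hierarchy from the design: {user_id}:{api}:{window}
    return f"{user_id}:{api}:{window_start}"
```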

03

Core Design

Gateway plugin calls the rate limiter service synchronously on every request. Rate limiter checks Redis (atomic Lua script) and returns allow/deny + headers. For very high throughput (>100K RPS per key), use local token bucket counters in the gateway process with periodic Redis sync to reduce Redis round trips.

04

Escalation (deep dives)

Algorithm choice (boundary burst, accuracy vs memory); race conditions (INCR + EXPIRE atomicity); hot keys (Redis cluster key distribution); distributed enforcement (per-region vs global counters); fail-open vs fail-closed; multi-tier limits (per-second burst + per-hour sustained).

05

Durability

Rate limit state is ephemeral by design — losing counter state means a brief window of unconstrained access, not data loss. Redis AOF or RDB persistence is optional. For strict abuse prevention, Redis replica + failover. For relaxed policies (marketing APIs), in-memory counters with Redis as optional sync.

The Five Algorithms: Accuracy vs Memory vs Burst

Fixed Window Counter: divide time into fixed windows (e.g., 1-minute slots: 12:00–12:01, 12:01–12:02). Count requests per window. Reset counter at window boundary. Implementation: Redis key ratelimit:{user}:{window_start} with INCR and EXPIRE at window_start + window_duration.

Problem — boundary burst: a user sends N requests at 12:00:59 (within window 1) and another N at 12:01:01 (within window 2). Both are within limits, but in a 2-second span the user sent 2N requests — twice the intended rate. This is the canonical fixed-window failure mode.
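A minimal single-node sketch makes the boundary burst concrete (an in-memory stand-in for the Redis counter; names are illustrative):

```python
from collections import defaultdict

class FixedWindowLimiter:
    """In-memory fixed window counter (single-node illustration only)."""

    def __init__(self, limit: int, window_sec: int):
        self.limit = limit
        self.window_sec = window_sec
        self.counts = defaultdict(int)   # window_start -> request count

    def allow(self, now: float) -> bool:
        window_start = int(now // self.window_sec) * self.window_sec
        if self.counts[window_start] >= self.limit:
            return False
        self.counts[window_start] += 1
        return True

# Boundary burst: 100 requests at t=59s and 100 more at t=61s all pass,
# because they land in different windows: 200 requests in a 2-second span.
rl = FixedWindowLimiter(limit=100, window_sec=60)
burst = sum(rl.allow(59.0) for _ in range(100)) + sum(rl.allow(61.0) for _ in range(100))
assert burst == 200
```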

Sliding Window Log: store a timestamp for every request in a sorted set. On each new request, remove all timestamps older than now - window and count the remaining. If count < limit, allow and add the new timestamp. Accurate — no boundary burst. Cost: O(limit) memory per user (storing every timestamp); for 1000 req/min, that's 1000 entries per user. At 1M users this is 1B entries. Impractical for large-scale APIs.
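An in-memory sketch of the log variant (in Redis this is typically a sorted set pruned with ZREMRANGEBYSCORE; the Python below is a single-node illustration):

```python
from collections import deque

class SlidingWindowLog:
    """Sliding window log: exact, but stores one timestamp per allowed request."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.log = deque()   # timestamps of allowed requests, oldest first

    def allow(self, now: float) -> bool:
        # Evict timestamps that fell out of the window.
        while self.log and self.log[0] <= now - self.window_sec:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```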

Sliding Window Counter (the best general-purpose algorithm): approximates the sliding window using two fixed windows. Current window count + (previous window count × fraction of window remaining). Example: limit=100/min, current window (12:01–12:02) has 30 requests, previous window (12:00–12:01) had 70 requests. 40% of the current window has elapsed. Effective count = 30 + 70 × (1 - 0.4) = 30 + 42 = 72. Allow if 72 < 100. Memory: 2 counters per user (current + previous window). Accuracy: within ~1% of true sliding window. This is what Cloudflare, GitHub, and most production systems use.
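The weighted-count formula from the example is only a few lines (illustrative helpers, not a full limiter):

```python
def sliding_window_count(curr_count: int, prev_count: int,
                         window_sec: int, now: float) -> float:
    """Current window's count plus the previous window's count scaled by
    the fraction of the current window that has NOT yet elapsed."""
    elapsed = (now % window_sec) / window_sec
    return curr_count + prev_count * (1 - elapsed)

def sliding_window_allow(curr: int, prev: int, window_sec: int,
                         now: float, limit: int) -> bool:
    return sliding_window_count(curr, prev, window_sec, now) < limit

# Worked example from the text: 40% of a 60s window elapsed (now % 60 == 24),
# 30 requests in the current window, 70 in the previous -> effective count 72.
```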

Token Bucket: a "bucket" holds up to capacity tokens. Tokens refill at rate tokens/sec. Each request consumes one token. If the bucket is empty, deny the request. Implementation requires storing {current_tokens, last_refill_time} per user and computing tokens added = (now - last_refill) × rate on each check. Pros: naturally allows bursts (up to capacity tokens can be consumed instantly). Cons: more state, requires floating-point arithmetic in Lua, drift in distributed scenarios.
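A lazy-refill token bucket sketch (single-node; the caller supplies the clock so behavior is deterministic, and all names are illustrative):

```python
class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec, burst up to `capacity`."""

    def __init__(self, capacity: float, rate: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity       # start full: burst allowed immediately
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Lazy refill: credit tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```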

Leaky Bucket: requests enter a FIFO queue (the "bucket"); a worker drains the queue at a fixed rate. Excess requests that overflow the queue are rejected. Smooths bursty traffic into a constant output rate. Used in network devices (traffic shaping). For API rate limiting: the queue adds queuing latency and complexity without the burst-handling benefit of token bucket. Rarely the right choice for API gateways.
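For completeness, a queue-based leaky bucket sketch (illustrative; in practice the drain runs in a background worker rather than lazily on each offer):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: bounded FIFO queue drained at a fixed rate."""

    def __init__(self, queue_size: int, drain_rate: float, now: float = 0.0):
        self.queue_size = queue_size
        self.drain_rate = drain_rate   # requests drained per second
        self.queue = deque()
        self.last_drain = now

    def offer(self, request, now: float) -> bool:
        # Drain whole requests for the elapsed time.
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
        if len(self.queue) >= self.queue_size:
            return False               # overflow: reject
        self.queue.append(request)
        return True
```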

Algorithm Comparison: Five Rate Limiting Approaches

| Algorithm | Memory per Key | Handles Bursts | Boundary Burst | Best For |
| --- | --- | --- | --- | --- |
| Fixed Window Counter | O(1) — 1 integer | No — flat limit | Yes — 2× spike at boundary | Internal low-risk limits, simple cases |
| Sliding Window Log | O(limit) — 1 set | Yes — accurate | No — exact sliding window | Strict abuse prevention where memory is acceptable |
| Sliding Window Counter | O(1) — 2 integers | Approximate | ~1% approximation error | General-purpose API limits — production default |
| Token Bucket | O(1) — 2 floats | Yes — up to capacity | No — token count is continuous | Public APIs needing burst allowance (e.g., SDK clients) |
| Leaky Bucket | O(queue_size) | No — smooths to rate | No — constant drain | Network traffic shaping, not typical API rate limiting |

Distributed Rate Limiter Architecture


Race Conditions and Atomic Redis Operations

The naive implementation of a Redis rate limiter has a critical race condition. Consider this pseudo-code:

count = INCR ratelimit:{user}:{window}
if count == 1:
    EXPIRE ratelimit:{user}:{window} 60
if count > limit:
    return deny
return allow

The race: only the request that observes count == 1 sets the TTL. If that process crashes (or its connection drops) between its INCR and the EXPIRE, the key is left without an expiry, and because every subsequent INCR returns a count greater than 1, no later request will ever set one. The counter then grows forever and the user ends up permanently rate limited. The two-command sequence also costs an extra round trip on the first request of each window.

Fix: Lua scripts. Redis executes Lua scripts atomically — no other Redis command runs between lines of the script. The following script is the canonical fixed-window counter implementation (the sliding window counter extends the same pattern with a second key for the previous window):

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window)
end
return {count, redis.call('TTL', key)}

This script increments the counter and sets the TTL atomically. No race condition is possible. The caller receives {count, ttl} and decides allow/deny + computes X-RateLimit-Remaining and X-RateLimit-Reset.
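A sketch of the gateway-side caller, turning the script's {count, ttl} reply into a decision plus standard X-RateLimit-* header values (the function and its exact header semantics are illustrative):

```python
import time

def decision_from_script(count: int, ttl: int, limit: int) -> dict:
    """Turn the Lua script's {count, ttl} reply into an allow/deny decision
    plus X-RateLimit-* header values."""
    allowed = count <= limit
    reset_at = int(time.time()) + max(ttl, 0)   # epoch seconds when window resets
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(limit - count, 0)),
        "X-RateLimit-Reset": str(reset_at),
    }
    if not allowed:
        headers["Retry-After"] = str(max(ttl, 0))   # only on 429
    return {"allowed": allowed, "headers": headers}
```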

Alternative: Redis SET with NX and EX. SET key value NX EX ttl creates the key with a TTL only if it doesn't already exist, in one atomic command; DECRBY can then consume tokens. But a full token bucket also needs a read-compute-write refill step, and without Lua that requires optimistic WATCH/MULTI/EXEC transactions, which retry under contention and are less efficient than a Lua script.

Hot Keys and Multi-Region Rate Limiting

Hot key problem: a viral API endpoint or a single power user generates 100K+ RPS. All rate limit checks for that key hash to the same Redis shard, creating a hot partition. A single Redis instance tops out at roughly 100K ops/sec, so one hot key alone can saturate its entire shard.

Solutions — three options:

1. Key sharding with local aggregation: split the rate limit key across N virtual shards: ratelimit:{user}:{window}:{random(0,N)}, each enforcing total_limit / N. Every check picks a random shard, spreading load across N Redis shards and cutting per-shard traffic by N×. Enforcement becomes approximate: random placement means one shard can fill while others still have headroom, so some requests are rejected early and the realized limit can drift slightly above or below the target. Use where approximate enforcement is acceptable.

2. Local counter with periodic Redis sync: each gateway node maintains a local in-memory counter. Every 100ms, it publishes its local count to Redis (INCRBY) and reads the global count back. Between syncs, requests are checked against the local estimate. This reduces Redis calls by 100× but introduces up to 100ms of over-counting window. Appropriate for per-minute or per-hour limits where 100ms of over-counting is acceptable. Twitter and Stripe use variants of this approach.

3. Dedicated rate limit service with connection pooling: instead of each gateway node directly calling Redis, a dedicated rate limiter service (10-instance fleet) handles all Redis calls. Gateway nodes call the rate limiter service via gRPC with keepalive connections. The service maintains Redis connection pools efficiently. This reduces the number of Redis connections from (gateway_nodes × connections_per_node) to (rl_service_nodes × connections_per_node) — important when gateway autoscales to 1000 nodes.
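Option 2 above (local counters with periodic sync) can be sketched with a dict standing in for Redis (illustrative; a real implementation would INCRBY its delta and GET the global count every sync interval):

```python
class SyncedCounter:
    """Local counter with periodic sync to a shared store.
    `store` stands in for Redis in this single-process sketch."""

    def __init__(self, limit: int, sync_interval: float, store: dict, key: str):
        self.limit = limit
        self.sync_interval = sync_interval
        self.store = store
        self.key = key
        self.local = 0                               # requests since last sync
        self.global_estimate = store.get(key, 0)
        self.last_sync = 0.0

    def allow(self, now: float) -> bool:
        if now - self.last_sync >= self.sync_interval:
            # Flush the local delta, then refresh the global estimate.
            self.store[self.key] = self.store.get(self.key, 0) + self.local
            self.local = 0
            self.global_estimate = self.store[self.key]
            self.last_sync = now
        if self.global_estimate + self.local >= self.limit:
            return False
        self.local += 1
        return True
```

The test below shows the tradeoff the text describes: between syncs, two nodes can jointly over-admit; after the next sync, the shared count catches up and enforcement resumes.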

Multi-region rate limiting:

For global APIs, a single user's requests from the EU and the US must count against the same global limit. Strong consistency (a cross-region Redis check on every request) adds ~100ms of RTT per request, which is unacceptable.

Recommended approach: enforce per-region with a slice of the global limit — allocate global_limit / num_regions to each region. For a 1000 req/min global limit across 3 regions, each region allows ~333 req/min. A user spread across all regions can reach, but not exceed, the global limit; a user pinned to one region sees only their slice, which is the common case since clients are geo-routed to a single region. For strict global enforcement, add periodic quota reconciliation: every minute, regions report usage to a central store and their allocations for the next window are adjusted.
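The slice arithmetic and a reconciliation step can be sketched as follows (the proportional-to-demand reallocation policy is an assumption, one of several reasonable choices):

```python
def region_slices(global_limit: int, num_regions: int) -> list:
    """Equal per-region slices; the remainder goes to the first regions."""
    base, extra = divmod(global_limit, num_regions)
    return [base + (1 if i < extra else 0) for i in range(num_regions)]

def reconcile(global_limit: int, last_usage: list) -> list:
    """Periodic reconciliation sketch: reallocate the next window's quota
    proportionally to each region's observed demand."""
    total = sum(last_usage) or 1
    return [round(global_limit * u / total) for u in last_usage]
```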

⚠ WARNING

Failure Modes to Address in Interviews

Race condition on INCR + EXPIRE: always use Lua scripts or WATCH/MULTI/EXEC to make counter increment and TTL set atomic. Naive sequential calls create a window where the key has no TTL.

Fail-closed vs fail-open: if Redis is unreachable, should you allow or deny requests? Fail-closed (deny all) prevents abuse but takes down your API during Redis outages. Fail-open (allow all) preserves availability but allows unbounded traffic during outages. Production default: fail open with circuit breaker + alerting. Add a secondary in-memory local limit as a backstop.

Clock skew in distributed systems: sliding window calculations rely on wall-clock time. If gateway nodes have clock skew > 1 second, different nodes may disagree about the current window, reopening the boundary burst (up to 2× the limit) at window edges. Fix: use Redis server time (the TIME command) as the authoritative clock, not the gateway's local clock.

Retry storms after 429: naive clients that retry immediately on 429 create a retry storm that amplifies load on the rate limiter. Always return Retry-After header with the seconds-until-reset. Enforce exponential backoff in your client SDKs. Document the retry policy in your API reference.
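A client-side backoff schedule honoring Retry-After might look like this (illustrative; the full-jitter doubling policy is an assumed choice):

```python
import random

def backoff_delays(retry_after_sec: float, base: float = 1.0, cap: float = 60.0,
                   attempts: int = 4, rng=random.random) -> list:
    """Wait schedule for a client that just received a 429.
    The first wait honors the server's Retry-After header; subsequent
    waits double from `base` (capped) with full jitter to avoid a storm."""
    delays = [float(retry_after_sec)]
    for attempt in range(1, attempts):
        delays.append(min(cap, base * 2 ** attempt) * rng())
    return delays
```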

Rate Limit Response Headers Standard

| Header | Value Format | Meaning |
| --- | --- | --- |
| X-RateLimit-Limit | 1000 | Maximum requests allowed in the window |
| X-RateLimit-Remaining | 47 | Requests remaining in the current window |
| X-RateLimit-Reset | 1714040460 (Unix epoch) | Time when the window resets (UTC epoch seconds) |
| X-RateLimit-Window | 60 | Window duration in seconds |
| Retry-After | 53 (seconds) | Only on 429: how long to wait before retrying |
| X-RateLimit-Policy | 100;w=1, 1000;w=60 | Multi-policy format from the IETF RateLimit headers draft: burst=100/sec, sustained=1000/min |
TIP

Interview Summary — What to Say

Open with the algorithm choice: "My default is sliding window counter — it uses O(1) memory, approximates a true sliding window within 1%, and eliminates the boundary burst problem of fixed windows. Token bucket if the use case needs burst allowance for SDK clients."

On Redis: "I'll use a Lua script to make INCR + EXPIRE atomic in one Redis round trip — the naive two-command approach has a race condition where the key can be left without a TTL."

On distributed: "Each gateway node calls the same Redis cluster. For hot keys (power users or viral endpoints), I'd use local in-memory counters with 100ms periodic Redis sync — reduces Redis calls by 100× at the cost of 100ms over-counting, acceptable for per-minute limits."

On failure: "If Redis is unreachable, I fail open — rate limiting is a best-effort abuse prevention mechanism, not a correctness requirement. I add circuit breaker + alerting and a local backstop limit."

Staff-level closer: "Multi-region: I'd allocate per-region quota slices (global_limit / num_regions) to avoid cross-region round trips. For strict global limits on financial APIs, I'd use a dedicated quota service with eventual consistency and allow brief over-quota periods bounded by the reconciliation interval."
