Load Balancer Design: L4/L7 Routing, Health Checks, and Failover
Design scalable load balancing for modern distributed systems. Covers L4 vs L7 tradeoffs, routing algorithms (round-robin, least-connections, P2C, consistent hashing), health check design, connection draining, sticky sessions, and global load balancing with GeoDNS and Anycast. Builds the mental model interviewers use to assess system design maturity.
Why Load Balancers Are a Core HLD Building Block
Every scalable system has a load balancer — but most candidates treat it as a box to draw and move on. Interviewers use load balancer questions to probe whether you understand traffic distribution, failure isolation, and the difference between routing policy and health policy.
A load balancer solves three distinct problems:
- Horizontal scaling: distribute incoming requests across N backend instances so no single instance is overloaded.
- Failure isolation: detect unhealthy backends and stop sending traffic to them before clients notice.
- Policy enforcement: apply routing decisions based on request content (L7), connection metadata (L4), or geographic proximity — not just availability.
The fundamental tradeoff that every interview question probes: more intelligent routing requires more context, which requires more CPU and state. An L4 load balancer makes a routing decision in microseconds with zero knowledge of the application. An L7 load balancer can route based on URL path, HTTP headers, and JWT claims — but it must terminate TLS, parse the HTTP request, and maintain connection state. That overhead is worth it for API routing, session affinity, and canary deployments — and it is not worth it for raw TCP throughput at 10M+ concurrent connections.
What most candidates miss: the load balancer itself is a single point of failure unless you design for it. Active-active pairs with BGP Anycast or ECMP eliminate this; single-LB deployments move the SPOF from the backend to the LB tier.
What Interviewers Evaluate
9/10 answer: Immediately asks whether the system is HTTP/HTTPS API traffic (L7) or raw TCP/UDP (L4); distinguishes routing algorithms by use case rather than reciting them abstractly; explains why health check interval + unhealthy threshold creates a detection window (not instant failover); mentions connection draining for zero-downtime deploys; discusses the LB tier's own HA design.
6/10 answer: Draws a box labeled "Load Balancer," mentions round-robin, and moves to the next component. Does not discuss health checks, algorithm selection rationale, or LB failover.
What impresses: knowing the Power of Two Choices (P2C) algorithm by name and why it statistically outperforms least-connections; discussing connection draining timeouts (typically 30–60s) in the context of deployment safety; explaining why consistent hashing is needed for stateful backends (e.g., WebSocket servers, game servers) but not stateless APIs.
RACED Framework Applied to Load Balancer
Requirements
Functional: distribute HTTP/TCP traffic across backend instances, health-check backends, route based on policy (path-based, header-based, weighted), drain connections on deploy, support sticky sessions where required. Non-functional: throughput (100K–10M req/sec depending on L4/L7), latency overhead < 1ms for L4, < 5ms for L7 including TLS; availability 99.99% (requires active-active LB pair); observability (per-backend error rate, connection count, queue depth).
API & Entities
Listener: {protocol, port, default_target_group}. Target group: {backends[], health_check_config, routing_algorithm}. Routing rule: {priority, conditions: [path_pattern, header_match], target_group}. Health check: {path, interval_sec, healthy_threshold, unhealthy_threshold, timeout_sec}.
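The entity schema above can be sketched as Python dataclasses. Field names follow the entities listed; the default values shown are illustrative assumptions, not any vendor's actual configuration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HealthCheck:
    path: str = "/health"
    interval_sec: int = 10
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3
    timeout_sec: int = 5

@dataclass
class TargetGroup:
    backends: List[str] = field(default_factory=list)
    health_check: HealthCheck = field(default_factory=HealthCheck)
    routing_algorithm: str = "p2c"

@dataclass
class Listener:
    protocol: str
    port: int
    default_target_group: TargetGroup

# One HTTPS listener fronting a two-backend target group.
listener = Listener("HTTPS", 443,
                    TargetGroup(backends=["10.0.0.1:8080", "10.0.0.2:8080"]))
```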
Core Design
Active-active LB pair behind Anycast VIP (or DNS-level GSLB). Each LB instance runs the routing algorithm against a real-time view of backend health state (maintained by independent health check threads). L7 LBs terminate TLS, inspect HTTP headers, and apply routing rules before forwarding.
Escalation (deep dives)
Slow-fail backends (not dead but slow) → P2C or least-latency algorithms; sticky sessions → consistent hashing tradeoffs; zero-downtime deploys → connection draining; global routing → GeoDNS vs Anycast; LB tier SPOF → active-active + health-based DNS failover.
Durability
LB configuration stored in control plane (etcd/ZooKeeper). Data plane (packet routing) is stateless and survives control plane outages. Health state is local to each LB instance — no shared state required for routing decisions. For session-sticky LBs, the session → backend mapping is replicated between the active-active pair.
L4 vs L7: The Fundamental Architecture Choice
L4 (Transport Layer) load balancing operates at the TCP/UDP level. It sees source IP, destination IP, and port — nothing about the application payload. Routing decisions are made by rewriting the destination IP in the TCP SYN packet (DSR mode) or establishing a new TCP connection to the backend (proxy mode). Because it never reads the HTTP headers, TLS session, or URL path, it is extremely fast: HAProxy in L4 mode handles 10M+ concurrent connections on commodity hardware. AWS NLB operates at L4.
L7 (Application Layer) load balancing terminates the TLS connection, reads the full HTTP request (headers, path, body if needed), applies routing rules, and establishes a new connection to the chosen backend. This requires significantly more CPU (TLS termination is expensive) but enables: path-based routing (/api → API cluster, /static → CDN origin), header-based routing (canary deployments via X-Canary: true), WebSocket upgrade handling, gRPC routing, and request-level observability. AWS ALB, Nginx, HAProxy in HTTP mode, and Envoy operate at L7.
The interview answer: use L4 for raw throughput (gaming, streaming, large-file transfer), use L7 for anything HTTP/gRPC where you need routing flexibility or observability. In practice, most internet-facing APIs use L7 with TLS termination at the LB; internal service-to-service traffic uses L4 or service mesh (Envoy/Linkerd) for lower overhead.
L4 vs L7 Load Balancing Architecture
Routing Algorithms: Which to Use and When
Round Robin assigns requests sequentially: request 1 → backend A, request 2 → backend B, request 3 → backend C, request 4 → backend A, etc. Simple and zero-state. Works well when all backends are identical and requests have similar processing time. Fails when: backends have different capacity (CPU/memory), requests have highly variable cost (mix of fast and slow queries), or any backend is slow (round-robin happily sends traffic to a 2000ms-latency backend at the same rate as a 5ms backend).
Weighted Round Robin adds a weight to each backend (backend A weight=2, backend B weight=1), so A gets 2x the traffic. Use for heterogeneous hardware, canary deployments (new version at weight=5%, old at weight=95%), and blue-green rollouts.
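A minimal weighted round-robin sketch: expand each backend by its weight and cycle through the expanded list, so a weight-2 backend receives twice the traffic of a weight-1 backend. (Real LBs interleave weights smoothly — e.g. Nginx's smooth weighted round-robin — rather than sending a backend's whole share consecutively.)

```python
import itertools

def weighted_round_robin(backends):
    """Yield backends in proportion to their weights.

    `backends` is a list of (name, weight) pairs, e.g. a 95/5 canary
    split would be [("old", 95), ("new", 5)].
    """
    expanded = [name for name, weight in backends for _ in range(weight)]
    return itertools.cycle(expanded)

rr = weighted_round_robin([("A", 2), ("B", 1)])
first_cycle = [next(rr) for _ in range(3)]  # A appears twice, B once
```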
Least Connections routes each new request to the backend with the fewest active connections. Better than round-robin for variable-length requests (database queries, file uploads). Requires state: each LB instance must track connection counts per backend. Degrades when backends have different connection-handling throughput.
Power of Two Choices (P2C): randomly sample 2 backends from the pool, route to the one with fewer active connections. Statistically, P2C keeps the maximum load within an O(log log N) gap of the average — near-optimal — with O(1) per-request overhead, and it avoids the herding problem where every LB instance simultaneously floods the single "least loaded" backend. Nginx exposes it as the `random two least_conn` upstream directive, and it is the best general-purpose algorithm for stateless APIs.
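The whole algorithm fits in a few lines. A sketch, assuming the LB tracks per-backend active-connection counts in a dict; note how a slow backend that has accumulated connections is naturally avoided:

```python
import random

def pick_backend_p2c(conn_counts, rng=random):
    """Power of Two Choices: sample two distinct backends uniformly at
    random, route to the one with fewer active connections. O(1) cost
    regardless of pool size."""
    a, b = rng.sample(list(conn_counts), 2)
    return a if conn_counts[a] <= conn_counts[b] else b

# A degraded backend accumulates connections (slow responses hold them
# open), so P2C steers new traffic away from it without any explicit
# health signal.
counts = {"fast1": 3, "fast2": 4, "slow": 40}
choice = pick_backend_p2c(counts)
```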
Consistent Hashing: hash the request key (client IP, session ID, user ID) to a ring; always route the same key to the same backend. Required for: WebSocket servers (connection must stay on one backend), stateful game servers, and cache-warming scenarios where routing the same user to the same pod avoids cache misses. The downside: when backends are added or removed, ~1/N of keys remap (with virtual nodes to reduce hotspots). Not suitable for stateless APIs.
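A minimal hash-ring sketch with virtual nodes (names and vnode count are illustrative). The same key always lands on the same backend, and resizing the pool remaps only a small fraction of keys:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes to smooth out hotspots."""

    def __init__(self, backends, vnodes=100):
        # Each backend is hashed onto the ring `vnodes` times.
        self.ring = sorted(
            (self._hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[idx][1]

ring = HashRing(["ws-1", "ws-2", "ws-3"])
backend = ring.get("user-12345")  # stable: same user, same WebSocket server
```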
Routing Algorithm Selection Guide
| Algorithm | Per-Request Cost / State | Best For | Avoid When |
|---|---|---|---|
| Round Robin | Zero state | Identical backends, uniform request cost | Variable request duration or heterogeneous hardware |
| Weighted Round Robin | Zero state | Canary deployments, heterogeneous hardware | When weights change frequently (config lag) |
| Least Connections | O(N) per request | Variable-duration requests (DB queries, uploads) | Very large backend pools (N > 1000) |
| Power of Two Choices (P2C) | O(1) — sample 2 | General-purpose stateless APIs — best default | Stateful connections requiring affinity |
| Consistent Hashing | O(log N) hash ring lookup | Stateful backends (WebSocket, game servers, caches) | Stateless APIs — adds affinity complexity unnecessarily |
| Random | Zero state | Testing, simple low-traffic systems | Any production system caring about tail latency |
Health Checks, Connection Draining, and Zero-Downtime Deploys
Health check design determines your failure detection window. A typical health check config: interval=10s, unhealthy_threshold=3, healthy_threshold=2, timeout=5s. This means: a backend that fails needs 3 consecutive failures × 10s interval = 30 seconds before it is removed from the pool. During those 30 seconds, the LB continues sending traffic to the failing backend. Design your retry budget accordingly: if health check detection is 30s, your clients must retry for at least 30s, or your SLO takes the hit.
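The detection-window arithmetic above is worth making explicit, since interviewers often ask for it directly. A tiny helper (the function name is mine, not a real LB API):

```python
def detection_window_sec(interval_sec, unhealthy_threshold):
    """Worst-case time a failed backend keeps receiving traffic:
    consecutive failed probes x probe interval. (Up to one extra
    interval can elapse before the first failed probe, plus the probe
    timeout itself; ignored here for simplicity.)"""
    return interval_sec * unhealthy_threshold

# detection_window_sec(10, 3) -> 30: the 30-second window described above.
```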
Health check types by depth:
- TCP connect (L4): just verifies the port is accepting connections. Does not detect application-level failures (worker pool exhausted, DB connection pool full). Use as a baseline, not a sole health signal.
- HTTP 200 check (L7 shallow): GET `/health` returns 200. Fast but still surface-level — a 200 can be returned while the handler is stuck behind a slow mutex.
- Readiness probe (L7 deep): `/ready` performs a lightweight DB ping, cache ping, and dependency check. Returns 200 only when the backend can actually serve traffic. Use for Kubernetes readiness gates; use shallow health for liveness.
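The shallow/deep distinction can be sketched as two handlers. `db_ping` and `cache_ping` are hypothetical injected callables standing in for real dependency checks:

```python
def liveness():
    """Shallow check: the process is up and can answer a request.
    Says nothing about whether dependencies are reachable."""
    return 200

def readiness(db_ping, cache_ping):
    """Deep check: 200 only when the backend can actually serve traffic.
    `db_ping` / `cache_ping` are callables returning True on success
    (assumed helpers for this sketch, not a framework API)."""
    return 200 if db_ping() and cache_ping() else 503
```

A backend whose DB connection pool is exhausted keeps passing `liveness()` but fails `readiness()` — exactly the failure the TCP-connect check above cannot see.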
Connection draining (AWS: "deregistration delay") is critical for zero-downtime deploys. When a backend is removed from the pool (deploy, scale-down, or health failure), the LB: (1) immediately stops sending new requests to that backend, (2) waits up to drain_timeout (typically 30–60s) for in-flight requests to complete, then (3) forcefully closes remaining connections. Without connection draining, a deploy that removes a pod mid-request causes in-flight requests to fail with TCP RST. This is the most common source of deploy-time 502/503 errors in production.
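The three-step drain sequence can be sketched as a loop. `in_flight` is an assumed callable returning the backend's open-request count; `clock` and `sleep` are injected for testability — this is an illustration of the control flow, not any LB's real implementation:

```python
import time

def drain(backend, in_flight, drain_timeout_sec=45, poll_sec=0.5,
          clock=time.monotonic, sleep=time.sleep):
    """Connection-draining sketch.

    (1) Stop sending new requests to the backend.
    (2) Wait up to drain_timeout_sec for in-flight requests to finish.
    (3) On timeout, force-close whatever remains (clients see RSTs).
    """
    backend["accepting"] = False                 # (1) no new requests
    deadline = clock() + drain_timeout_sec
    while clock() < deadline:                    # (2) wait for in-flight work
        if in_flight() == 0:
            return "drained"
        sleep(poll_sec)
    return "force_closed"                        # (3) timeout exceeded
```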
Sticky sessions (session affinity) routes requests from the same client to the same backend for the duration of a session. Implemented via: a cookie inserted by the LB (AWSALB in AWS ALB, SERVERID in HAProxy), or consistent hashing on client IP (less reliable due to NAT/proxies). Sticky sessions create uneven load distribution and complicate deploys (draining a sticky backend strands all its sessions). Prefer stateless backends with session state in Redis; use sticky only when the application cannot be made stateless (legacy apps, WebSocket).
Global Load Balancing: GeoDNS vs Anycast
Global load balancing directs users to the nearest or healthiest data center before traffic hits your infrastructure.
GeoDNS returns different DNS A records based on the client's resolver location. A user in Europe resolves api.example.com to the EU region's IP; a user in Asia resolves to the APAC region's IP. Pros: simple to implement, works with any CDN. Cons: DNS TTL means failover takes 60–300 seconds (users cached the old IP); clients using non-local DNS resolvers (Google 8.8.8.8) are misrouted based on resolver location, not client location.
Anycast advertises the same IP block from multiple PoPs via BGP. Routers automatically send traffic to the topologically nearest PoP that announces the route. Failover is near-instant (BGP reconvergence in seconds, not DNS TTL minutes). Anycast is how Cloudflare, Google Cloud Load Balancing, and AWS Global Accelerator operate. It requires BGP infrastructure (AS number, IP blocks) — not available to most startups but standard at hyperscale.
Active-Active vs Active-Passive regions: active-active serves traffic from all regions simultaneously — lower latency but requires data replication to be read-correct in all regions. Active-passive keeps one region on standby; failover requires promoting the passive region, which adds latency to failover and requires careful data sync management. Prefer active-active for read-heavy workloads (most traffic is reads); active-passive for write-heavy workloads where conflict resolution is expensive.
Failure Modes to Call Out in Interviews
Round-robin to degraded backend: a backend that is slow (2000ms p99) but not dead passes TCP health checks. Round-robin sends it the same rate as healthy backends. The fix: use P2C (Power of Two Choices) which naturally routes less traffic to slow backends because they accumulate connections; or use health checks with latency thresholds (Envoy's outlier detection).
LB as single point of failure: a single LB instance fails → all traffic drops. Fix: active-active LB pair sharing an Anycast VIP or a floating IP (VRRP). AWS automatically runs ALB across multiple AZs.
Connection draining timeout too short: 10s drain timeout on an API with long-running requests (uploads, streaming) causes request abortion on every deploy. Tune drain timeout to p99 request duration + buffer.
Thundering herd on backend recovery: when a backend recovers and is re-added to the pool, if the algorithm is round-robin, it gets a full share of traffic immediately — possibly crashing a just-recovered instance. Use slow-start (gradual traffic warm-up over 30–60s) in Nginx upstream blocks or AWS ALB slow-start mode.
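Slow-start is just a weight ramp over the warm-up window. A minimal linear version (the function name and defaults are illustrative; Nginx's `slow_start` and ALB's slow-start mode implement the same idea):

```python
def slow_start_weight(seconds_since_rejoin, target_weight=100, warmup_sec=60):
    """Ramp a recovered backend's routing weight linearly from 0 to its
    target over the warm-up window, so it is not flooded with a full
    traffic share the instant it passes its health check."""
    if seconds_since_rejoin >= warmup_sec:
        return target_weight
    return int(target_weight * seconds_since_rejoin / warmup_sec)
```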
Load Balancer Technology Comparison
| Technology | Layer | Throughput | Best For | Key Limitation |
|---|---|---|---|---|
| AWS NLB | L4 | Millions of req/sec | TCP/UDP, gaming, streaming, static IP requirement | No HTTP-aware routing, no host-based rules |
| AWS ALB | L7 | ~1M req/sec (managed) | HTTP/HTTPS APIs, gRPC, WebSockets, path routing | Latency overhead for TLS termination; no BGP Anycast |
| HAProxy | L4 + L7 | 10M+ connections (L4) | On-prem, custom routing logic, TCP proxying | Requires manual HA setup; configuration complexity |
| Nginx | L7 (+ L4 stream) | ~100K req/sec/core | API gateway, reverse proxy, TLS offload | Less expressive routing than Envoy; manual reload for config changes |
| Envoy | L7 | ~100K req/sec/core | Service mesh, dynamic config, advanced observability | Operational complexity; xDS control plane required |
| Cloudflare / AWS Global Accelerator | Global L4/L7 | Multi-Tbps at edge | Global Anycast, DDoS protection, latency routing | Cost; not for internal traffic |
Interview Summary — What to Say
Open with the L4 vs L7 question: "Before choosing an algorithm, I need to know if we're routing HTTP/HTTPS or raw TCP. L7 costs more CPU for TLS + parsing but gives routing flexibility. L4 handles 10× the connections at wire speed."
For algorithm selection: "My default is Power of Two Choices for stateless APIs — it statistically outperforms least-connections with O(1) overhead. I use consistent hashing only for stateful backends like WebSocket servers."
For health checks: "Detection window is interval × unhealthy_threshold. With 10s interval and 3 strikes, a failed backend gets traffic for 30 seconds. My client retry budget must cover this window."
For HA: "The LB itself cannot be a SPOF. Active-active pair with a shared Anycast VIP or floating IP, with health-based failover at the DNS or BGP level."
Staff-level add: "For global routing, GeoDNS has 60–300s failover lag due to TTL. Anycast (BGP) reconverges in seconds — that is what Cloudflare and Google Cloud LB use. For most products, GeoDNS + short TTL (60s) is sufficient."