How to Design at HLD: From Blank Whiteboard to Defensible Architecture
The mechanical playbook for HLD interview execution. Covers the API-first design rule, capacity-driven technology selection, the five-layer reference architecture, sharding decision tree, caching strategy, and the patterns FAANG candidates use to deliver consistently strong system designs.
Design Is a Mechanical Skill — Treat It That Way
Strong HLD candidates do not invent architectures from scratch in 45 minutes. They run a playbook: a sequence of patterns and templates that produce defensible designs reliably. The skill is not raw creativity — it is knowing which pattern fits which constraint, and assembling the patterns under time pressure.
This page is the mechanical playbook. It assumes you already know how to approach the interview (covered on the companion page) and focuses on what to draw and why.
The rule that organizes everything: constraints drive design, not the other way around. Read QPS drives caching. Write QPS drives the database engine choice. Storage size drives sharding. Latency budget drives CDN, replication strategy, and async vs sync paths. If you cannot tie a component to a specific number, that component should not be in your design.
The second rule: API first, data model second, components third. Most candidates draw boxes first and figure out what they do later. The reverse is correct — define the external contract (what the system promises), then the data model that supports it, then the components that implement it. This order eliminates 70% of the inconsistencies that show up in sloppy designs.
The 6-Phase Design Playbook
Phase 1 — Translate requirements into 3 numbers (5 min)
Read QPS · Write QPS · Storage. Every architecture decision references these. Compute average and peak (3-5x). Without these, every choice is a guess.
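A minimal sketch of the Phase 1 arithmetic, with inputs assumed to roughly match the feed example at the end of this page; every constant is an assumption you would replace with the interviewer's numbers:

```python
# Phase 1 back-of-envelope; all inputs are assumptions for a feed-like service.
DAU = 200_000_000             # daily active users
POSTS_PER_USER_PER_DAY = 2    # writes per user
READS_PER_USER_PER_DAY = 60   # timeline fetches per user
AVG_POST_BYTES = 500          # content + metadata
PEAK_FACTOR = 4               # peak = 3-5x average; pick a middle value

SECONDS_PER_DAY = 86_400
write_qps = DAU * POSTS_PER_USER_PER_DAY / SECONDS_PER_DAY
read_qps = DAU * READS_PER_USER_PER_DAY / SECONDS_PER_DAY
storage_gb_per_day = DAU * POSTS_PER_USER_PER_DAY * AVG_POST_BYTES / 1e9

print(f"write QPS: {write_qps:,.0f} avg, {write_qps * PEAK_FACTOR:,.0f} peak")
print(f"read QPS:  {read_qps:,.0f} avg, {read_qps * PEAK_FACTOR:,.0f} peak")
print(f"storage:   {storage_gb_per_day:,.0f} GB/day, "
      f"{storage_gb_per_day * 365 / 1000:,.0f} TB/year")
```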
Phase 2 — Define the API contract (5 min)
List 3-5 core endpoints with method + path + request shape + response shape + latency budget. The API constrains everything downstream: if GET /feed must respond in under 200ms, the data model has to support that budget.
Phase 3 — Design the data model (5 min)
Entities + relationships + access patterns. ERD-level — not full schema. State the access pattern explicitly: 'lookup user_id → list of posts' tells you whether you need a relational join, a wide-column scan, or a key-value lookup.
Phase 4 — Lay down the 5-layer reference (5 min)
Client → Edge (CDN, LB) → Service → Cache + Data → Async. This is the skeleton 90% of HLD architectures resolve to. Customize by adding what's required and removing what isn't.
Phase 5 — Apply the sharding + caching decision trees (10 min)
Sharding: pick a shard key and justify (user_id for user-partitioned data; geohash for location data; tenant_id for B2B). Caching: pick the strategy (cache-aside, write-through, refresh-ahead) per data type and justify with the staleness tolerance.
Phase 6 — Deep dive on the 2 hardest components (10 min)
The hardest components are usually the ones with consistency, ordering, or scale tradeoffs. Pick two and walk through: data structure, algorithm, failure mode, scaling cliff.
The 5-Layer Reference Architecture (Customize, Don't Recreate)
API-First Design Eliminates 70% of Bugs
Before drawing a single component box, write down 3-5 API endpoints with their full contract:
```text
POST /api/v1/posts
  body:    { user_id, content, media_url? }
  returns: { post_id, created_at }
  SLA:     P99 < 300ms, durably persisted before response

GET /api/v1/feed?user_id=X&cursor=Y&limit=20
  returns: { posts[], next_cursor, has_more }
  SLA:     P99 < 200ms, eventual consistency OK (≤ 30s lag)
```
Writing the contract forces three decisions: (1) what each operation produces or consumes; (2) the latency budget; (3) the consistency requirement. With those three, every downstream architectural choice has a forcing function. Without them, you're guessing.
Most candidates skip API-first because it feels like 'just paperwork.' The cost: 30 minutes later, when the interviewer asks 'how does feed pagination work,' you realize your timeline data structure can't support cursor pagination efficiently — and you have to re-design the whole storage layer.
Capacity-Driven Technology Selection
Step 1 — Identify the dominant operation
Is this a read-heavy system (1000:1 reads:writes — feeds, search), write-heavy (events, logs), or balanced (transactions)? The dominant operation determines the storage engine class.
Step 2 — Match the read pattern to the storage
Key-value lookup → Redis or DynamoDB. Range scan by time → Cassandra, ClickHouse, or InfluxDB. Relational join → Postgres or MySQL. Full-text → Elasticsearch. Graph traversal → Neo4j. Document → MongoDB. Mismatch is the most common architectural mistake.
Step 3 — Match the write pattern to the engine
B-tree (Postgres, MySQL) handles ~5K writes/sec per node — good for moderate write rates and complex queries. LSM-tree (Cassandra, RocksDB-based) handles ~50K+ writes/sec — good for high write throughput at the cost of read amplification. Append-only log (Kafka, Pulsar) handles ~1M+ events/sec — good for event streams, not point queries.
Step 4 — Pick the consistency model
Strong consistency (Postgres single-leader, Spanner) — for financial transactions, account state. Linearizability per key (DynamoDB strong-read, Cassandra LWT) — for inventory, user state. Eventual consistency (Cassandra default, DynamoDB default) — for feeds, recommendations, analytics. Mismatching consistency to need is the second most common mistake.
Step 5 — Pick the replication topology
Single leader + sync replicas (Postgres, MySQL) — fastest writes, simple, but leader is the SPOF. Multi-leader (Cassandra, DynamoDB Global Tables) — multi-region writes, eventually consistent, complex conflict resolution. Leaderless quorum (Cassandra default) — tunable consistency, no single point of failure.
Step 6 — Confirm with a back-of-envelope check
Does the chosen technology actually handle the rate? Postgres at 100K writes/sec with no sharding is a wrong answer. Verify: writes/sec × replication factor < per-node throughput × cluster size. If the check fails, shard harder or pick a different engine.
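A back-of-envelope version of the Step 6 check; every number below is an illustrative assumption:

```python
# Step 6 sanity check: does the cluster absorb the write rate?
write_qps_peak = 15_000        # from Phase 1
replication_factor = 3         # each logical write lands on 3 nodes
node_write_capacity = 10_000   # per-node writes/sec, leaving headroom
cluster_size = 6

required = write_qps_peak * replication_factor   # cluster-wide write ops/sec
available = node_write_capacity * cluster_size

print(f"required {required:,} vs available {available:,} writes/sec")
if required > available:
    print("undersized: add nodes, shard harder, or switch to an LSM engine")
```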
Database Selection Decision Matrix
| Use case | Engine | Why | When NOT to use |
|---|---|---|---|
| Transactional data with joins (orders, accounts) | PostgreSQL / MySQL | ACID, rich query language, mature ecosystem, B-tree index for point + range | Write rate > 50K/sec without sharding; data > 10TB without partitioning |
| High-write key-value (event log, time-series messages) | Cassandra / ScyllaDB | LSM-tree write throughput, leaderless multi-master, linear scale-out | Complex relational queries; transactional integrity across rows |
| Sub-ms key-value lookup, session, leaderboard | Redis | In-memory, sorted sets, atomic INCR, pub/sub | Data does not fit in RAM ($$$); strong durability requirement |
| Document store with flexible schema | MongoDB / DocumentDB | Schemaless, indexed nested fields, change streams | Highly relational data; cross-document transactions at scale |
| Wide-column key-value with predictable scale | DynamoDB / Bigtable | Managed, autoscaling, single-digit ms reads, GSIs | Complex queries; cost at very high read QPS (>1M reads/sec) |
| Full-text search, faceted queries | Elasticsearch / OpenSearch | Inverted index, BM25 scoring, aggregations | Source of truth (use as a denormalized index, not primary store) |
| Graph traversal (social, recommendations) | Neo4j / Neptune / Dgraph | Native graph storage, multi-hop traversal queries | Workloads that fit relational joins ≤ 2-3 levels deep |
| Time-series metrics | InfluxDB / TimescaleDB / VictoriaMetrics | Compression, retention policies, downsampling | OLTP workloads; complex relational queries |
| Analytics / OLAP | ClickHouse / BigQuery / Snowflake | Columnar storage, parallel scans, sub-second on TB | OLTP; per-row reads or updates |
| Blob storage (images, video, exports) | S3 / GCS / Azure Blob | Cheap, durable, CDN-ready, lifecycle policies | Frequent small reads where latency matters (use Redis or DB) |
Sharding Decision Tree — Pick a Shard Key You Can Defend
Step 1 — Identify the dominant access pattern
Most reads come from one entity ID? Shard by that ID. Examples: user_id for social media, tenant_id for B2B SaaS, geohash for location-based services. Mismatch causes scatter-gather queries that defeat sharding.
Step 2 — Check for hot keys
Will any single key receive 10x average traffic? Celebrity user_id, popular product_id. If yes, hot key isolation: dedicate a separate shard or cache tier for the top 0.1%. Or pre-shard hot keys: user_id::shard_0..N.
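One way to pre-shard a hot read key, assuming the value is small enough to replicate across N sub-keys; for hot write keys such as counters, the roles invert (scatter writes, aggregate on read):

```python
import random

N_REPLICAS = 8  # assumption: spread each hot key over 8 sub-keys

def replica_keys(user_id: str) -> list[str]:
    """All sub-keys holding copies of a hot key; writes update every copy."""
    return [f"user::{user_id}::shard_{i}" for i in range(N_REPLICAS)]

def read_key(user_id: str) -> str:
    """Each read picks one copy at random, spreading load across shards."""
    return f"user::{user_id}::shard_{random.randrange(N_REPLICAS)}"
```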
Step 3 — Pick range vs hash sharding
Hash sharding (consistent hashing, jump hash) → uniform load distribution, hard to do range queries. Range sharding (sorted by key) → range queries are fast, hot ranges cause imbalance. Default: hash. Use range only when range queries are the dominant pattern (time-series).
Step 4 — Plan for resharding
Resharding is painful — design to minimize the need. Use consistent hashing (only K/N keys move on a node addition, vs all keys with naive modulo). Use virtual nodes (each physical node owns 100-256 virtual nodes) for finer-grained rebalancing.
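A compact consistent-hash ring with virtual nodes; this is a sketch, not a production implementation (real systems add replication, weighted nodes, and membership gossip):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 used only for uniform key spreading, not for security.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring with virtual nodes; only ~K/N keys move when a node joins."""

    def __init__(self, vnodes: int = 128):
        self.vnodes = vnodes
        self.ring: list[tuple[int, str]] = []  # sorted (hash, node) pairs

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        # The first vnode clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing()
for n in ("db-0", "db-1", "db-2"):
    ring.add_node(n)
print(ring.get_node("user::42"))  # stable while membership is stable
```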
Step 5 — Handle cross-shard transactions
Cross-shard transactions erode the benefits of sharding. If they are unavoidable: 2-phase commit (slow, complex), the saga pattern (eventually consistent, with complex compensations), or denormalize so the cross-shard query becomes a single-shard query (the most common fix).
Caching Strategy — Pick the Right Pattern Per Data Type
Cache-aside (lazy loading) — the default
App reads from cache first; on a miss, it reads the DB and writes the result to cache. App writes go to the DB, then invalidate the cache entry. Pros: simple; only data that is actually read gets cached. Cons: cache-miss spikes on cold start (thundering herd). Use for: most read-heavy data (user profiles, product catalog).
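A minimal cache-aside sketch using redis-py; `db_fetch` and `db_write` are hypothetical accessors standing in for your data layer:

```python
import json
import redis  # assumption: redis-py against a local Redis

r = redis.Redis()
TTL_SECONDS = 300  # staleness tolerance for this data type (assumption)

def get_user(user_id: int, db_fetch) -> dict:
    """Cache-aside read: try cache, fall back to DB, populate cache."""
    key = f"user::{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    row = db_fetch(user_id)                    # hypothetical DB accessor
    r.set(key, json.dumps(row), ex=TTL_SECONDS)
    return row

def update_user(user_id: int, fields: dict, db_write) -> None:
    """Cache-aside write: write the DB, then invalidate (not update) the cache."""
    db_write(user_id, fields)                  # hypothetical DB accessor
    r.delete(f"user::{user_id}")               # next read repopulates
```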
Write-through — strong consistency at write cost
App writes to cache and DB synchronously (cache update is part of the write path). Pros: reads always hot and consistent. Cons: writes are slower; cache stores everything written, even rarely-read items. Use for: read-after-write patterns where stale reads are unacceptable (account balances, settings).
Write-behind (write-back) — write throughput
App writes to cache; cache asynchronously persists to DB. Pros: very high write throughput. Cons: data loss on cache failure; complex consistency. Use for: high-write systems where a small data loss window is acceptable (analytics counters, view counts).
Refresh-ahead — predictable hit rate
Cache proactively refreshes entries before TTL expires. Pros: consistent hit rate, no cold-start spikes. Cons: refreshes data that may not be read; overhead. Use for: known-hot data with predictable access patterns (top-N feeds, trending items).
Read-through — abstraction simplification
Cache itself loads from DB on miss (app sees only cache). Pros: simple app code. Cons: cache becomes critical path. Use for: managed cache services (DAX, ElastiCache), where the cache abstraction handles miss path.
Cache Failure Modes and Mitigations
| Failure mode | What happens | Mitigation |
|---|---|---|
| Cold start (cache empty) | All reads miss → DB load spikes 10-100x | Pre-warm cache with top-N keys before opening traffic; gradual ramp-up |
| Thundering herd (many misses for same key) | 100 concurrent misses → 100 DB queries for the same row | Single-flight lock per key (only the first request hits the DB; others wait; see the sketch after this table); request coalescing |
| Cache eviction (memory pressure) | LRU evicts hot keys → miss rate climbs → DB load climbs | Size cache to fit working set; promote critical keys with no-evict policy; use multi-tier cache |
| Cache cluster failure | Entire cache layer down → 100% miss → DB at full read load | Pre-provision read replicas to absorb 5-10× read load; circuit breaker; degrade gracefully (serve stale) |
| Stale cache (write missed invalidation) | Reads return outdated data | Short TTL as fallback; double-write to cache + DB with retry; event-driven invalidation via CDC |
| Hot key (single key gets 10× load) | Single Redis shard saturated → P99 latency spikes | Pre-shard the hot key (user::123::shard_0..7); local in-process cache for hottest top-100 |
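The single-flight mitigation from the thundering-herd row above, sketched per process with threading; a distributed variant would take a short-TTL Redis lock instead:

```python
import threading

_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def single_flight(key: str, cache_get, cache_set, load):
    """Only one caller per key loads from the DB; the rest wait, then re-read cache."""
    value = cache_get(key)
    if value is not None:
        return value
    with _locks_guard:                  # get-or-create the per-key lock
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        value = cache_get(key)          # re-check: a peer may have filled it
        if value is None:
            value = load(key)           # exactly one DB hit per stampede
            cache_set(key, value)
    return value
```

Note the lock table grows with distinct keys; a real version evicts idle locks.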
The Cache Hit Rate Math You Must Know
Caching is not a feature — it is the load-bearing infrastructure that makes most read-heavy systems feasible. The math:
- Database handles ~5K reads/sec per node (complex queries) to ~50K reads/sec per node (simple key lookup).
- Redis handles ~100K reads/sec per node (simple key lookup).
- At 100K reads/sec, a single Redis node beats a 10-node DB read-replica cluster.
Now the load-relief math: at 100K reads/sec with a 95% cache hit rate, the DB sees 100K × 5% = 5K reads/sec, within reach of a single Postgres node. At a 90% hit rate, the DB sees 10K reads/sec and needs replicas. At 80%, it sees 20K and needs sharding.
Implication: cache hit rate is not a soft KPI. A 5% drop in hit rate (95% → 90%) doubles the DB load. Always design for ≥ 95% hit rate on hot paths and monitor the rate in production. If the hit rate drops, the DB tier collapses next.
Cache hit rate is achieved by: (1) sufficient cache memory to hold the working set; (2) appropriate TTL (long enough to capture re-reads, short enough to limit staleness); (3) appropriate eviction policy (LRU for Zipf-distributed access, LFU when access frequency is bimodal).
Patterns for Async, Eventual Consistency, and Failure Handling
Outbox pattern — atomic write + publish
Atomically write the change to the DB AND an 'outbox' table within the same transaction. A separate process reads outbox, publishes to Kafka, marks as published. Solves the dual-write problem (DB write succeeds but message publish fails).
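A sketch of the outbox pattern against a Postgres-style DB-API connection (psycopg2 semantics assumed); table names, columns, and the `publish` callback are assumptions:

```python
import json

def create_order(conn, order: dict) -> None:
    """Business write and outbox insert commit or roll back together."""
    with conn:  # one transaction
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (id, user_id, total) VALUES (%s, %s, %s)",
                (order["id"], order["user_id"], order["total"]),
            )
            cur.execute(
                "INSERT INTO outbox (aggregate_id, topic, payload) VALUES (%s, %s, %s)",
                (order["id"], "orders.created", json.dumps(order)),
            )

def relay_outbox(conn, publish) -> None:
    """Separate poller: publish unpublished rows, then mark them published."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, topic, payload FROM outbox "
                "WHERE published_at IS NULL ORDER BY id LIMIT 100 "
                "FOR UPDATE SKIP LOCKED"
            )
            for row_id, topic, payload in cur.fetchall():
                publish(topic, payload)  # e.g. a Kafka producer send
                cur.execute(
                    "UPDATE outbox SET published_at = now() WHERE id = %s",
                    (row_id,),
                )
```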
Saga pattern — long-running cross-service transactions
Split a multi-step operation into per-service local transactions, each with a compensation action. On step N failure, run compensations N-1..1 in reverse. Use for: order placement (reserve inventory → charge card → notify shipping); each step is local + compensatable.
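A minimal saga runner, assuming each step is a named (action, compensation) pair; the example steps are toy stand-ins for real inventory and payment service calls:

```python
def run_saga(steps, ctx):
    """Run each local transaction; on failure, compensate in reverse order."""
    done = []
    try:
        for name, action, compensate in steps:
            action(ctx)
            done.append((name, compensate))
    except Exception:
        for name, compensate in reversed(done):
            compensate(ctx)  # compensations must be idempotent
        raise

# Toy order-placement saga; real steps would call downstream services.
steps = [
    ("reserve_inventory", lambda c: c.setdefault("reserved", True),
                          lambda c: c.pop("reserved", None)),
    ("charge_card",       lambda c: c.setdefault("charged", True),
                          lambda c: c.pop("charged", None)),
]
run_saga(steps, {})
```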
CQRS — separate read and write models
Write model optimized for transactional updates (Postgres). Read model optimized for queries (Elasticsearch, denormalized Cassandra). CDC pipeline (Debezium → Kafka → projection service) keeps them in sync. Use when read and write patterns diverge.
Idempotency keys — safe retries
Client generates a unique idempotency key per logical operation. Server stores the key + result for 24h. Repeat requests with the same key return the stored result. Required for: any non-trivial mutation that can be retried (payments, orders).
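A sketch of idempotency-key handling with Redis assumed as the key store; `process` is the hypothetical mutation being protected:

```python
import json
import redis  # assumption: redis-py against a local Redis

r = redis.Redis()
TTL = 24 * 3600  # keep results for 24h, per the pattern above

def handle_payment(idempotency_key: str, request: dict, process) -> dict:
    """Return the stored result for a repeated key instead of re-charging."""
    store_key = f"idem::{idempotency_key}"
    # SET NX claims the key atomically; only the first request proceeds.
    if not r.set(store_key, "IN_PROGRESS", nx=True, ex=TTL):
        existing = r.get(store_key)
        if existing == b"IN_PROGRESS":
            raise RuntimeError("retry later: original request still running")
        return json.loads(existing)
    try:
        result = process(request)   # the actual mutation, runs exactly once
    except Exception:
        r.delete(store_key)         # allow the client to retry after a failure
        raise
    r.set(store_key, json.dumps(result), ex=TTL)
    return result
```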
Circuit breaker — fail fast
Track downstream error rate. When error rate > 50% for 1 minute, open the circuit (immediately fail without calling). After 30 seconds, half-open (try one request). Closes when success rate recovers. Prevents cascade failure during downstream outages.
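A simplified, single-threaded circuit-breaker sketch using the thresholds from the paragraph above; production versions add thread safety, jittered probes, and metrics:

```python
import time

class CircuitBreaker:
    """Fail fast when a downstream is unhealthy; probe periodically to recover."""

    def __init__(self, failure_threshold=0.5, window=60, cooldown=30, min_calls=10):
        self.failure_threshold = failure_threshold
        self.window = window        # seconds over which errors are counted
        self.cooldown = cooldown    # open duration before a half-open probe
        self.min_calls = min_calls  # don't trip on tiny samples
        self.calls: list[tuple[float, bool]] = []  # (timestamp, ok)
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: let one request probe
        try:
            result = fn(*args, **kwargs)
            self._record(now, ok=True)
            return result
        except Exception:
            self._record(now, ok=False)
            self._maybe_open(now)
            raise

    def _record(self, now, ok):
        # Drop samples that have aged out of the window, then append.
        self.calls = [(t, o) for t, o in self.calls if now - t < self.window]
        self.calls.append((now, ok))

    def _maybe_open(self, now):
        if len(self.calls) >= self.min_calls:
            failures = sum(1 for _, ok in self.calls if not ok)
            if failures / len(self.calls) >= self.failure_threshold:
                self.opened_at = now
```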
Bulkhead — failure isolation
Isolate resource pools so a failure in one feature does not exhaust resources for others. Example: separate thread pools for high-priority API calls and low-priority background tasks. The name comes from ship design: watertight bulkheads keep a single hull breach from flooding the entire ship, and the software pattern does the same for shared resources.
Worked Example: Designing a Feed Service from Scratch
Requirements: 200M DAU, 4K posts/sec average (15K peak), 140K timeline reads/sec average (450K peak), P99 timeline read < 200ms, eventual consistency OK (lag ≤ 30s).
Phase 1 — Numbers: read QPS 140K, write QPS 4K, storage 200GB/day = 73TB/year.
Phase 2 — API: POST /posts (body: user_id, content, media_url; response: post_id, created_at). GET /feed?user_id=X&cursor=Y&limit=20 (response: posts[], next_cursor).
Phase 3 — Data model: Post(post_id, user_id, content, media_url, created_at). Follow(follower_id, followee_id, created_at). Timeline(user_id, post_id, created_at) — denormalized, pre-computed.
Phase 4 — 5-layer reference: client → CDN (media) + ALB → API gateway → feed service + post service → Redis (timeline cache) + Cassandra (post store + timeline store) → Kafka (post events) → fan-out workers.
Phase 5 — Sharding + caching: Shard posts by user_id (consistent hashing). Cache top-N user timelines in Redis (24h TTL, refresh-ahead). Cache hit rate target 95%.
Phase 6 — Hardest components:
Fan-out service: on POST, publish to Kafka. Fan-out worker reads, looks up follower list, writes to each follower's timeline shard. For users with < 100K followers, fan-out on write (cheap reads). For celebrities with > 100K followers, hybrid: fan-out on write to active followers, fan-out on read for the rest. Estimated fan-out latency: 100ms P50, 5s P99 for celebrities.
Timeline cache: Redis sorted set per user, scored by created_at. ZREVRANGEBYSCORE for cursor pagination. Cache the most recent 1K posts per active user. On a cache miss, read from the Cassandra timeline shard. Cassandra timeline schema: PRIMARY KEY (user_id, created_at) WITH CLUSTERING ORDER BY (created_at DESC), which gives natural ordering and a single-shard read.
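A sketch of the cursor-paginated timeline read against the Redis sorted set, using redis-py; the Cassandra fallback on a cache miss is omitted, and key names are assumptions:

```python
import redis  # assumption: timelines are pre-computed by the fan-out workers

r = redis.Redis()

def read_timeline(user_id: int, cursor: float | None = None, limit: int = 20):
    """Cursor pagination over a per-user sorted set scored by created_at."""
    key = f"timeline::{user_id}"
    # '(' makes the bound exclusive, so the cursor row itself is not repeated.
    max_score = f"({cursor}" if cursor is not None else "+inf"
    entries = r.zrevrangebyscore(key, max_score, "-inf",
                                 start=0, num=limit, withscores=True)
    post_ids = [pid.decode() for pid, _ in entries]
    next_cursor = entries[-1][1] if len(entries) == limit else None
    return {"posts": post_ids, "next_cursor": next_cursor,
            "has_more": next_cursor is not None}
```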
Failure modes: Kafka outage → posts queue locally with retry; once Kafka recovers, drain. Fan-out worker lag → monitor Kafka consumer lag; auto-scale workers. Cache cluster failure → fall back to Cassandra (pre-provisioned for 5x read load), serve degraded latency until cache recovers.
The full design takes 35 minutes if you run the playbook. Without the playbook, it takes 50 minutes and produces inconsistencies.