How to Design at HLD: From Blank Whiteboard to Defensible Architecture
The mechanical playbook for HLD interview execution. Covers the API-first design rule, capacity-driven technology selection, the five-layer reference architecture, sharding decision tree, caching strategy, and the patterns FAANG candidates use to deliver consistently strong system designs.
Design Is a Mechanical Skill — Treat It That Way
Strong HLD candidates do not invent architectures from scratch in 45 minutes. They run a playbook: a sequence of patterns and templates that produce defensible designs reliably. The skill is not raw creativity — it is knowing which pattern fits which constraint, and assembling the patterns under time pressure.
This page is the mechanical playbook. It assumes you already know how to approach the interview (covered on the companion page) and focuses on what to draw and why.
The rule that organizes everything: constraints drive design, not the other way around. Read QPS drives caching. Write QPS drives the database engine choice. Storage size drives sharding. Latency budget drives CDN, replication strategy, and async vs sync paths. If you cannot tie a component to a specific number, that component should not be in your design.
The second rule: API first, data model second, components third. Most candidates draw boxes first and figure out what they do later. The reverse is correct — define the external contract (what the system promises), then the data model that supports it, then the components that implement it. This order eliminates 70% of the inconsistencies that show up in sloppy designs.
The 6-Phase Design Playbook
Phase 1 — Translate requirements into 3 numbers (5 min)
Read QPS · Write QPS · Storage. Every architecture decision references these. Compute average and peak (3-5x). Without these, every choice is a guess.
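A minimal sketch of the Phase 1 arithmetic, with inputs assumed to roughly match the feed example at the end of this page; every constant is an assumption you would replace with the interviewer's numbers:

```python
# Phase 1 back-of-envelope; all inputs are assumptions for a feed-like service.
DAU = 200_000_000             # daily active users
POSTS_PER_USER_PER_DAY = 2    # writes per user
READS_PER_USER_PER_DAY = 60   # timeline fetches per user
AVG_POST_BYTES = 500          # content + metadata
PEAK_FACTOR = 4               # peak = 3-5x average; pick a middle value

SECONDS_PER_DAY = 86_400
write_qps = DAU * POSTS_PER_USER_PER_DAY / SECONDS_PER_DAY
read_qps = DAU * READS_PER_USER_PER_DAY / SECONDS_PER_DAY
storage_gb_per_day = DAU * POSTS_PER_USER_PER_DAY * AVG_POST_BYTES / 1e9

print(f"write QPS: {write_qps:,.0f} avg, {write_qps * PEAK_FACTOR:,.0f} peak")
print(f"read QPS:  {read_qps:,.0f} avg, {read_qps * PEAK_FACTOR:,.0f} peak")
print(f"storage:   {storage_gb_per_day:,.0f} GB/day, "
      f"{storage_gb_per_day * 365 / 1000:,.0f} TB/year")
```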
Phase 2 — Define the API contract (5 min)
List 3-5 core endpoints with method + path + request shape + response shape + latency budget. The API constrains everything downstream: if GET /feed must respond in under 200ms, the data model has to support that budget.
Phase 3 — Design the data model (5 min)
Entities + relationships + access patterns. ERD-level — not full schema. State the access pattern explicitly: 'lookup user_id → list of posts' tells you whether you need a relational join, a wide-column scan, or a key-value lookup.
Phase 4 — Lay down the 5-layer reference (5 min)
Client → Edge (CDN, LB) → Service → Cache + Data → Async. This is the skeleton 90% of HLD architectures resolve to. Customize by adding what's required and removing what isn't.
Phase 5 — Apply the sharding + caching decision trees (10 min)
Sharding: pick a shard key and justify (user_id for user-partitioned data; geohash for location data; tenant_id for B2B). Caching: pick the strategy (cache-aside, write-through, refresh-ahead) per data type and justify with the staleness tolerance.
Phase 6 — Deep dive on the 2 hardest components (10 min)
The hardest components are usually the ones with consistency, ordering, or scale tradeoffs. Pick two and walk through: data structure, algorithm, failure mode, scaling cliff.
The 5-Layer Reference Architecture (Customize, Don't Recreate)
API-First Design Eliminates 70% of Bugs
Before drawing a single component box, write down 3-5 API endpoints with their full contract:
```text
POST /api/v1/posts
  body:    { user_id, content, media_url? }
  returns: { post_id, created_at }
  SLA:     P99 < 300ms, durably persisted before response

GET /api/v1/feed?user_id=X&cursor=Y&limit=20
  returns: { posts[], next_cursor, has_more }
  SLA:     P99 < 200ms, eventual consistency OK (≤ 30s lag)
```
Writing the contract forces three decisions: (1) what each operation produces or consumes; (2) the latency budget; (3) the consistency requirement. With those three, every downstream architectural choice has a forcing function. Without them, you're guessing.
Most candidates skip API-first because it feels like 'just paperwork.' The cost: 30 minutes later, when the interviewer asks 'how does feed pagination work,' you realize your timeline data structure can't support cursor pagination efficiently — and you have to re-design the whole storage layer.
Capacity-Driven Technology Selection
Step 1 — Identify the dominant operation
Is this a read-heavy system (1000:1 reads:writes — feeds, search), write-heavy (events, logs), or balanced (transactions)? The dominant operation determines the storage engine class.
Step 2 — Match the read pattern to the storage
Key-value lookup → Redis or DynamoDB. Range scan by time → Cassandra, ClickHouse, or InfluxDB. Relational join → Postgres or MySQL. Full-text → Elasticsearch. Graph traversal → Neo4j. Document → MongoDB. Mismatch is the most common architectural mistake.
Step 3 — Match the write pattern to the engine
B-tree (Postgres, MySQL) handles ~5K writes/sec per node — good for moderate write rates and complex queries. LSM-tree (Cassandra, RocksDB-based) handles ~50K+ writes/sec — good for high write throughput at the cost of read amplification. Append-only log (Kafka, Pulsar) handles ~1M+ events/sec — good for event streams, not point queries.
Step 4 — Pick the consistency model
Strong consistency (Postgres single-leader, Spanner) — for financial transactions, account state. Linearizability per key (DynamoDB strong-read, Cassandra LWT) — for inventory, user state. Eventual consistency (Cassandra default, DynamoDB default) — for feeds, recommendations, analytics. Mismatching consistency to need is the second most common mistake.
Step 5 — Pick the replication topology
Single leader + sync replicas (Postgres, MySQL) — fastest writes, simple, but leader is the SPOF. Multi-leader (Cassandra, DynamoDB Global Tables) — multi-region writes, eventually consistent, complex conflict resolution. Leaderless quorum (Cassandra default) — tunable consistency, no single point of failure.
Step 6 — Confirm with a back-of-envelope check
Does the chosen technology actually handle the rate? Postgres at 100K writes/sec with no sharding is a wrong answer. Verify: writes/sec × replication factor < per-node throughput × cluster size. If the check fails, shard harder or pick a different engine.
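A back-of-envelope version of the Step 6 check; every number below is an illustrative assumption:

```python
# Step 6 sanity check: does the cluster absorb the write rate?
write_qps_peak = 15_000        # from Phase 1
replication_factor = 3         # each logical write lands on 3 nodes
node_write_capacity = 10_000   # per-node writes/sec, leaving headroom
cluster_size = 6

required = write_qps_peak * replication_factor   # cluster-wide write ops/sec
available = node_write_capacity * cluster_size

print(f"required {required:,} vs available {available:,} writes/sec")
if required > available:
    print("undersized: add nodes, shard harder, or switch to an LSM engine")
```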
Database Selection Decision Matrix
| Use case | Engine | Why | When NOT to use |
|---|---|---|---|
| Transactional data with joins (orders, accounts) | PostgreSQL / MySQL | ACID, rich query language, mature ecosystem, B-tree index for point + range | Write rate > 50K/sec without sharding; data > 10TB without partitioning |
| High-write key-value (event log, time-series messages) | Cassandra / ScyllaDB | LSM-tree write throughput, leaderless multi-master, linear scale-out | Complex relational queries; transactional integrity across rows |
| Sub-ms key-value lookup, session, leaderboard | Redis | In-memory, sorted sets, atomic INCR, pub/sub | Data does not fit in RAM ($$$); strong durability requirement |
| Document store with flexible schema | MongoDB / DocumentDB | Schemaless, indexed nested fields, change streams | Highly relational data; cross-document transactions at scale |
| Wide-column key-value with predictable scale | DynamoDB / Bigtable | Managed, autoscaling, single-digit ms reads, GSIs | Complex queries; cost at very high read QPS (>1M reads/sec) |
| Full-text search, faceted queries | Elasticsearch / OpenSearch | Inverted index, BM25 scoring, aggregations | Source of truth (use as a denormalized index, not primary store) |
| Graph traversal (social, recommendations) | Neo4j / Neptune / Dgraph | Native graph storage, multi-hop traversal queries | Workloads that fit relational joins ≤ 2-3 levels deep |
| Time-series metrics | InfluxDB / TimescaleDB / VictoriaMetrics | Compression, retention policies, downsampling | OLTP workloads; complex relational queries |
| Analytics / OLAP | ClickHouse / BigQuery / Snowflake | Columnar storage, parallel scans, sub-second on TB | OLTP; per-row reads or updates |
| Blob storage (images, video, exports) | S3 / GCS / Azure Blob | Cheap, durable, CDN-ready, lifecycle policies | Frequent small reads where latency matters (use Redis or DB) |
Sharding Decision Tree — Pick a Shard Key You Can Defend
Step 1 — Identify the dominant access pattern
Most reads come from one entity ID? Shard by that ID. Examples: user_id for social media, tenant_id for B2B SaaS, geohash for location-based services. Mismatch causes scatter-gather queries that defeat sharding.
Step 2 — Check for hot keys
Will any single key receive 10x average traffic? Celebrity user_id, popular product_id. If yes, hot key isolation: dedicate a separate shard or cache tier for the top 0.1%. Or pre-shard hot keys: user_id::shard_0..N.
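One way to pre-shard a hot read key, assuming the value is small enough to replicate across N sub-keys; for hot write keys such as counters, the roles invert (scatter writes, aggregate on read):

```python
import random

N_REPLICAS = 8  # assumption: spread each hot key over 8 sub-keys

def replica_keys(user_id: str) -> list[str]:
    """All sub-keys holding copies of a hot key; writes update every copy."""
    return [f"user::{user_id}::shard_{i}" for i in range(N_REPLICAS)]

def read_key(user_id: str) -> str:
    """Each read picks one copy at random, spreading load across shards."""
    return f"user::{user_id}::shard_{random.randrange(N_REPLICAS)}"
```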
Step 3 — Pick range vs hash sharding
Hash sharding (consistent hashing, jump hash) → uniform load distribution, hard to do range queries. Range sharding (sorted by key) → range queries are fast, hot ranges cause imbalance. Default: hash. Use range only when range queries are the dominant pattern (time-series).
Step 4 — Plan for resharding
Resharding is painful — design to minimize the need. Use consistent hashing (only K/N keys move on a node addition, vs all keys with naive modulo). Use virtual nodes (each physical node owns 100-256 virtual nodes) for finer-grained rebalancing.
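A compact consistent-hash ring with virtual nodes; this is a sketch, not a production implementation (real systems add replication, weighted nodes, and membership gossip):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 used only for uniform key spreading, not for security.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring with virtual nodes; only ~K/N keys move when a node joins."""

    def __init__(self, vnodes: int = 128):
        self.vnodes = vnodes
        self.ring: list[tuple[int, str]] = []  # sorted (hash, node) pairs

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        # The first vnode clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing()
for n in ("db-0", "db-1", "db-2"):
    ring.add_node(n)
print(ring.get_node("user::42"))  # stable while membership is stable
```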
Step 5 — Handle cross-shard transactions
Cross-shard transactions erode the benefits of sharding. If they are unavoidable: 2-phase commit (slow, complex), the saga pattern (eventually consistent, with complex compensations), or denormalize so the cross-shard query becomes a single-shard query (the most common fix).
Caching Strategy — Pick the Right Pattern Per Data Type
Cache-aside (lazy loading) — the default
App reads from cache first; on a miss, it reads the DB and writes the result to cache. App writes go to the DB, then invalidate the cache entry. Pros: simple; only data that is actually read gets cached. Cons: cache-miss spikes on cold start (thundering herd). Use for: most read-heavy data (user profiles, product catalog).
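A minimal cache-aside sketch using redis-py; `db_fetch` and `db_write` are hypothetical accessors standing in for your data layer:

```python
import json
import redis  # assumption: redis-py against a local Redis

r = redis.Redis()
TTL_SECONDS = 300  # staleness tolerance for this data type (assumption)

def get_user(user_id: int, db_fetch) -> dict:
    """Cache-aside read: try cache, fall back to DB, populate cache."""
    key = f"user::{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    row = db_fetch(user_id)                    # hypothetical DB accessor
    r.set(key, json.dumps(row), ex=TTL_SECONDS)
    return row

def update_user(user_id: int, fields: dict, db_write) -> None:
    """Cache-aside write: write the DB, then invalidate (not update) the cache."""
    db_write(user_id, fields)                  # hypothetical DB accessor
    r.delete(f"user::{user_id}")               # next read repopulates
```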
Write-through — strong consistency at write cost
App writes to cache and DB synchronously (cache update is part of the write path). Pros: reads always hot and consistent. Cons: writes are slower; cache stores everything written, even rarely-read items. Use for: read-after-write patterns where stale reads are unacceptable (account balances, settings).
Write-behind (write-back) — write throughput
App writes to cache; cache asynchronously persists to DB. Pros: very high write throughput. Cons: data loss on cache failure; complex consistency. Use for: high-write systems where a small data loss window is acceptable (analytics counters, view counts).
Refresh-ahead — predictable hit rate
Cache proactively refreshes entries before TTL expires. Pros: consistent hit rate, no cold-start spikes. Cons: refreshes data that may not be read; overhead. Use for: known-hot data with predictable access patterns (top-N feeds, trending items).
Read-through — abstraction simplification
Cache itself loads from DB on miss (app sees only cache). Pros: simple app code. Cons: cache becomes critical path. Use for: managed cache services (DAX, ElastiCache), where the cache abstraction handles miss path.
Cache Failure Modes and Mitigations
| Failure mode | What happens | Mitigation |
|---|---|---|
| Cold start (cache empty) | All reads miss → DB load spikes 10-100x | Pre-warm cache with top-N keys before opening traffic; gradual ramp-up |
| Thundering herd (many misses for same key) | 100 concurrent misses → 100 DB queries for the same row | Single-flight lock per key (only the first request hits the DB; others wait; see the sketch after this table); request coalescing |
| Cache eviction (memory pressure) | LRU evicts hot keys → miss rate climbs → DB load climbs | Size cache to fit working set; promote critical keys with no-evict policy; use multi-tier cache |
| Cache cluster failure | Entire cache layer down → 100% miss → DB at full read load | Pre-provision read replicas to absorb 5-10× read load; circuit breaker; degrade gracefully (serve stale) |
| Stale cache (write missed invalidation) | Reads return outdated data | Short TTL as fallback; double-write to cache + DB with retry; event-driven invalidation via CDC |
| Hot key (single key gets 10× load) | Single Redis shard saturated → P99 latency spikes | Pre-shard the hot key (user::123::shard_0..7); local in-process cache for hottest top-100 |
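The single-flight mitigation from the thundering-herd row above, sketched per process with threading; a distributed variant would take a short-TTL Redis lock instead:

```python
import threading

_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def single_flight(key: str, cache_get, cache_set, load):
    """Only one caller per key loads from the DB; the rest wait, then re-read cache."""
    value = cache_get(key)
    if value is not None:
        return value
    with _locks_guard:                  # get-or-create the per-key lock
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        value = cache_get(key)          # re-check: a peer may have filled it
        if value is None:
            value = load(key)           # exactly one DB hit per stampede
            cache_set(key, value)
    return value
```

Note the lock table grows with distinct keys; a real version evicts idle locks.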
The Cache Hit Rate Math You Must Know
Caching is not a feature — it is the load-bearing infrastructure that makes most read-heavy systems feasible. The math:
- Database handles ~5K reads/sec per node (complex queries) to ~50K reads/sec per node (simple key lookup).
- Redis handles ~100K reads/sec per node (simple key lookup).
- At 100K reads/sec, a single Redis node beats a 10-node DB read-replica cluster.
Now the load-relief math: at 100K reads/sec with a 95% cache hit rate, the DB sees 100K × 5% = 5K reads/sec, within reach of a single Postgres node. At a 90% hit rate, the DB sees 10K reads/sec and needs replicas. At 80%, it sees 20K and needs sharding.
Implication: cache hit rate is not a soft KPI. A 5% drop in hit rate (95% → 90%) doubles the DB load. Always design for ≥ 95% hit rate on hot paths and monitor the rate in production. If the hit rate drops, the DB tier collapses next.
Cache hit rate is achieved by: (1) sufficient cache memory to hold the working set; (2) appropriate TTL (long enough to capture re-reads, short enough to limit staleness); (3) appropriate eviction policy (LRU for Zipf-distributed access, LFU when access frequency is bimodal).
Patterns for Async, Eventual Consistency, and Failure Handling
Outbox pattern — atomic write + publish
Atomically write the change to the DB AND an 'outbox' table within the same transaction. A separate process reads outbox, publishes to Kafka, marks as published. Solves the dual-write problem (DB write succeeds but message publish fails).
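A sketch of the outbox pattern against a Postgres-style DB-API connection (psycopg2 semantics assumed); table names, columns, and the `publish` callback are assumptions:

```python
import json

def create_order(conn, order: dict) -> None:
    """Business write and outbox insert commit or roll back together."""
    with conn:  # one transaction
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (id, user_id, total) VALUES (%s, %s, %s)",
                (order["id"], order["user_id"], order["total"]),
            )
            cur.execute(
                "INSERT INTO outbox (aggregate_id, topic, payload) VALUES (%s, %s, %s)",
                (order["id"], "orders.created", json.dumps(order)),
            )

def relay_outbox(conn, publish) -> None:
    """Separate poller: publish unpublished rows, then mark them published."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, topic, payload FROM outbox "
                "WHERE published_at IS NULL ORDER BY id LIMIT 100 "
                "FOR UPDATE SKIP LOCKED"
            )
            for row_id, topic, payload in cur.fetchall():
                publish(topic, payload)  # e.g. a Kafka producer send
                cur.execute(
                    "UPDATE outbox SET published_at = now() WHERE id = %s",
                    (row_id,),
                )
```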
Saga pattern — long-running cross-service transactions
Split a multi-step operation into per-service local transactions, each with a compensation action. On step N failure, run compensations N-1..1 in reverse. Use for: order placement (reserve inventory → charge card → notify shipping); each step is local + compensatable.
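A minimal saga runner, assuming each step is a named (action, compensation) pair; the example steps are toy stand-ins for real inventory and payment service calls:

```python
def run_saga(steps, ctx):
    """Run each local transaction; on failure, compensate in reverse order."""
    done = []
    try:
        for name, action, compensate in steps:
            action(ctx)
            done.append((name, compensate))
    except Exception:
        for name, compensate in reversed(done):
            compensate(ctx)  # compensations must be idempotent
        raise

# Toy order-placement saga; real steps would call downstream services.
steps = [
    ("reserve_inventory", lambda c: c.setdefault("reserved", True),
                          lambda c: c.pop("reserved", None)),
    ("charge_card",       lambda c: c.setdefault("charged", True),
                          lambda c: c.pop("charged", None)),
]
run_saga(steps, {})
```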
CQRS — separate read and write models
Write model optimized for transactional updates (Postgres). Read model optimized for queries (Elasticsearch, denormalized Cassandra). CDC pipeline (Debezium → Kafka → projection service) keeps them in sync. Use when read and write patterns diverge.
Idempotency keys — safe retries
Client generates a unique idempotency key per logical operation. Server stores the key + result for 24h. Repeat requests with the same key return the stored result. Required for: any non-trivial mutation that can be retried (payments, orders).
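A sketch of idempotency-key handling with Redis assumed as the key store; `process` is the hypothetical mutation being protected:

```python
import json
import redis  # assumption: redis-py against a local Redis

r = redis.Redis()
TTL = 24 * 3600  # keep results for 24h, per the pattern above

def handle_payment(idempotency_key: str, request: dict, process) -> dict:
    """Return the stored result for a repeated key instead of re-charging."""
    store_key = f"idem::{idempotency_key}"
    # SET NX claims the key atomically; only the first request proceeds.
    if not r.set(store_key, "IN_PROGRESS", nx=True, ex=TTL):
        existing = r.get(store_key)
        if existing == b"IN_PROGRESS":
            raise RuntimeError("retry later: original request still running")
        return json.loads(existing)
    try:
        result = process(request)   # the actual mutation, runs exactly once
    except Exception:
        r.delete(store_key)         # allow the client to retry after a failure
        raise
    r.set(store_key, json.dumps(result), ex=TTL)
    return result
```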
Circuit breaker — fail fast
Track downstream error rate. When error rate > 50% for 1 minute, open the circuit (immediately fail without calling). After 30 seconds, half-open (try one request). Closes when success rate recovers. Prevents cascade failure during downstream outages.
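A simplified, single-threaded circuit-breaker sketch using the thresholds from the paragraph above; production versions add thread safety, jittered probes, and metrics:

```python
import time

class CircuitBreaker:
    """Fail fast when a downstream is unhealthy; probe periodically to recover."""

    def __init__(self, failure_threshold=0.5, window=60, cooldown=30, min_calls=10):
        self.failure_threshold = failure_threshold
        self.window = window        # seconds over which errors are counted
        self.cooldown = cooldown    # open duration before a half-open probe
        self.min_calls = min_calls  # don't trip on tiny samples
        self.calls: list[tuple[float, bool]] = []  # (timestamp, ok)
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: let one request probe
        try:
            result = fn(*args, **kwargs)
            self._record(now, ok=True)
            return result
        except Exception:
            self._record(now, ok=False)
            self._maybe_open(now)
            raise

    def _record(self, now, ok):
        # Drop samples that have aged out of the window, then append.
        self.calls = [(t, o) for t, o in self.calls if now - t < self.window]
        self.calls.append((now, ok))

    def _maybe_open(self, now):
        if len(self.calls) >= self.min_calls:
            failures = sum(1 for _, ok in self.calls if not ok)
            if failures / len(self.calls) >= self.failure_threshold:
                self.opened_at = now
```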
Bulkhead — failure isolation
Isolate resource pools so a failure in one feature does not exhaust resources for others. Example: separate thread pools for high-priority API calls and low-priority background tasks. The name comes from ship design: watertight bulkheads keep a single hull breach from flooding the entire ship, and the software pattern does the same for shared resources.
Worked Example: Designing a Feed Service from Scratch
Requirements: 200M DAU, 4K posts/sec average (15K peak), 140K timeline reads/sec average (450K peak), P99 timeline read < 200ms, eventual consistency OK (lag ≤ 30s).
Phase 1 — Numbers: read QPS 140K, write QPS 4K, storage 200GB/day = 73TB/year.
Phase 2 — API: POST /posts (body: user_id, content, media_url; response: post_id, created_at). GET /feed?user_id=X&cursor=Y&limit=20 (response: posts[], next_cursor).
Phase 3 — Data model: Post(post_id, user_id, content, media_url, created_at). Follow(follower_id, followee_id, created_at). Timeline(user_id, post_id, created_at) — denormalized, pre-computed.
Phase 4 — 5-layer reference: client → CDN (media) + ALB → API gateway → feed service + post service → Redis (timeline cache) + Cassandra (post store + timeline store) → Kafka (post events) → fan-out workers.
Phase 5 — Sharding + caching: Shard posts by user_id (consistent hashing). Cache top-N user timelines in Redis (24h TTL, refresh-ahead). Cache hit rate target 95%.
Phase 6 — Hardest components:
Fan-out service: on POST, publish to Kafka. Fan-out worker reads, looks up follower list, writes to each follower's timeline shard. For users with < 100K followers, fan-out on write (cheap reads). For celebrities with > 100K followers, hybrid: fan-out on write to active followers, fan-out on read for the rest. Estimated fan-out latency: 100ms P50, 5s P99 for celebrities.
Timeline cache: Redis sorted set per user, scored by created_at. ZREVRANGEBYSCORE for cursor pagination. Cache the most recent 1K posts per active user. On a cache miss, read from the Cassandra timeline shard. Cassandra timeline schema: PRIMARY KEY (user_id, created_at) WITH CLUSTERING ORDER BY (created_at DESC), which gives natural ordering and a single-shard read.
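A sketch of the cursor-paginated timeline read against the Redis sorted set, using redis-py; the Cassandra fallback on a cache miss is omitted, and key names are assumptions:

```python
import redis  # assumption: timelines are pre-computed by the fan-out workers

r = redis.Redis()

def read_timeline(user_id: int, cursor: float | None = None, limit: int = 20):
    """Cursor pagination over a per-user sorted set scored by created_at."""
    key = f"timeline::{user_id}"
    # '(' makes the bound exclusive, so the cursor row itself is not repeated.
    max_score = f"({cursor}" if cursor is not None else "+inf"
    entries = r.zrevrangebyscore(key, max_score, "-inf",
                                 start=0, num=limit, withscores=True)
    post_ids = [pid.decode() for pid, _ in entries]
    next_cursor = entries[-1][1] if len(entries) == limit else None
    return {"posts": post_ids, "next_cursor": next_cursor,
            "has_more": next_cursor is not None}
```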
Failure modes: Kafka outage → posts queue locally with retry; once Kafka recovers, drain. Fan-out worker lag → monitor Kafka consumer lag; auto-scale workers. Cache cluster failure → fall back to Cassandra (pre-provisioned for 5x read load), serve degraded latency until cache recovers.
The full design takes 35 minutes if you run the playbook. Without the playbook, it takes 50 minutes and produces inconsistencies.