Preview — Pro guide
You are seeing a portion of this guide. Sign in and upgrade to unlock the full article, quizzes, and interview answers.
Databases: Sharding, Indexing & Replication
Database engineering for large-scale systems — the most tested HLD sub-topic after caching. Covers B-Tree and LSM-Tree storage engines, indexing strategies (covering, composite, partial), sharding strategies and hotspot handling, replication (sync vs async, leader election), and how to choose between SQL and NoSQL in system design interviews.
Storage Engines — How Databases Actually Store Data
Two dominant storage engine paradigms power most databases in production.
B-Tree (PostgreSQL, MySQL, SQLite) — data organized as a balanced tree of fixed-size pages (4–8 KB each). Pages hold sorted key-value pairs; writes update pages in-place.
- Reads: O(log n) — traverse from root to leaf page
- Writes: O(log n) — locate page, write to WAL first, then update in-place
- Strengths: fast point and range queries, ideal for OLTP read/write balance
- Weaknesses: random writes cause write amplification and disk seeks on HDDs
LSM-Tree (Cassandra, RocksDB, LevelDB, ClickHouse) — Log-Structured Merge tree. All writes land in an in-memory MemTable, flushed to immutable SSTables on disk when full. A background compaction process merges SSTables, purging duplicates and tombstones.
- Reads: check MemTable + all SSTables; Bloom filters eliminate most unnecessary lookups
- Writes: always sequential — no random I/O → extremely high write throughput
- Strengths: write-optimized, perfect for time-series, logs, and append-heavy workloads
- Weaknesses: read amplification — multiple SSTables must be checked per read
B-Tree vs LSM-Tree at a Glance
| Aspect | B-Tree (PostgreSQL / MySQL) | LSM-Tree (Cassandra / RocksDB) |
|---|---|---|
| Write pattern | Random in-place page writes | Sequential writes (MemTable → SSTables) |
| Read pattern | O(log n) single lookup | MemTable + multiple SSTables (mitigated by Bloom filters) |
| Write amplification | Low–Medium (WAL + page update) | High (data written multiple times through compaction levels) |
| Read amplification | Low (1–2 disk reads typically) | Moderate (several SSTables checked per read) |
| Space amplification | Low–Medium (dead rows via MVCC, vacuumed) | Moderate (compaction lag leaves duplicates temporarily) |
| Best workload | OLTP: balanced read/write, complex queries | Write-heavy: time-series, logs, IoT, large-scale append |
| Examples | PostgreSQL, MySQL, SQLite, Oracle | Cassandra, ScyllaDB, RocksDB, LevelDB, ClickHouse (parts) |