Design Google Docs (Real-Time Collaborative Editing)

System design deep-dive for real-time collaborative editing at 1B+ scale. Covers OT vs CRDTs, why Google Docs uses OT with a centralized server, WebSocket sync, offline editing with Yjs, and the tombstone problem that limits CRDT scalability.

Collaborative Editing, Operational Transformation, CRDTs, Google Docs Architecture, Real-Time Sync, WebSocket, Yjs Automerge, RGA LSEQ Logoot, Conflict Resolution, Offline Editing, ShareDB, Concurrent Edit Convergence

Why Collaborative Editing Is a Distributed Systems Problem

Collaborative editing looks deceptively simple: multiple users type into the same document. The hard part is concurrent edit convergence. When two users edit the same paragraph simultaneously, both edits must reach both clients and be reconciled so that every client produces the same final document. The naive solutions fail in opposite ways: last-write-wins silently loses edits, and pessimistic locking serializes users to single-writer throughput, which defeats the purpose of collaboration.

The correct abstraction requires a formalism that guarantees two properties: convergence (all replicas that have seen the same set of operations end up with the same document state) and intention preservation (applying an operation produces the result the user intended, not a corrupted version). Getting both properties simultaneously under concurrent, out-of-order operation delivery is the core research problem.

Two families of solutions have emerged over 30 years: Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs). Google Docs uses OT. Notion and many newer systems use Yjs (a CRDT library). The wrong answer in an interview is picking one without explaining why and what the tradeoffs are.

At Google Docs scale (over 1 billion documents, millions of simultaneous collaborative sessions, and roughly 30 ops/min, or 0.5 ops/s, per active document) the system processes hundreds of thousands of operations per second globally, with a WebSocket latency target under 100 ms from keystroke to peer delivery. Every architectural decision is downstream of concurrent edit correctness, not throughput.
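To make the convergence requirement concrete, here is a minimal sketch of the core OT idea for insert-only text. The names `Insert`, `apply`, `transform`, and the `site` tie-breaker are illustrative assumptions, not taken from Google Docs or any library, and real systems must also handle deletes, formatting, and composition of operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int   # index in the document where the text is inserted
    text: str  # text to insert
    site: int  # hypothetical client id, used only to break ties


def apply(doc: str, op: Insert) -> str:
    """Apply an insert to a document string."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has already been applied.

    If `against` landed before `op` (or at the same index with a lower site id),
    `op` must shift right by the length of the text that was inserted.
    """
    lands_first = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if lands_first:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


base = "Hello world"
a = Insert(0, "Oh, ", site=1)  # Alice prepends a phrase
b = Insert(5, ",", site=2)     # Bob adds a comma after "Hello"

# Naive approach: each replica applies the remote op unchanged, so they diverge.
naive_alice = apply(apply(base, a), b)  # "Oh, H,ello world"
naive_bob = apply(apply(base, b), a)    # "Oh, Hello, world"

# OT approach: each replica transforms the remote op against its own local op.
ot_alice = apply(apply(base, a), transform(b, a))  # "Oh, Hello, world"
ot_bob = apply(apply(base, b), transform(a, b))    # "Oh, Hello, world"
```

In the naive run, Alice's replica puts Bob's comma in the middle of "Hello" because index 5 no longer means what Bob intended. The transformed run restores both convergence (identical strings) and intention preservation (the comma follows "Hello").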

IMPORTANT

The Interviewer's Actual Test

Interviewers at Google, Meta, and Notion use this question in senior and staff loops to test distributed systems correctness intuition, not just system design. The trap: most candidates jump to 'use CRDTs, they're distributed-first'. The correct answer is more nuanced. Google Docs uses OT with a central server (published in the 2010 Google Wave paper). The reason is that OT is far easier to make correct when every operation passes through a single serialization point that imposes a total order; without that total order, OT's convergence guarantees are much harder to establish. CRDTs solve a different problem, decentralization, at the cost of tombstone overhead and algorithm complexity. Know which production systems use which, and know why.
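The role of the central serialization point can be sketched as follows. This is a simplified illustration under stated assumptions, not Google's implementation: the `Server` class, its `submit` method, and the revision numbering are hypothetical, operations are insert-only, and client-side buffering of unacknowledged operations is omitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Insert:
    pos: int   # index where the text is inserted
    text: str
    site: int  # hypothetical client id, used only to break ties


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, against: Insert) -> Insert:
    """Shift `op` right if `against` was inserted before it."""
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


@dataclass
class Server:
    """Hypothetical central server: the log index is the total order."""
    doc: str = ""
    log: list = field(default_factory=list)  # committed ops, in commit order

    def submit(self, op: Insert, base_rev: int) -> tuple[int, Insert]:
        # `base_rev` is the number of committed ops the client had seen
        # when it produced `op`. Everything after that is concurrent.
        for committed in self.log[base_rev:]:
            op = transform(op, committed)
        self.doc = apply(self.doc, op)
        self.log.append(op)
        return len(self.log), op  # new revision and the op as committed


server = Server(doc="Hello world")

# Both clients edited revision 0; the server receives Alice's op first.
rev_a, committed_a = server.submit(Insert(0, "Oh, ", site=1), base_rev=0)
rev_b, committed_b = server.submit(Insert(5, ",", site=2), base_rev=0)

# Bob's op was rebased past Alice's 4-character insert: position 5 became 9.
# server.doc is now "Oh, Hello, world"
```

Because every operation is rebased onto one authoritative log, each client only has to reconcile its own pending edits against a single ordered stream from the server, rather than against arbitrary interleavings from every peer.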
