
Preview — Pro guide

You are seeing a portion of this guide. Sign in and upgrade to unlock the full article, quizzes, and interview answers.

Design a File Storage and Sync System (Dropbox / Google Drive)

End-to-end system design for a Dropbox-scale file storage and sync platform serving 500M users and 2.5EB of data. Covers content-addressable chunking, metadata-blob separation, the sync protocol, shared-folder consistency, version history, and the garbage-collection edge cases that interviewers target.

60 min read · 3 sections · 1 interview question
Dropbox Design · Google Drive · Content Addressable Storage · Chunking · SHA-256 · Deduplication · S3 · Sharding · MySQL · Spanner · Cassandra · Sync Protocol · Bloom Filter · Garbage Collection · LSM Tree

The Problem Is Not Storage — It Is Sync

Candidates anchor on "store files in S3" and treat this as an object-storage question. That misses the interview. S3 solves durability (11 nines) and capacity (effectively infinite) as a commodity primitive.

The real engineering challenge in a Dropbox-class system is sync: propagating a user's edits from their laptop to their phone and to a collaborator's machine within seconds, with bounded bandwidth, across intermittent networks, while handling concurrent edits, partial failures, and 500M users contending for a shared metadata backend.

A second challenge is economics: at 500M users × 5GB average = 2.5EB of user data, paying raw S3 rates ($0.023/GB-month ≈ $57.5M/month) is unaffordable without aggressive deduplication.

The two problems share a solution: content-addressable chunking. Splitting files into chunks addressed by their SHA-256 hash gives you dedup across users (the same file is stored once), resumable uploads (re-upload only the missing chunks), and efficient sync (push only the changed chunks), all from the same primitive.
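A minimal sketch of the content-addressable primitive described above: the chunk size, the in-memory `dict` standing in for the blob store, and the class/function names are illustrative assumptions, not a real implementation.

```python
import hashlib

def chunk_file(data: bytes, chunk_size: int):
    """Split a byte stream into fixed-size chunks, each addressed by its SHA-256 digest."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk

class ChunkStore:
    """Toy content-addressable store: a chunk's hash is its key, so identical
    chunks -- within one file or across users -- are stored exactly once."""

    def __init__(self, chunk_size: int = 4 * 1024 * 1024):  # 4 MiB is an assumed size
        self.chunk_size = chunk_size
        self.blobs: dict[str, bytes] = {}  # stand-in for the blob store (e.g. S3)

    def put(self, data: bytes) -> list[str]:
        """Upload a file; returns the ordered chunk-hash manifest that the
        metadata layer would persist alongside path, version, owner, etc."""
        manifest = []
        for digest, chunk in chunk_file(data, self.chunk_size):
            self.blobs.setdefault(digest, chunk)  # dedup: skip chunks already held
            manifest.append(digest)
        return manifest

    def get(self, manifest: list[str]) -> bytes:
        """Reassemble a file from its manifest of chunk hashes."""
        return b"".join(self.blobs[d] for d in manifest)
```

Note how all three benefits fall out of one structure: dedup is `setdefault`, a resumable upload is `put` re-run against the surviving `blobs`, and sync is diffing two manifests and fetching only the hashes the client lacks.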

IMPORTANT

What Interviewers Are Testing

This is a systems design classic because it's dense with tradeoffs — metadata vs blobs, strong vs eventual consistency, push vs pull sync, dedup vs privacy. Senior signals:

  • Chunking strategy — do you propose fixed-size or content-defined (Rabin) chunks, and can you explain why content-defined is better for files with insertions?
  • Metadata/blob separation — do you put metadata and blobs in different stores, and do you understand why conflating them kills you?
  • Sync protocol — do you reach for long-poll/WebSocket or do you propose naive polling?
  • Shared folder consistency — do you handle the cross-shard problem, and do you know when Spanner-class strong consistency is worth the cost?
  • Garbage collection — do you understand that ref-counted chunk deletion has race conditions, and do you propose a safe mark-and-sweep or epoch-based approach?

The single biggest failure is treating the problem as "upload to S3" and missing the sync protocol entirely. The second biggest is proposing a single RDBMS for metadata at 1B-file scale without discussing sharding.
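The garbage-collection bullet above deserves one concrete illustration of a safe scheme. This is a simplified single-process model of an epoch-based sweep (real systems need distributed coordination); the class and method names are invented for the sketch. The race it guards against: a sweeper sees refcount zero while a concurrent upload is about to reuse the same chunk hash.

```python
class ChunkGC:
    """Epoch-based sweep: a chunk is collected only if it is unreferenced AND
    has not been touched since before the current epoch began, so an upload
    that claims a chunk mid-sweep is never deleted out from under it."""

    def __init__(self):
        self.epoch = 0
        self.refcount: dict[str, int] = {}
        self.touched: dict[str, int] = {}   # digest -> epoch of last claim

    def claim(self, digest: str):
        """Upload path: record the claim BEFORE the metadata write commits."""
        self.touched[digest] = self.epoch
        self.refcount[digest] = self.refcount.get(digest, 0) + 1

    def unclaim(self, digest: str):
        self.refcount[digest] -= 1

    def sweep(self) -> list[str]:
        """Collect chunks that were already unreferenced before this epoch;
        anything touched during the current epoch gets a grace period."""
        sweep_epoch = self.epoch
        self.epoch += 1
        dead = [d for d, rc in self.refcount.items()
                if rc == 0 and self.touched.get(d, 0) < sweep_epoch]
        for d in dead:
            del self.refcount[d]
            del self.touched[d]
        return dead
```

A naive ref-counted delete would reclaim the chunk the instant the count hits zero; the epoch check is what turns that race into a bounded grace period, the same intuition behind mark-and-sweep with a tombstone window.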
