Design Search Autocomplete / Typeahead (Google-scale)
System design deep-dive for a search typeahead service delivering sub-100ms top-K suggestions at 10B queries/day, using sharded tries with precomputed top-K at each node, multi-tier caching, real-time trending pipelines, and personalization.
Why Typeahead Is a Latency-Obsessed Read System
Search autocomplete is the inverse of a payment system: availability and latency dominate correctness. A stale suggestion is a mild annoyance; a 300ms suggestion is a product failure, because users type faster than that and the UI feels laggy. Google's public latency bar for autocomplete is sub-100ms p99 end-to-end, including network, and the service itself aims for <30ms server-side. At 10B queries/day (~115K QPS average, ~500K peak), this latency budget changes which data structures are even candidates: no disk reads, no network hops between services, no database lookups per keystroke. The system has to return top-K suggestions for any prefix from tens of millions of candidate queries, and it must do so while also handling personalization, trending, spell correction, and multi-language support. The core insight senior candidates must surface: the entire working set lives in memory, the top-K is precomputed at every prefix node, and the trie is rebuilt offline, never mutated online. Candidates who propose 'query Elasticsearch on every keystroke' fail immediately; at 500K QPS, Elasticsearch would need thousands of nodes and still miss the latency bar. Candidates who say 'use a trie' without addressing how to update it without downtime, how to shard it, and how to attach top-K at each node give an incomplete answer. The interview is testing whether you can hold in your head that typeahead is a read-only hot path with an asynchronous rebuild architecture.
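To make the core concrete, here is a minimal sketch of a trie whose nodes carry precomputed top-K lists, assuming an in-memory build from a query-to-popularity map; TrieNode, build_trie, and suggest are illustrative names, not a production API.

```python
import heapq

K = 10  # suggestions returned per prefix

class TrieNode:
    __slots__ = ("children", "top_k")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        # Filled at build time: the K highest-scoring full queries
        # that pass through this prefix. The hot path never ranks
        # anything; it just returns this list.
        self.top_k = []

def build_trie(query_scores):
    """Offline build from {full query: popularity score}."""
    root = TrieNode()
    for query, score in query_scores.items():
        node = root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k.append((score, query))
    # Trim every node's candidate list down to its true top-K.
    stack = [root]
    while stack:
        node = stack.pop()
        node.top_k = heapq.nlargest(K, node.top_k)
        stack.extend(node.children.values())
    return root

def suggest(root, prefix):
    """Online read path: walk the prefix, return the precomputed list."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    return [q for _, q in node.top_k]

root = build_trie({"how to tie a tie": 90, "how to train a dog": 70,
                   "howl's moving castle": 40})
print(suggest(root, "how t"))  # ['how to tie a tie', 'how to train a dog']
```

A production build would compute each node's top-K bottom-up by merging its children's lists rather than appending every query to every ancestor, but the read path is the same either way: one pointer walk per typed character and zero ranking work.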
Clarifying Questions to Ask First
What is the search corpus?
Web search (Google-style, open-ended queries), product search (Amazon, bounded catalog), internal search (Slack messages), or navigational (URL/app switcher)? The corpus size determines whether a pure trie fits in memory or a hybrid with an inverted index is needed.
What is the scale?
Queries per day, distinct queries, suggestions per prefix? Default assumption: 10B queries/day, 100M distinct queries, top-10 per prefix, p99 < 100ms end-to-end.
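A quick back-of-envelope pass over these defaults is worth doing aloud in the interview; in the sketch below, the 30-byte average query length and the ~4x peak factor are assumptions for illustration.

```python
# Back-of-envelope check on the default assumptions.
queries_per_day = 10_000_000_000
avg_qps = queries_per_day / 86_400        # ~115,740 QPS average
peak_qps = avg_qps * 4.3                  # ~500K QPS, assuming a ~4x peak factor

distinct_queries = 100_000_000
avg_query_bytes = 30                      # assumed average query length
raw_corpus_gb = distinct_queries * avg_query_bytes / 1e9  # ~3 GB of raw strings

print(f"avg QPS  ~ {avg_qps:,.0f}")
print(f"peak QPS ~ {peak_qps:,.0f}")
print(f"raw corpus ~ {raw_corpus_gb:.1f} GB before trie overhead")
```

The ~3 GB of raw query strings is what makes the 'entire working set lives in memory' claim plausible, even after trie pointer overhead inflates it several-fold.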
Personalized or global?
Global-only is simpler. Per-user personalization requires a second layer and user-keyed storage. A typical mix: 7 global + 3 personal.
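A sketch of one plausible merge policy for that mix, assuming ranked global and personal lists are already available; the function name and the personal-first ordering are assumptions, since real products tune this blend.

```python
def merge_suggestions(global_top, personal_top, n_personal=3, total=10):
    """Blend: up to n_personal personal entries first, then fill the
    remaining slots from the global list, deduplicating so a query
    never appears twice in the dropdown."""
    merged = list(personal_top[:n_personal])
    seen = set(merged)
    for q in global_top:
        if len(merged) >= total:
            break
        if q not in seen:
            merged.append(q)
            seen.add(q)
    return merged

print(merge_suggestions(
    global_top=["weather", "wordle", "world cup", "walmart"],
    personal_top=["weather seattle", "wework careers"],
))
# ['weather seattle', 'wework careers', 'weather', 'wordle', 'world cup', 'walmart']
```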
Multi-language and multi-region?
Each language needs its own trie (tokenization, stopwords, and normalization differ). Regional variance matters: 'covid' in the US vs India in 2020 had different trending candidates.
Trending / real-time?
Does the system reflect 'trending now' queries (breaking news, viral topics) with sub-minute freshness, or is daily/hourly rebuild acceptable? Real-time trending is a separate pipeline.
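One way to reason about the freshness requirement is a sliding window of per-query counters, sketched below. This toy in-process version stands in for the streaming pipeline a real deployment would use (events flowing through a log into approximate counters); all names here are illustrative.

```python
import time
from collections import Counter, deque

class TrendingCounter:
    """Toy sliding-window counter: six 10-second windows cover the
    last minute; the deque's maxlen silently drops the oldest window."""

    def __init__(self, window_seconds=10, num_windows=6):
        self.window_seconds = window_seconds
        self.windows = deque([Counter()], maxlen=num_windows)
        self.last_roll = time.monotonic()

    def record(self, query):
        now = time.monotonic()
        # Open new windows as time passes; old ones age out.
        while now - self.last_roll >= self.window_seconds:
            self.windows.append(Counter())
            self.last_roll += self.window_seconds
        self.windows[-1][query] += 1

    def top(self, k=10):
        total = Counter()
        for window in self.windows:
            total.update(window)
        return [q for q, _ in total.most_common(k)]
```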
Spell correction?
Fuzzy matching for typos (Levenshtein distance <= 2 is typical). This adds a separate fallback path built on a BK-tree or SymSpell.
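A minimal sketch of that fallback, assuming a bounded Levenshtein check over a candidate dictionary; a linear scan is shown for clarity, whereas production would index candidates in a BK-tree or a SymSpell deletion table to avoid scanning.

```python
def levenshtein(a, b, max_dist=2):
    """Bounded edit distance: returns min(true distance, max_dist + 1).
    The early exits matter; the fallback only asks 'within 2 edits?',
    never the exact distance."""
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        if min(curr) > max_dist:
            return max_dist + 1  # later rows can only grow; give up early
        prev = curr
    return prev[-1]

def fuzzy_fallback(typed, dictionary, max_dist=2):
    """Invoked only when the exact-prefix trie walk returns nothing."""
    return [w for w in dictionary if levenshtein(typed, w, max_dist) <= max_dist]

print(fuzzy_fallback("restarant", ["restaurant", "restart", "resonant"]))
# ['restaurant', 'restart']
```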
Safety and privacy?
Suggestions that leak PII (a user's own past query shown to others) or violate policy (adult terms, self-harm) are a launch blocker. The blocklist is applied at query time, not just at ingest.
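A sketch of that query-time filter, assuming a normalized in-memory blocklist set; the blocklist contents and the normalize step are placeholders.

```python
BLOCKLIST = {"example banned query"}  # placeholder; real lists are policy-managed

def normalize(query):
    """Placeholder normalization; real systems also fold unicode, etc."""
    return " ".join(query.lower().split())

def safe_suggestions(suggestions, blocklist=BLOCKLIST):
    """Runs on the read path, after ranking: even if a banned query
    slipped into the offline-built trie, it never reaches the user.
    Ingest-only filtering misses policy changes made between rebuilds."""
    return [s for s in suggestions if normalize(s) not in blocklist]
```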