Design a Web Crawler at Google Scale
End-to-end system design for a distributed web crawler at Google/Bing scale: 15B+ pages, ~6B refreshed daily. Covers URL frontier design, OPIC-based priority scoring, politeness enforcement, SimHash near-duplicate detection, distributed sharding, and the freshness-vs-depth tradeoffs that most resources skip entirely.
Why Web Crawlers Are Harder Than They Look
A web crawler looks deceptively simple: fetch a page, extract links, repeat. At Google's scale (15B+ pages crawled, ~6B refreshed daily), three hard constraints make this one of the most intricate distributed systems ever built: politeness (you cannot hit any server faster than it can handle), deduplication (the web is massively redundant; studies put near-duplicates at roughly 30% of pages), and freshness (a breaking-news article is stale in 15 minutes, while a legal FAQ is fine refreshed monthly). These three constraints are in fundamental tension, and every design decision in a crawler traces back to resolving that tension.

Interviewers testing senior/staff candidates on this topic are checking whether you understand that a web crawler is not an embarrassingly parallel problem; it is a scheduling problem. The naive 'spin up 10,000 fetch workers and let them run' approach will get you IP-banned by every major CDN within hours, will crawl the same near-duplicate content thousands of times, and will waste 80% of your crawl budget on low-value pages. Google's Caffeine pipeline (launched in 2010, described by Singhal and team) replaced batch crawl-refresh with a continuous pipeline that prioritizes by freshness signal, a fundamental architectural departure from MapReduce-era batch crawlers like Apache Nutch.

The second counterintuitive insight: storage is not the bottleneck. At roughly $20/TB per month on S3-class storage, keeping 15B pages of compressed HTML costs on the order of $3M/year. The real bottlenecks are network bandwidth (fetching at scale), DNS (resolving hostnames for ~6B fetches daily), and CPU (parsing, dedup fingerprinting, and link extraction). Any design that opens with 'how do we store all these pages' is solving the wrong problem.
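A quick back-of-envelope check makes the storage point concrete (a sketch in Python; the per-page compressed size and the $/TB-month rate are illustrative assumptions, not published Google figures):

```python
# Back-of-envelope: storage is cheap relative to fetch/parse costs.
# Assumed inputs -- average compressed page size and S3-class pricing
# are illustrative, not published Google numbers.
PAGES = 15e9                    # corpus size
AVG_COMPRESSED_BYTES = 800e3    # ~800 KB per page, assumed
PRICE_PER_TB_MONTH = 20.0       # ~S3 Standard ballpark, USD

total_tb = PAGES * AVG_COMPRESSED_BYTES / 1e12
annual_cost = total_tb * PRICE_PER_TB_MONTH * 12
print(f"{total_tb / 1e3:.1f} PB -> ${annual_cost / 1e6:.1f}M/year")
# 12.0 PB -> $2.9M/year, consistent with the rough $3M figure above.
```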
What Interviewers Are Actually Testing
Senior candidates are expected to know the three-layer architecture (frontier, fetcher, processor) and why each layer exists. Staff candidates are expected to explain OPIC scoring, SimHash threshold selection, politeness queue design with backpressure, and how consistent hashing shards the frontier across crawler nodes. The single most common failure mode is treating crawl deduplication as 'just use a Bloom filter on URLs'. URL deduplication is necessary but not sufficient: you also need content deduplication (two different URLs serving identical HTML), which requires SimHash or MinHash. A candidate who misses this does not pass the senior bar at Bing or Google.
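For content deduplication, a minimal SimHash sketch looks like the following (assumptions: whitespace tokenization, MD5 as the token hash, and frequency weights; production systems use shingling and IDF-style term weights, and Manku et al.'s 2007 paper reports a Hamming threshold of about 3 bits on 64-bit fingerprints):

```python
import hashlib
from collections import Counter

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash: similar documents yield fingerprints that differ
    in only a few bit positions (small Hamming distance)."""
    v = [0] * bits
    # Assumption: whitespace tokens weighted by frequency; production
    # systems typically use shingles and IDF-style weights instead.
    for token, weight in Counter(text.lower().split()).items():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

fp_a = simhash("breaking news markets rallied sharply on strong earnings today")
fp_b = simhash("breaking news markets rallied sharply on earnings today")
print(hamming(fp_a, fp_b))  # small for near-duplicates; ~32 of 64 for unrelated pages
```

A Bloom filter answers 'have I seen this exact URL?'; SimHash answers 'have I seen content close to this?', which is why both layers are needed.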
Clarifying Questions to Ask Before Designing
Web-scale or focused domain?
Web-scale (Google/Bing: 15B+ pages, all TLDs) vs. a focused domain crawler (Amazon product catalog, news aggregator, legal database). Scope changes the frontier design: domain-focused crawlers can use per-site rules and human curation, while web-scale crawlers must automate prioritization entirely via signals like PageRank and click data.
What freshness SLOs are required?
News publishers expect pages indexed within 15 minutes of publication (Google's Caffeine target). E-commerce: 1–24 hours for price/stock changes. Static content: a 30-day refresh is acceptable. The answer determines whether you need a continuous priority queue or a batch refresh cycle.
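One common way to turn SLOs like these into a continuous schedule is a min-heap keyed by next-due crawl time (a sketch; the class names and intervals restate the SLOs above, and real schedulers also estimate per-URL change rates rather than using fixed classes):

```python
import heapq
import time

# Target refresh intervals per content class, in seconds (from the SLOs above).
REFRESH_INTERVAL = {
    "news": 15 * 60,         # ~15 minutes
    "ecommerce": 6 * 3600,   # within the 1-24 h band; 6 h as a midpoint
    "static": 30 * 86400,    # ~30 days
}

def next_crawl_time(last_crawled: float, content_class: str) -> float:
    return last_crawled + REFRESH_INTERVAL[content_class]

# A min-heap ordered by next-due time behaves as a continuous refresh queue.
now = time.time()
frontier = []
heapq.heappush(frontier, (next_crawl_time(now, "news"), "https://example.com/live"))
heapq.heappush(frontier, (next_crawl_time(now, "static"), "https://example.com/faq"))
due_at, url = heapq.heappop(frontier)  # the news URL comes due first
```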
Is JavaScript rendering required?
Roughly 60–70% of the modern web is JS-rendered (SPAs). Without a headless browser (Chromium via Puppeteer or the Chrome DevTools Protocol), you miss a majority of page content. JS rendering costs ~10–20x more CPU per page than a raw HTML fetch, so you must decide upfront which URL classes warrant it.
Robots.txt and crawl-delay compliance?
Assume yes: disrespecting robots.txt is both legally and reputationally risky. The Crawl-delay directive in robots.txt can range from 1s to 60s per domain. A crawler that ignores it will be IP-banned and may face legal action under the CFAA (Computer Fraud and Abuse Act).
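Python's standard library already handles the parsing and the Crawl-delay lookup (a sketch; the user agent string and the 1-second fallback are placeholder choices):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses robots.txt

UA = "MyCrawler/1.0"  # placeholder user agent
if rp.can_fetch(UA, "https://example.com/products/123"):
    # Honor Crawl-delay if the site sets one; otherwise fall back
    # to a polite default interval between requests to this host.
    delay = rp.crawl_delay(UA) or 1.0
    # ... schedule the fetch no sooner than `delay` seconds after
    # the previous request to this host ...
```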
Multi-language and international support?
Unicode normalization (NFC vs. NFD) affects URL deduplication: the same URL encoded two different ways is still the same page. Internationalized Domain Names (IDNs) require Punycode normalization. Crawling Baidu's index space or .cn/.jp TLDs adds encoding complexity (GB18030, Shift-JIS) that naive UTF-8 parsers fail on.
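A minimal canonicalization pass using only the standard library might look like this (a sketch: the built-in idna codec implements IDNA 2003, while production crawlers typically use an IDNA 2008 library; ports, userinfo, and fragments are dropped here for brevity):

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    # NFC-normalize the path so composed and decomposed forms compare equal.
    path = unicodedata.normalize("NFC", parts.path)
    # Punycode-encode the hostname (built-in codec implements IDNA 2003).
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

print(canonicalize("https://MÜNCHEN.example/caf\u00e9"))    # composed é
print(canonicalize("https://münchen.example/cafe\u0301"))   # e + combining accent
# Both print https://xn--mnchen-3ya.example/café -- one frontier entry, not two.
```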
What is the crawl budget?
Google allocates a crawl budget per host as a proxy for server capacity and site importance. Small sites get 100–500 pages/day; large, high-PageRank sites get effectively unlimited crawling. This budget is the constraint that drives priority scoring; without it, you would follow every link indefinitely.
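Per-host budget enforcement is commonly implemented as a token bucket (a sketch; the example budgets restate the ranges above and are not published quotas):

```python
import time

class HostBudget:
    """Token bucket: refill at pages_per_day rate, spend one token per fetch."""
    def __init__(self, pages_per_day: float):
        self.rate = pages_per_day / 86400.0  # tokens per second
        self.tokens = pages_per_day          # start with a full day's budget
        self.cap = pages_per_day
        self.last = time.monotonic()

    def try_fetch(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.cap, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over budget: requeue the URL for a later cycle

budgets = {
    "small-blog.example": HostBudget(200),       # low-importance host
    "news-site.example": HostBudget(5_000_000),  # high-PageRank host
}
```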