ML System Design: Visual Search at Billion Scale
Design Pinterest's visual search system end-to-end — from contrastive learning with hard negative mining to billion-scale ANN retrieval with HNSW, ScaNN, and DiskANN. Covers the multi-stage retrieval-ranking funnel, index update strategies for live catalogs, latency budget engineering, and the failure modes that production systems hit. Includes company comparisons across Pinterest, Google Lens, and Amazon.
Why Visual Search Is Different From Image Retrieval
Visual search is the ML system design problem that separates candidates who understand retrieval from those who don't. It appears simple on the surface — take an image, find similar images — but the actual engineering challenge is substantially harder.
Image retrieval finds exact or near-exact duplicates. The query and corpus share the same domain, camera, and often the same scene. It is handled with perceptual hashing (pHash, dHash) or simple CNN feature matching; by roughly 2015 this was a solved problem.
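To make the contrast concrete, here is a minimal difference-hash (dHash) sketch. It assumes the image has already been resized to a 9x8 grayscale grid (real implementations use a library like Pillow for the resize step); two images are near-duplicates when their hashes are within a small Hamming distance.

```python
def dhash(pixels):
    """Difference hash: compare each pixel to its right-hand neighbor.

    `pixels` is an 8-row x 9-column grayscale grid, pre-resized from
    the source image. Each comparison yields one bit, so the result
    is a 64-bit integer fingerprint.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(h1, h2):
    """Count of differing bits; a small distance means near-duplicate."""
    return bin(h1 ^ h2).count("1")
```

This is exactly why perceptual hashing fails for visual search: a sofa photographed in a living room and the same sofa in a studio shot share almost no pixel-gradient structure, so their hashes are far apart even though the items match semantically.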
Visual search finds semantically similar items across domain shifts. A photo of a red sofa taken in a user's living room must match a studio product image of a similar sofa with a white background. A street-style photo of a floral dress must match the exact product across retailer catalogs shot with different cameras, in different lighting, from different angles.
The model must learn which visual features are semantically relevant (style, shape, color, texture) and which are artifacts of capture conditions (background, lighting, camera angle). This distinction forces specific architecture and training decisions.
The scale is extreme. Pinterest serves 300 million monthly users with visual search over a catalog of billions of Pins. Google Lens processes over 12 billion visual queries monthly. Amazon's "Search by Image" drives a double-digit percentage of mobile product discovery. These are not toy systems.
Two failure modes dominate interview answers on this topic:
- Candidates who understand the ML (contrastive learning, CLIP) but cannot design the serving layer
- Candidates who understand the system (vector databases, ANN) but cannot speak to the training pipeline
Neither profile passes a senior FAANG interview. Here is what you actually need to know.
What Interviewers Are Evaluating at Each Level
Mid-level: Can you describe the two-tower architecture for visual embedding? Do you know what approximate nearest neighbor search is and why you need it? Can you name FAISS as a concrete tool?
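At the mid-level bar, it helps to be able to sketch what ANN search is approximating. Exact nearest-neighbor over normalized embeddings is a matrix multiply plus a top-k; the numpy sketch below (with illustrative shapes) is the brute-force baseline that flat indexes compute exactly and that graph or quantization indexes approximate at billion scale.

```python
import numpy as np

def search(corpus: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact top-k retrieval by cosine similarity.

    corpus: (N, d) matrix of item embeddings; query: (d,) vector.
    Cost is O(N*d) per query -- fine for a million items, infeasible
    at a billion, which is why approximate indexes (HNSW, IVF-PQ) exist.
    """
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = corpus @ query                  # inner product == cosine here
    topk = np.argpartition(-scores, k)[:k]   # O(N) partial selection
    return topk[np.argsort(-scores[topk])]   # sort only the k winners
```

FAISS's flat inner-product index performs this same exact computation (heavily optimized); being able to name that baseline and say why it breaks at scale is the point of the question.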
Senior-level: Can you explain the contrastive loss and the progression from random to hard negatives? Can you give a latency budget broken down by stage? Do you know when to use HNSW vs ScaNN vs DiskANN based on dataset size? Do you address the index update problem for live catalogs?
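The contrastive loss in question is typically InfoNCE with in-batch negatives. A minimal numpy version (the temperature value is an illustrative default, not prescribed by the guide) makes the negative-sampling progression easy to explain: random negatives are whatever else landed in the batch, while hard negative mining replaces them with near-miss items retrieved from the index.

```python
import numpy as np

def info_nce(query_emb, pos_emb, temperature=0.07):
    """InfoNCE loss with in-batch negatives.

    query_emb, pos_emb: (B, d) L2-normalized embeddings where row i of
    each matrix forms a positive pair. Every other row in the batch
    acts as a negative for row i.
    """
    logits = (query_emb @ pos_emb.T) / temperature   # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # NLL of the true pairs
```

The loss drops as each query's positive outscores the in-batch negatives; once random negatives become trivially easy, training signal collapses, which is the motivation for mining hard negatives offline.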
Staff-level: Do you design the offline hard negative mining pipeline end-to-end? Can you explain the delta index + merge pattern for real-time catalog updates? Do you quantify embedding dimension tradeoffs? Can you describe composed image retrieval and why vanilla CLIP falls short for fine-grained fashion or product search?
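The delta index + merge pattern mentioned above can be sketched as a thin wrapper: a large immutable base index rebuilt offline, plus a small mutable delta for freshly ingested items, with queries fanned out to both and results merged by score. The class and method names here are illustrative, not any particular vector database's API.

```python
import heapq

class DeltaIndexSearcher:
    """Base index (rebuilt offline) + delta index (items added since).

    Queries hit both and merge by score. A periodic job folds the
    delta into the next base rebuild and clears it, keeping the
    brute-force delta scan cheap.
    """
    def __init__(self, base_search):
        self.base_search = base_search  # fn(query, k) -> [(score, item_id)]
        self.delta = {}                 # item_id -> embedding

    def add(self, item_id, embedding):
        self.delta[item_id] = embedding  # searchable in minutes, not after a nightly rebuild

    def search(self, query, k):
        delta_hits = [(dot(query, emb), item_id)
                      for item_id, emb in self.delta.items()]
        return heapq.nlargest(k, self.base_search(query, k) + delta_hits)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

The staff-level follow-up is the merge policy: how often the delta is folded in, how deletions are tombstoned, and how score calibration is kept consistent across the two indexes.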
Clarifying Questions — Ask These First
What catalog type and scale?
E-commerce products (structured, studio images, bounded catalog) vs general images (unstructured, any content, unbounded corpus) require different architectures. For products: 1B items, 200ms latency. For Pinterest-style: 900M+ Pins including user-generated content.
What type of query?
Pure image query (snap-to-shop) vs image + text ("show me this, but in blue") vs a text query that shares the same embedding space. Multi-modal query fusion is a significant architectural decision.
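A common baseline for image + text fusion is a weighted sum of the two modality embeddings in a shared CLIP-style space; stronger systems learn the fusion with a small combiner network. The weight below is an illustrative hyperparameter, not a recommended value.

```python
import numpy as np

def fuse_query(image_emb, text_emb, alpha=0.65):
    """Weighted-sum fusion of image and text query embeddings.

    Both inputs must live in the same (CLIP-style) embedding space.
    alpha balances 'this sofa' (image) against 'but in blue' (text).
    """
    fused = alpha * _norm(image_emb) + (1 - alpha) * _norm(text_emb)
    return _norm(fused)  # re-normalize so cosine retrieval still works

def _norm(v):
    return v / np.linalg.norm(v)
```

The weakness of this baseline, and the motivation for composed image retrieval models, is that a linear blend cannot express modifications like "same dress, longer sleeves" that require conditioning one modality on the other.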
What does 'similar' mean?
Exact product match (same SKU from different angles) vs style similarity (same aesthetic category) vs complementary items (items that go well together). These require different training objectives and different negative sampling strategies.
Latency and freshness requirements?
Under 200ms end-to-end is standard. For live inventory (flash sales, restocks), new items must appear in search within minutes — not after a nightly rebuild. This determines index update architecture.
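A stage-by-stage breakdown makes the 200ms number defensible in an interview. The figures below are illustrative assumptions, not measurements from any production system; the exercise is showing that the stages sum inside the SLO with headroom for tail latency.

```python
# Illustrative end-to-end budget for a 200ms SLO (all numbers are assumptions).
BUDGET_MS = {
    "upload + decode + resize": 40,
    "embedding model forward pass": 35,
    "ANN retrieval (top ~1000 candidates)": 30,
    "ranking model (re-score candidates)": 45,
    "business logic + response assembly": 20,
    "network + serialization overhead": 25,
}

def check_budget(budget, slo_ms=200):
    """Sum the stage budgets and fail loudly if they exceed the SLO."""
    total = sum(budget.values())
    assert total <= slo_ms, f"over budget: {total}ms > {slo_ms}ms"
    return total
```

The useful follow-up discussion is where to spend any remaining slack: a larger candidate set from ANN retrieval versus a heavier ranking model is the classic trade.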
What's the business model?
Driving purchase conversion (Amazon) vs driving engagement and discovery (Pinterest) vs general image understanding (Google Lens). The business model shapes what you optimize the ranking stage for.