
ML System Design: Visual Search at Billion Scale

Design Pinterest's visual search system end-to-end — from contrastive learning with hard negative mining to billion-scale ANN retrieval with HNSW, ScaNN, and DiskANN. Covers the multi-stage retrieval-ranking funnel, index update strategies for live catalogs, latency budget engineering, and the failure modes that production systems hit. Includes company comparisons across Pinterest, Google Lens, and Amazon.

60 min read · 4 sections · 1 interview question

Visual Search · Contrastive Learning · CLIP · SigLIP · Two-Tower Networks · ANN · FAISS · ScaNN · HNSW · DiskANN · Hard Negative Mining · Image Retrieval · Multi-Modal · Cold Start · Index Updates

Why Visual Search Is Different From Image Retrieval

Visual search is the ML system design problem that separates candidates who understand retrieval from those who don't. It appears simple on the surface — take an image, find similar images — but the actual engineering challenge is substantially harder.

Image retrieval finds exact or near-exact duplicates. The query and corpus share the same domain, camera, and often the same scene. It is solved with perceptual hashing (pHash, dHash) or simple CNN feature matching; this was already a solved problem by 2015.
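To make concrete why duplicate detection is the easy case, here is a minimal difference-hash (dHash) sketch. It is NumPy-only, with a crude nearest-neighbor downsample standing in for a proper image-resize library; the key property is that two captures of the same scene hash within a few bits of each other, which is exactly the property that breaks under the domain shifts visual search must handle.

```python
import numpy as np

def dhash(gray: np.ndarray, hash_size: int = 8) -> int:
    """Difference hash: downsample to hash_size x (hash_size + 1) grayscale,
    then encode whether each pixel is brighter than its right neighbor."""
    h, w = gray.shape
    # crude nearest-neighbor resize (a real pipeline would use PIL/OpenCV)
    rows = (np.arange(hash_size) * h) // hash_size
    cols = (np.arange(hash_size + 1) * w) // (hash_size + 1)
    small = gray[np.ix_(rows, cols)]
    bits = small[:, 1:] > small[:, :-1]      # hash_size x hash_size booleans
    return int("".join("1" if b else "0" for b in bits.flatten()), 2)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

img = np.random.RandomState(0).rand(100, 120)
# a uniform brightness shift leaves every neighbor comparison unchanged,
# so the hash is identical: distance 0
assert hamming(dhash(img), dhash(img + 0.01)) == 0
```

Because the hash encodes only local brightness gradients, global lighting changes are free, but any real domain shift (new background, new angle, different product photo of the same sofa) produces a completely different hash. That is the gap visual search has to close with learned embeddings.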

Visual search finds semantically similar items across domain shifts. A photo of a red sofa taken in a user's living room must match a studio product image of a similar sofa with a white background. A street-style photo of a floral dress must match the exact product across retailer catalogs shot with different cameras, in different lighting, from different angles.

The model must learn which visual features are semantically relevant (style, shape, color, texture) and which are artifacts of capture conditions (background, lighting, camera angle). This distinction forces specific architecture and training decisions.
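The standard way to teach a model this distinction is in-batch contrastive learning. Below is a NumPy sketch of a one-directional InfoNCE loss (a simplified form of what CLIP-style two-tower models optimize; the symmetric version adds the reverse direction). Each query's positive is the matching row of the other tower's output, and every other row in the batch serves as a negative:

```python
import numpy as np

def info_nce(query_emb, pos_emb, temperature=0.07):
    """In-batch contrastive (InfoNCE) loss: row i of pos_emb is the
    positive for row i of query_emb; all other rows act as negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature                    # B x B cosine sims
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # diagonal = matched pairs

rng = np.random.default_rng(0)
anchors = rng.normal(size=(16, 64))
# positives that differ only by small noise (capture-condition artifacts)
aligned_loss = info_nce(anchors, anchors + 0.01 * rng.normal(size=(16, 64)))
# positives with no semantic relationship to their queries
random_loss = info_nce(anchors, rng.normal(size=(16, 64)))
assert aligned_loss < random_loss
```

Minimizing this loss forces the two towers to map semantically matching pairs close together while pushing apart everything else in the batch, which is precisely how the model learns to ignore capture artifacts and keep style, shape, color, and texture.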

The scale is extreme. Pinterest serves 300 million monthly users with visual search over a catalog of billions of Pins. Google Lens processes over 12 billion visual queries monthly. Amazon's "Search by Image" drives a double-digit percentage of mobile product discovery. These are not toy systems.

Two failure modes dominate interview answers on this topic:

  • Candidates who understand the ML (contrastive learning, CLIP) but cannot design the serving layer
  • Candidates who understand the system (vector databases, ANN) but cannot speak to the training pipeline

Neither profile passes a senior FAANG interview. Here is what you actually need to know.

TIP

What Interviewers Are Evaluating at Each Level

Mid-level: Can you describe the two-tower architecture for visual embedding? Do you know what approximate nearest neighbor search is and why you need it? Can you name FAISS as a concrete tool?

Senior-level: Can you explain the contrastive loss and the progression from random to hard negatives? Can you give a latency budget broken down by stage? Do you know when to use HNSW vs ScaNN vs DiskANN based on dataset size? Do you address the index update problem for live catalogs?

Staff-level: Do you design the offline hard negative mining pipeline end-to-end? Can you explain the delta index + merge pattern for real-time catalog updates? Do you quantify embedding dimension tradeoffs? Can you describe composed image retrieval and why vanilla CLIP falls short for fine-grained fashion or product search?
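To ground the serving side that the mid-level questions probe, here is a toy two-tower retrieval sketch in NumPy. The exact inner-product scan in `retrieve` is the single step that FAISS, HNSW, ScaNN, or DiskANN replace with sublinear approximate search at billion scale; the surrounding structure (offline catalog embedding, L2 normalization so inner product equals cosine similarity, online query embedding) carries over unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, corpus_size = 128, 10_000

# Offline: the catalog tower embeds every item; L2-normalize so that
# inner product == cosine similarity.
catalog = rng.normal(size=(corpus_size, dim)).astype(np.float32)
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)

def retrieve(query_vec, k=100):
    """Exact top-k by inner product. In production, this scan is the part
    an ANN index (FAISS/HNSW/ScaNN/DiskANN) replaces with sublinear search."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = catalog @ q
    top = np.argpartition(-scores, k)[:k]             # unordered top-k
    return top[np.argsort(-scores[top])]              # sort the top-k

# Online: the query tower embeds the user photo, simulated here as a
# noisy copy of catalog item 42; the true item should rank first.
query = catalog[42] + 0.05 * rng.normal(size=dim)
assert retrieve(query)[0] == 42
```

The dataset-size question from the senior-level list maps onto this same scan: HNSW-style graphs keep the whole index in RAM, ScaNN-style quantization trades a little recall for throughput, and DiskANN-style designs spill the graph to SSD once the vectors no longer fit in memory.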

Clarifying Questions — Ask These First

01

What catalog type and scale?

E-commerce products (structured, studio images, bounded catalog) vs general images (unstructured, any content, unbounded corpus) require different architectures. For products: 1B items, 200ms latency. For Pinterest-style: 900M+ Pins including user-generated content.

02

What type of query?

Pure image query (snap-to-shop) vs image + text ("show me this but in blue") vs a text query that shares the same embedding space. Multi-modal query fusion is a significant architectural decision.

03

What does 'similar' mean?

Exact product match (same SKU from different angles) vs style similarity (same aesthetic category) vs complementary items (items that go well together). These require different training objectives and different negative sampling strategies.

04

Latency and freshness requirements?

Under 200ms end-to-end is standard. For live inventory (flash sales, restocks), new items must appear in search within minutes — not after a nightly rebuild. This determines index update architecture.

05

What's the business model?

Driving purchase conversion (Amazon) vs driving engagement and discovery (Pinterest) vs general image understanding (Google Lens). The business model shapes what you optimize the ranking stage for.
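The freshness requirement in question 04 is what the delta index + merge pattern addresses. The sketch below is a toy illustration (class and method names are illustrative, and brute-force scans stand in for real ANN indexes): a large immutable base index rebuilt periodically, plus a small mutable delta for items added since the last rebuild. Queries fan out to both and merge by score, so a flash-sale item is searchable within seconds instead of after a nightly rebuild:

```python
import numpy as np

class DeltaIndex:
    """Toy delta index + merge: an immutable base plus a small mutable
    delta for new items; search fans out to both and merges by score."""

    def __init__(self, base_vectors, base_ids):
        self.base = base_vectors              # stand-in for a frozen ANN index
        self.base_ids = list(base_ids)
        self.delta, self.delta_ids = [], []   # items added since last rebuild

    def add(self, item_id, vec):
        self.delta.append(vec)
        self.delta_ids.append(item_id)

    def search(self, q, k=5):
        ids = self.base_ids + self.delta_ids
        vecs = np.vstack([self.base, np.array(self.delta)]) if self.delta else self.base
        scores = vecs @ q                     # exact scan stands in for ANN
        top = np.argsort(-scores)[:k]
        return [ids[i] for i in top]

    def merge(self):
        """Periodic rebuild: fold the delta into a new immutable base."""
        if self.delta:
            self.base = np.vstack([self.base, np.array(self.delta)])
            self.base_ids += self.delta_ids
            self.delta, self.delta_ids = [], []

rng = np.random.default_rng(2)
idx = DeltaIndex(rng.normal(size=(1000, 32)), [f"sku-{i}" for i in range(1000)])
new_vec = rng.normal(size=32)
idx.add("sku-new", new_vec)                   # restocked item, no rebuild yet
assert idx.search(new_vec)[0] == "sku-new"    # visible immediately
idx.merge()                                   # later: fold into the base
assert idx.search(new_vec)[0] == "sku-new"    # still visible after rebuild
```

The real-system costs this toy hides are the interesting interview material: the delta must stay small enough to scan exactly, merges must not stall serving, and deletions (sold-out items) need a tombstone set checked at query time.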
