
Embeddings & Vector Databases: ANN Search at Scale

How embeddings power search, recommendations, and retrieval — and how to build the index that serves them at millisecond latency. Covers HNSW vs IVF+PQ, tuning M and ef parameters, billion-scale architecture, and when to use Pinecone vs FAISS vs pgvector.


What Embeddings Are and Why They Enable Scale

An embedding is a fixed-length dense vector (typically 64–1536 dimensions) that encodes the semantic meaning of an entity. Items that are semantically similar have embeddings that are close together in vector space (measured by cosine similarity or dot product).

The transformative property for large-scale ML: precomputation. A user's embedding can be computed once per request (or cached for minutes). An item's embedding can be computed offline for all items and stored. Finding the 100 most relevant items for a user is then equivalent to finding the 100 nearest item vectors to the user's query vector — the Approximate Nearest Neighbor (ANN) problem.

Without embeddings: scoring 100M items per user request requires 100M model forward passes. At 0.1ms each, that's 2.8 hours per request. Infeasible.

With embeddings + ANN index: compute one user embedding (1ms), search the ANN index (10–30ms) → retrieve top-1000 candidates. Then run an expensive reranker on 1000 items (50ms). Total: ~80ms. Feasible at YouTube scale.

This is the fundamental insight behind every large-scale recommendation, search, and retrieval system.
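
To make the nearest-neighbor formulation concrete, here is a minimal numpy sketch with made-up 64-dim embeddings. Once vectors are L2-normalized, cosine similarity is just a dot product, and retrieval reduces to a top-k over the scores, which is exactly what an ANN index approximates.

cosine_topk.py
import numpy as np

# Toy data: 100K hypothetical item embeddings and one user embedding.
rng = np.random.default_rng(0)
item_embs = rng.standard_normal((100_000, 64)).astype("float32")
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)  # L2-normalize rows

user_emb = rng.standard_normal(64).astype("float32")
user_emb /= np.linalg.norm(user_emb)

# For normalized vectors, cosine similarity is the dot product.
scores = item_embs @ user_emb
top_100 = np.argsort(-scores)[:100]  # exact top-k; an ANN index approximates this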

Decision Framework: Choosing Your ANN Index and Vector Database

01

What is your catalog size?

  • < 1M vectors: brute-force exact search (FAISS Flat index) is fast enough and avoids ANN approximation errors.
  • 1M–500M vectors: HNSW (FAISS HNSW or Qdrant/Weaviate) — high recall (>95%), sub-50ms query.
  • > 500M vectors: IVF+PQ — quantize vectors to 8–32 bytes each, cluster into 10K–100K centroids. Accepts ~5% recall drop in exchange for 10–50× memory reduction.

02

What is your latency budget?

  • < 5ms: HNSW in-memory on a machine with the entire index loaded in RAM.
  • 10–50ms: IVF+PQ with an appropriate nprobe value (higher nprobe = better recall, higher latency).
  • > 50ms: acceptable for batch retrieval (offline candidate generation), not for real-time serving.

03

How often do you add new items?

HNSW supports incremental inserts with no performance degradation — add new items without rebuilding. IVF requires periodic full re-indexing (daily rebuild is typical) because new vectors assigned to existing centroids can degrade recall over time. If your catalog changes >10% per day, prefer HNSW for freshness.

04

Do you need filtered search?

If queries require metadata filters ('retrieve most similar items that are also in-stock and in category=shoes'), pre-filtering (filter first, then ANN) causes poor recall if the filter is selective. Post-filtering (ANN then filter) requires retrieving K×(1/filter_selectivity) candidates to ensure K results survive the filter. Managed vector databases (Weaviate, Pinecone, Qdrant) handle this with ACORN or segment-based filtering. FAISS does not.
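
The over-fetch arithmetic is worth internalizing. A short sketch, using a hypothetical overfetch_k helper (the 1.5× safety margin is an assumption; real selectivity estimates are noisy):

post_filter_overfetch.py
import math

def overfetch_k(k: int, filter_selectivity: float, margin: float = 1.5) -> int:
    """How many ANN candidates to fetch so that ~k survive a post-filter.

    filter_selectivity: estimated fraction of the catalog passing the filter.
    margin: safety factor on top of the expected-value calculation.
    """
    return math.ceil(k * margin / filter_selectivity)

# If 2% of the catalog is in-stock shoes, getting 100 survivors means
# fetching ~7,500 ANN candidates before filtering.
print(overfetch_k(100, 0.02))  # 7500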

05

Do you need multi-modal or multi-vector search?

If your embeddings combine text + image (product queries), you need a vector database that supports multi-vector per document (Weaviate's ColBERT support) or hybrid dense + sparse (BM25 + embedding) search. FAISS is pure dense vector — no hybrid support. Qdrant and Weaviate support hybrid out of the box.

NOTE

Embedding Models in Production

Embeddings are produced by neural encoders trained on task-specific data:

Item embeddings (recommendations): Two-tower model's item tower. Input: item features (title, category, tags, image features). Output: 256-dim vector.

User embeddings (recommendations): Two-tower model's user tower. Input: interaction history, demographics. Output: 256-dim vector matching item embedding space.

Text embeddings (search): BERT-style models (sentence-transformers, text-embedding-3-large). Output: 768–1536-dim vector. All documents in the corpus are embedded offline. Query is embedded at search time.

Image embeddings (visual search): CNN or ViT encoder. ResNet-50 produces 2048-dim features, compressed to 128–256-dim via a projection head trained with contrastive loss.
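
As a concrete sketch of the offline/online split for text search, assuming the sentence-transformers library and its all-MiniLM-L6-v2 model (384-dim output; any encoder with the same interface works):

embed_corpus.py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim text encoder

# Offline: embed every document once; store the vectors in the ANN index.
docs = ["how to tune hnsw parameters", "product quantization explained"]
doc_embs = model.encode(docs, normalize_embeddings=True)  # shape (2, 384)

# Online: embed only the query at search time, then ANN-search the stored vectors.
query_emb = model.encode(["hnsw ef_search tuning"], normalize_embeddings=True)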

HNSW — The Default Choice for Sub-Billion Scale

Hierarchical Navigable Small World (HNSW) (Malkov & Yashunin, 2016 — https://arxiv.org/abs/1603.09320) is the algorithm behind Weaviate, Qdrant, Milvus's in-memory index, and pgvector's HNSW index type. It's the right choice when your entire index fits in RAM and you need sub-10ms latency.

HNSW builds a multi-layer graph where each node (vector) connects to its approximate nearest neighbors. Search starts at the top (sparse) layer and greedily navigates toward the query, then descends to lower (denser) layers for precise local search.

The three parameters that matter in production:

M (connections per node per layer): Controls graph density. Higher M = better recall, more memory, slower index build. Production default: M=16 for balanced recall/memory. Use M=32 for high-recall applications (medical/legal search where missing a result has serious consequences). Each node stores M×2 connections as 4-byte integers: at M=16, that's 128 bytes of connection overhead per vector.

ef_construction (search width during index build): Higher values improve index quality at build time but slow construction linearly. Production recommendation: ef_construction=200 for general search, ef_construction=400 for high-recall requirements. Building 10M vectors at ef_construction=200 takes ~2–4 hours on a 32-core machine.

ef_search (search width at query time): The most important parameter for latency/recall trade-off. Can be changed without rebuilding the index. Benchmark: ef_search=50 → ~90% recall at ~5ms; ef_search=200 → ~98% recall at ~15ms (for 10M 768-dim vectors). Always benchmark your specific embedding size and dataset.
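
A minimal benchmarking sketch for that last point, assuming a representative sample of your vectors fits in memory: build a brute-force index for ground truth, then sweep ef_search and measure recall and latency. L2 distance on normalized vectors produces the same ranking as cosine similarity, so the defaults below are fine for cosine embeddings.

ef_search_benchmark.py
import time

import faiss
import numpy as np

def sweep_ef_search(xb: np.ndarray, xq: np.ndarray, k: int = 100):
    """xb: (N, d) float32 L2-normalized database; xq: (n, d) float32 queries."""
    d = xb.shape[1]
    exact = faiss.IndexFlatL2(d)       # brute force = ground truth
    exact.add(xb)
    _, gt = exact.search(xq, k)

    hnsw = faiss.IndexHNSWFlat(d, 16)  # M=16
    hnsw.hnsw.efConstruction = 200
    hnsw.add(xb)

    for ef in (50, 100, 200):
        hnsw.hnsw.efSearch = ef
        t0 = time.perf_counter()
        _, ann = hnsw.search(xq, k)
        ms = (time.perf_counter() - t0) * 1000 / len(xq)
        recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(ann, gt)])
        print(f"ef_search={ef}: recall@{k}={recall:.3f}, {ms:.2f} ms/query")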

Memory requirement: (d × 4 + M × 2 × 4) bytes per vector. For 10M vectors at d=256, M=16: 10M × (1024 + 128) bytes ≈ 11.5GB. Fits on a single machine. For 100M vectors: ~115GB. Starts requiring distributed sharding or moving to IVF+PQ.
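
The same formula as a two-line helper (float32 vector data plus roughly 2×M neighbor ids at 4 bytes each; real indexes add some allocator overhead on top of this estimate):

hnsw_memory.py
def hnsw_memory_gb(n_vectors: int, d: int, M: int = 16) -> float:
    """Approximate HNSW resident size: vector data + ~2*M 4-byte neighbor ids."""
    return n_vectors * (d * 4 + M * 2 * 4) / 1e9

print(hnsw_memory_gb(10_000_000, 256))   # ~11.5 GB: fits on one machine
print(hnsw_memory_gb(100_000_000, 256))  # ~115 GB: shard or move to IVF+PQ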

HNSW vs IVF+PQ — Choosing by Dataset Size

[Diagram: HNSW vs IVF+PQ — choosing an index by dataset size]

IVF + Product Quantization — Billion-Scale with Memory Compression

When dataset size exceeds available RAM, Product Quantization (PQ) compresses vectors dramatically.

How IVF works: k-means clusters the entire vector space into nlist centroids (typically 4,096–65,536). Each vector is assigned to its nearest centroid. At query time, only the nprobe nearest centroids are searched — reducing the search space from N to (nprobe/nlist) × N. With nlist=4,096 and nprobe=64, only ~1.5% of vectors are compared per query (typical).

How PQ compresses vectors: A 1,536-dim vector is split into M_pq=48 sub-vectors of 32 dimensions each. Each sub-vector is quantized to 1 byte (256 possible values, learned via k-means). Result: a 1,536-dim float32 vector (6,144 bytes) is compressed to 48 bytes — a 128× reduction. At 100M vectors: 614GB uncompressed → 4.8GB compressed. Fits easily in RAM.

The trade-off: PQ is lossy compression. Recall@100 typically drops from ~99% (exact) to ~92–96% (IVF+PQ) depending on compression ratio. For most recommendation/search systems, this is acceptable — the ranking stage downstream handles the small precision loss.
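
A few lines verifying the arithmetic from the two paragraphs above:

ivf_pq_arithmetic.py
nlist, nprobe = 4_096, 64
print(f"fraction of vectors scanned ≈ {nprobe / nlist:.2%}")  # ≈ 1.56%

d, m_pq = 1_536, 48       # 48 sub-vectors of 32 dims each
raw_bytes = d * 4         # float32: 6,144 bytes per vector
pq_bytes = m_pq           # 1 byte per sub-vector code: 48 bytes
print(f"compression: {raw_bytes // pq_bytes}x")               # 128x
print(f"100M vectors: {100e6 * raw_bytes / 1e9:.0f} GB -> "
      f"{100e6 * pq_bytes / 1e9:.1f} GB")                     # 614 GB -> 4.8 GB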

FAISS index selection by scale:

  • Under 1M: IVF4096,Flat (nprobe=64)
  • 1M–10M: IVF65536_HNSW32,Flat (nprobe=32)
  • 10M–100M: IVF262144_HNSW32,PQ48x8 (nprobe=64)
  • 100M–1B: IVF1048576_HNSW32,PQ64x8 (nprobe=128)
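
These recipes map directly onto FAISS index_factory strings. A sketch for the 10M–100M tier, assuming 768-dim embeddings (the dimension must be divisible by the PQ sub-vector count, here 48):

index_factory.py
import faiss

d = 768  # must be divisible by 48 for PQ48x8
index = faiss.index_factory(d, "IVF262144_HNSW32,PQ48x8", faiss.METRIC_INNER_PRODUCT)

# As with any IVF index: train() on a large sample, add() the full corpus,
# then tune nprobe at query time against your recall target.
index.nprobe = 64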

FAISS Index Setup — HNSW for In-Memory, IVF+PQ for Scale

faiss_index.py
import faiss
import numpy as np

# --- HNSW: <100M vectors, sub-10ms latency ---
def build_hnsw_index(embeddings: np.ndarray, dim: int = 256) -> faiss.Index:
    """
    Production HNSW config: M=16 for recall/memory balance.
    ef_construction=200: high-quality index build.
    ef_search=100: ~90% recall at ~5ms for 10M 256-dim vectors.
    """
    index = faiss.IndexHNSWFlat(dim, 16)  # M=16
    index.hnsw.efConstruction = 200  # Set BEFORE adding vectors
    faiss.normalize_L2(embeddings)    # Normalize for cosine similarity
    index.add(embeddings)
    index.hnsw.efSearch = 100         # Set AFTER; can be changed at query time
    return index

# --- IVF+PQ: >100M vectors, memory-constrained ---
def build_ivfpq_index(
    embeddings: np.ndarray,
    dim: int = 256,
    nlist: int = 65536,    # 4–16×√N guideline: √(100M)=10K → 40K–160K; 65536 is a power of 2 in range
    m_pq: int = 32,        # PQ sub-vectors; dim must be divisible by m_pq
    bits: int = 8,         # 256 centroids per sub-quantizer
) -> faiss.Index:
    """
    IVF+PQ: 256-dim × 4 bytes = 1024 bytes uncompressed
            → 32 sub-vectors × 1 byte = 32 bytes compressed (32× reduction)
    At 100M vectors: 102GB → 3.2GB. Recall@100 ≈ 92-95%.
    """
    quantizer = faiss.IndexFlatIP(dim)  # Inner product for cosine-normalized vecs
    index = faiss.IndexIVFPQ(
        quantizer, dim, nlist, m_pq, bits, faiss.METRIC_INNER_PRODUCT
    )
    # Training required for IVF+PQ — sample 10-100× nlist vectors
    train_size = min(nlist * 50, len(embeddings))
    train_sample = embeddings[np.random.choice(len(embeddings), train_size, replace=False)]
    index.train(train_sample)
    index.add(embeddings)
    index.nprobe = 64  # Search 64/65536 ≈ ~0.1% of clusters; tune for recall target
    return index

def query_index(index: faiss.Index, query: np.ndarray, k: int = 100):
    """query must be float32 with shape (n_queries, dim)."""
    faiss.normalize_L2(query)
    distances, indices = index.search(query, k)
    return indices, distances

Managed Vector Database Comparison

| Platform | Index Type | Hosting | Hybrid Search | Best For |
|---|---|---|---|---|
| Pinecone | Proprietary (HNSW-based) | Fully managed SaaS | Yes (sparse + dense) | Teams wanting zero ops; serverless pricing scales to zero |
| Weaviate | HNSW (in-memory) | Self-hosted or managed | Yes (BM25 + dense) | When you need text + vector in one system |
| Qdrant | HNSW (+ disk-mapped) | Self-hosted or managed | Yes | High-performance self-hosted; Rust-based, memory efficient |
| Milvus | HNSW + IVF+PQ + DiskANN | Self-hosted | Yes | Billion-scale self-hosted; most complete index support |
| pgvector | HNSW or IVF_FLAT | PostgreSQL extension | No (SQL joins only) | Teams already on PostgreSQL; OLTP + vector in one DB |
| FAISS | All index types | Library (not a server) | No | Custom serving; maximum control; used inside other systems |
⚠ WARNING

The Embedding Freshness Problem

Item embeddings are computed offline and stored in the ANN index. When you add a new item (a new YouTube video, a new product), it has no embedding in the index. The system cannot retrieve it via ANN search — it's invisible to the recommendation system.

This is the cold start problem at the index level (not just the model level). Solutions:

  1. Re-embed and re-index daily: Run the item encoder on all new items, insert into the FAISS index. Downside: new items invisible for up to 24 hours.
  2. Incremental index updates: HNSW supports incremental add(). Add new item embeddings to the index in near-real-time. More complex operationally.
  3. Two-index strategy: Maintain a large HNSW index for established items (weekly rebuild) and a small 'new items' index (daily or hourly rebuild). At query time, search both and merge results (see the sketch after this list). This is what many production systems do.
  4. Content-based fallback: For new items with no embedding, fall back to metadata-based retrieval (same category, same creator, similar title keywords via BM25).
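
A sketch of the merge step in the two-index strategy (option 3 above), assuming both FAISS indexes share one embedding space and that main_ids / fresh_ids are hypothetical arrays mapping each index's row ids back to catalog item ids:

two_index_merge.py
import numpy as np

def search_merged(query, main_index, fresh_index, main_ids, fresh_ids, k=100):
    """query: (1, d) float32, normalized. Scores assumed higher-is-better (IP)."""
    d_main, i_main = main_index.search(query, k)
    d_fresh, i_fresh = fresh_index.search(query, k)
    # FAISS pads results with id -1 when an index holds fewer than k vectors.
    m_keep, f_keep = i_main[0] >= 0, i_fresh[0] >= 0
    scores = np.concatenate([d_main[0][m_keep], d_fresh[0][f_keep]])
    items = np.concatenate([main_ids[i_main[0][m_keep]],
                            fresh_ids[i_fresh[0][f_keep]]])
    order = np.argsort(-scores)[:k]  # merge the two candidate lists by score
    return items[order], scores[order]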

Mention this in interviews — it shows you understand the gap between offline indexing and real-world item freshness requirements.

Sign in to take quiz →