ML System Design
Recommendation, ranking, search, fraud detection — end-to-end ML systems with serving architectures, feature stores, training pipelines, and online/offline evaluation.
A/B Testing for ML Systems: Design, Statistical Rigor & Production Pitfalls
How top ML teams run experiments that actually produce trustworthy conclusions — sample size calculation, randomization units, guard rails, CUPED variance reduction, network effects, and the organizational mistakes that make most A/B tests misleading.
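The sample-size calculation this guide opens with can be sketched with the standard two-sample z-test formula for a proportion metric. A minimal sketch, assuming the usual alpha = 0.05 and power = 0.8 conventions; the baseline rate and effect size below are illustrative, not from the guide:

```python
from statistics import NormalDist

def samples_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Samples needed per variant to detect an absolute lift of
    `mde_abs` over `baseline_rate` with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. ~1.96
    z_beta = NormalDist().inv_cdf(power)            # e.g. ~0.84
    p_bar = baseline_rate + mde_abs / 2             # rate midway under H1
    variance = 2 * p_bar * (1 - p_bar)              # pooled Bernoulli variance
    return int((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2) + 1

# detecting a 1pp lift on a 10% baseline needs ~15k users per variant
n = samples_per_variant(0.10, 0.01)
```

Note the quadratic dependence on the effect size: halving the minimum detectable effect roughly quadruples the required sample, which is why underpowered tests are the most common failure the guide warns about.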
Data Pipelines for ML: Batch, Streaming, and Event Architecture
How production ML data pipelines are actually built — Kafka for event collection, Spark for batch feature engineering, Flink for real-time aggregations, and the architectural decisions that determine whether your model trains on fresh or stale data.
Embeddings & Vector Databases: ANN Search at Scale
How embeddings power search, recommendations, and retrieval — and how to build the index that serves them at millisecond latency. Covers HNSW vs IVF+PQ, tuning M and ef parameters, billion-scale architecture, and when to use Pinecone vs FAISS vs pgvector.
ML Model Evaluation & Production Monitoring: Shadow Mode, A/B Testing & Rollback
Production ML evaluation is fundamentally different from offline evaluation. Covers shadow deployment, champion-challenger A/B testing, canary rollouts, SLO design for ML systems, rollback triggers, and the metrics that reveal model degradation before users notice. The end-to-end playbook for safely deploying and monitoring ML models.
Experiment Tracking & Model Registry: The Version Control for ML
How production ML teams manage the model lifecycle from experiment to production — MLflow vs Weights & Biases, what metadata a model must carry, promotion workflows with gated approvals, model lineage for debugging, and the rollback mechanism that makes safe deployments possible.
Feature Stores: Online/Offline Architecture & Training-Serving Consistency
Deep dive into feature store architecture — the infrastructure every production ML system needs but most candidates can't explain. Covers the two-tier design, point-in-time correct joins, training-serving skew, and how to choose between Feast, Tecton, and cloud-managed options.
ML Pipelines & Orchestration: Airflow, Kubeflow, and CI/CD for Models
How production ML teams automate the full model lifecycle — from data ingestion through training, evaluation, and deployment. Covers Airflow vs Kubeflow Pipelines, containerized training steps, automated model validation gates, and the CI/CD practices that separate mature ML teams from ad-hoc ones.
ML Model Deployment Fundamentals: Shipping Safely in Production
A practical foundation for deploying ML models: packaging, serving topologies, rollout strategies, and post-deploy monitoring. Covers shadow mode, canary releases, drift detection, and rollback design.
Model Serving Architectures: Batch vs Real-Time, Shadow Deployments & Latency Budgets
How to design the serving layer for ML models in production — when to use batch pre-computation vs real-time inference, how to safely deploy new models via shadow and canary patterns, and how to structure a multi-stage serving pipeline within a latency budget.
ML Monitoring & Drift Detection: Keeping Models Healthy in Production
Production ML models fail silently. This guide covers the three-layer monitoring stack (data drift, concept drift, output drift), PSI thresholds, KL divergence, distinguishing data drift from concept drift (they require different fixes), and how to build retraining triggers that aren't noisy.
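The PSI metric named above is simple enough to compute inline. A minimal sketch; the 0.1 (watch) and 0.25 (alert) thresholds are the common industry convention rather than a universal standard, and the bin counts below are illustrative:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected/actual: bin counts over the same bin edges, from the
    reference (training) window and the live (serving) window.
    """
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_frac = max(e / e_total, eps)   # clamp to avoid log(0) on empty bins
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score

# a nearly identical distribution stays well under the 0.1 watch line
stable = psi([100, 200, 300], [101, 198, 301])
```

Each term is nonnegative, so PSI is zero only when the binned distributions match exactly, and every drifted bin adds to the score.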
Offline vs Online Evaluation: Why Metrics Disagree and What to Do About It
The most common ML interview trap: candidates optimize offline metrics but can't explain why they diverge from online results. Covers AUC vs CTR, NDCG vs session length, position bias, novelty effects, counterfactual evaluation, and the right metric for each stage of an ML system.
Distributed Training: Data Parallelism, Model Parallelism, and FSDP
How to scale model training from a single GPU to thousands. Covers data parallelism with Ring AllReduce, model/tensor/pipeline parallelism for LLMs, PyTorch DDP vs FSDP2, and how to choose the right strategy based on model size vs data volume.
GPU Infrastructure for ML Serving: Quantization, Batching & Inference Optimization
The engineering decisions that determine whether your model serves at 10ms or 200ms — GPU selection, quantization (INT8/FP16/FP8), dynamic batching, KV cache management, and when to use Triton vs vLLM vs TensorRT-LLM.
Two-Stage Retrieval & Ranking: The Architecture Behind Every Large-Scale Recommender
The dominant architecture powering Google, YouTube, TikTok, Pinterest, and Spotify — two-tower retrieval followed by multi-stage ranking. Covers the fundamental constraint that makes this necessary, in-batch negatives, hard negative mining, and the full 4-stage production pipeline.
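The in-batch negatives trick mentioned above treats every other item in a training batch as a negative for each (user, item) pair, turning one batch into a full softmax over batch items. A minimal sketch of the loss under that setup; embeddings here are plain lists for clarity, not a production tensor implementation:

```python
import math

def in_batch_softmax_loss(user_emb, item_emb):
    """Mean sampled-softmax loss where row i's positive item is item i
    and every other item in the batch serves as a negative."""
    n = len(user_emb)
    total = 0.0
    for i in range(n):
        # dot-product scores of user i against every item in the batch
        logits = [sum(u * v for u, v in zip(user_emb[i], item_emb[j]))
                  for j in range(n)]
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += log_z - logits[i]   # -log p(positive item | user)
    return total / n
```

When user and item towers agree (matching pairs score highest), the loss approaches zero; mismatched pairs drive it up, which is exactly the gradient signal that pulls the two towers into a shared embedding space.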
Vector Search at Scale: HNSW, IVF-PQ, FAISS, and Production ANN Systems
Approximate Nearest Neighbor (ANN) search is the retrieval backbone of RAG, recommendation systems, semantic search, and visual similarity. Master HNSW graph construction, IVF-PQ compression, FAISS vs Qdrant vs pgvector selection, recall-latency tradeoffs, and hybrid dense+sparse search. Includes production sizing and indexing strategy for 1B+ vector corpora.
ML System Design: Abuse Detection — Account Takeover, Bots, and Velocity Beyond Spam
Design a cross-product abuse platform distinct from content spam — credential stuffing, account takeover (ATO), synthetic accounts, scraping, and collusion rings. Covers device graphs, velocity features in Redis, challenge escalation (CAPTCHA, step-up auth), feedback loops when labels are delayed, and why Meta-style integrity teams separate abuse from policy-violating content classifiers.
MLSD Case Study: Ad Click Prediction at Marketplace Scale
Design a production CTR prediction system for ads at Meta/Google scale. Covers COEC calibration, delayed conversions, feature engineering for sparse+dense signals, multi-stage serving under strict latency budgets, and the failure loops that most interview answers miss.
MLSD Case Study: Churn Prediction with Survival and Uplift Modeling
Design a Spotify/Netflix-style churn prevention system using survival models, causal uplift targeting, and intervention policy optimization. Covers label definition, intervention economics, and production monitoring pitfalls.
MLSD Case Study: Multimodal Content Moderation Systems
Design a TikTok/YouTube/Meta-style content moderation stack with multimodal models, policy-aware inference, human-in-the-loop review, and continuous policy evolution. Covers latency tiers, precision/recall tradeoffs by harm class, and model-policy coupling.
ML System Design: Customer LTV, Survival Modeling, and Uplift for Treatment Targeting
Design a production customer lifetime value (CLV) system for subscription and marketplace businesses — from probabilistic churn (BG/NBD, Gamma-Gamma spend), survival curves with censoring, to uplift modeling for CRM campaigns. Covers why naive regression on historical LTV leaks future information, how Shopify-style merchants use CLV for acquisition bids, and evaluation with calibrated dollar errors plus policy simulation.
ML System Design: Demand Forecasting System
Design the demand forecasting system used by Uber, Lyft, DoorDash, and Amazon — end-to-end. Covers the three hard problems that make this uniquely challenging: spatial ML (demand is correlated across geography, not independent per zone), online learning (marketplace conditions change faster than batch retraining), and the feedback loop where demand forecasts drive pricing which affects actual demand. Includes H3 spatial graphs, temporal GNNs, online adaptation with drift detection, and why WMAPE beats MAPE for imbalanced demand.
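The WMAPE-vs-MAPE point is easy to see in a few lines: MAPE averages per-point percentage errors, so a near-zero-demand zone can dominate it, while WMAPE weights errors by actual volume. Toy numbers below, purely illustrative:

```python
def mape(actual, forecast):
    """Mean absolute percentage error: average of per-point ratios."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def wmape(actual, forecast):
    """Weighted MAPE: total absolute error over total actual volume."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(actual)

# one quiet zone (1 ride/hr) and one busy zone (1000 rides/hr)
actual   = [1, 1000]
forecast = [3, 990]
# mape is dominated by the quiet zone's 200% miss (~100% overall),
# while wmape reflects the marketplace-level miss of 12 / 1001 (~1.2%)
```

This is the imbalance argument in miniature: a forecaster judged on MAPE is pushed to obsess over empty zones, while WMAPE keeps the objective aligned with rides actually served.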
MLSD Case Study: Document Understanding & Enterprise NLP Classification
Production document AI for invoices and contracts: OCR, layout-aware encoders, calibrated per-field extraction, human review routing, and template drift. Covers hybrid rule/ML cascades, per-field F1 under imbalance, and audit-grade logging — the system design interviewers expect beyond flatten-all-text BERT.
ML System Design: Dynamic Pricing & Surge (Marketplace)
Design a production dynamic pricing system like Uber surge or DoorDash peak pricing — end-to-end. Covers the three problems most prep skips: pricing is a closed-loop control problem (price moves both supply and demand), the correct ML framing is contextual bandits or MDPs — not plain regression on historical fares — and guardrails (caps, fairness, emergency optics) that constrain the learned policy. Includes elasticity estimation, exploration budgets, and why naive demand forecasting without supply response mis-prices the marketplace.
ML System Design: E-commerce Recommendation System
Design Amazon/Alibaba-scale product recommendation end-to-end — from session-aware candidate retrieval across billions of products to multi-task neural ranking optimizing for click, add-to-cart, and purchase simultaneously. Covers the exact architecture used in production: real-time session modeling, co-purchase graph retrieval, cold start for new products and new users, Thompson sampling for category discovery, and the unique constraints of e-commerce (inventory, price, return rate, margin). Includes latency budget analysis, training data construction with purchase attribution, and what each level (mid/senior/staff) must cover.
ML System Design: ETA Prediction System
Design the ETA prediction system used by Uber (DeepETA), Lyft, and DoorDash — end-to-end. Covers the two architectural insights that define this problem: the physics-first hybrid approach (ML refines a routing engine, not replaces it) and the online-offline feature split that makes sub-10ms inference possible. Includes H3 geohashing for spatial features, the Linear Transformer trick for latency, quantile regression for uncertainty, and the compounding error problem in multi-stop predictions.
ML System Design: Real-Time Fraud Detection
Design a production fraud detection system used by Stripe, PayPal, and Visa — end-to-end. Covers the three hard problems nobody teaches: extreme class imbalance (typically ~0.1% fraud rate), cost-sensitive learning where FN ≠ FP, and multi-stage inference under 100ms latency. Includes the optimal threshold formula, graph neural networks for fraud rings, adversarial drift, and the suppression bias feedback loop that silently kills deployed fraud models.
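One standard form of the cost-sensitive threshold the blurb refers to: with a calibrated fraud probability p, blocking is optimal when the expected fraud loss p * cost_fn exceeds the expected cost of declining a good transaction, (1 - p) * cost_fp. A minimal sketch; the dollar costs are hypothetical:

```python
def optimal_threshold(cost_fp, cost_fn):
    """Block when p(fraud) exceeds this threshold, assuming calibrated scores.

    Derived from: block iff p * cost_fn > (1 - p) * cost_fp,
    i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

# hypothetical costs: declining a good customer ~ $5 of lost margin,
# missing fraud ~ $95 in chargeback and fees
t = optimal_threshold(5, 95)   # 0.05: block even low-probability fraud
```

The asymmetry is the whole point: because a missed fraud costs far more than a false decline, the optimal threshold sits far below the naive 0.5, and this only works if the model's probabilities are actually calibrated.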
ML System Design: Instagram Feed Ranking System
The second canonical ML system design problem. Design Instagram's personalized feed ranking end-to-end — from multi-source candidate aggregation (friends, follows, recommended) to multi-task neural ranking predicting 10+ user actions simultaneously. Covers the exact architecture Meta uses in production, the value model that combines action predictions into a single score, integrity signals, SEV-driven guardrails, cold start for social graphs, and what each level (mid/senior/staff) must cover to pass.
ML System Design: Job Recommendation (LinkedIn-Style Marketplace)
Design a production job recommendation system for a professional marketplace — end-to-end. Covers cold-start and short-lived job postings, the LinkSAGE/LiGNN pattern (nearline GNN embeddings served as features into a low-latency two-tower ranker, not a full GNN pass on every list load), position bias and delayed labels (apply → hire), eligibility hard-filters, and two-sided evaluation with employer quality guardrails. Citations: LiGNN and LinkSAGE (arXiv 2024) for large-scale job graph learning at LinkedIn.
ML System Design: LLM Serving Systems
Design a production LLM serving system from first principles — covering PagedAttention and KV cache management, continuous batching for 2-5× throughput gains, multi-LoRA serving with S-LoRA and InfiniLoRA, the full RLHF pipeline (SFT → reward model → PPO vs DPO vs GRPO), and cost-per-token engineering. Includes break-even analysis for self-hosting vs cloud, failure mode catalog, and what each interview level must cover.
ML System Design: Music Recommendation System
Design Spotify/Apple Music-scale music recommendation end-to-end — from audio-aware product retrieval to sequential session modeling that captures the unique temporal dynamics of music consumption. Covers the production architectures behind Discover Weekly (batch exploration), Radio (session exploitation), and Release Radar (cold start for new tracks). Deep dives into GRU4Rec and SASRec for session modeling, ACARec for artist-catalog-based cold start, contextual bandits on the homepage, skip-rate debiasing, and the listen-skip paradox. Includes what each level (mid/senior/staff) must cover.
ML System Design: Notification Ranking System
Design the notification ranking system used by LinkedIn, Instagram, and Reddit — end-to-end. Covers the three problems that make this uniquely hard: multi-objective optimization (engagement vs fatigue vs retention), user fatigue modeling with adaptive per-user budgets, and why the budget constraint is more important than the ranking model. Includes Instagram's diversity-aware demotion framework, LinkedIn's Decision Transformer for sequential notification policy, and the suppression feedback loop from sending too many notifications.
ML System Design: Query Understanding — Rewriting, Expansion, Classification, and Spell Correction
Design the query understanding stack behind web search, e-commerce search, and internal enterprise retrieval — tokenization, spelling, intent classification, synonym expansion, PII redaction, and safe query rewriting for vector + lexical hybrid retrieval. Covers how Amazon-style search decomposes the problem into cascaded lightweight models under single-digit millisecond budgets before heavy ranking.
ML System Design: Real-Time Anomaly Detection at Scale
Design a production real-time anomaly detection system for metrics, logs, and business KPIs — end-to-end. Covers the three gaps in most answers: pointwise z-scores miss multivariate failures, unsupervised models cause alert fatigue without severity and incident context, and streaming state (Flink keyed windows) must respect event time and exactly-once semantics for financial or SRE use cases. Includes Isolation Forest + robust baselines, suppression and correlation grouping, and wiring to paging with SLO burn.
MLSD Case Study: End-to-End Recommender System
Design a production recommender stack from candidate generation to ranking, re-ranking, experimentation, and monitoring. Covers retrieval-ranking tradeoffs, feature freshness, exploration, and feedback-loop mitigation.
ML System Design: Real-Time Bidding Optimization
Design a production DSP bidding system under 50ms — covering contextual bandits (UCB vs Thompson Sampling), budget pacing from PID controllers to RL, distributed budget state with token buckets, and ultra-low-latency hot-path engineering. Includes the auction bias problem with IPS correction, bid shading for first-price auctions, and failure mode analysis for production RTB systems.
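The PID-controller pacing named above can be sketched as a loop that nudges a bid multiplier so cumulative spend tracks a target curve. A minimal sketch; the gains and the uniform-pacing framing are illustrative assumptions, not production values:

```python
class PIDPacer:
    """Adjusts a bid multiplier so cumulative spend tracks a target curve."""

    def __init__(self, kp=0.8, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.multiplier = 1.0

    def update(self, target_spend, actual_spend):
        # positive error = underspending -> raise bids; negative -> lower
        error = (target_spend - actual_spend) / max(target_spend, 1e-9)
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.multiplier = max(0.0, 1.0 + self.kp * error
                                   + self.ki * self.integral
                                   + self.kd * derivative)
        return self.multiplier

pacer = PIDPacer()
m = pacer.update(target_spend=100.0, actual_spend=60.0)  # underspending -> m > 1
```

The integral term is what corrects persistent under- or over-delivery, and the derivative term damps oscillation; the guide's RL-based pacing replaces these hand-tuned gains with a learned policy.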
MLSD Case Study: Search Ranking System
Design web/ecommerce search ranking with lexical + vector retrieval, multi-stage ranking, and freshness-aware indexing. Covers query understanding, relevance labels, and online experimentation.
ML System Design: Social Feed Ranking System
Design a production-grade social feed ranking system from scratch — the architecture powering Twitter/X, LinkedIn, Reddit, and Threads. Covers multi-source candidate retrieval (in-network + out-of-network), multi-task value model predicting 10+ user actions, recency engineering, echo chamber feedback loops, counterfactual logging, and the exact latency budget for a <200ms feed load. Includes the open-sourced X (Twitter) algorithm analysis, SimCluster-based out-of-network discovery, and what each level (mid/senior/staff) must cover.
MLSD Case Study: Graph-Aware Spam Detection
Design a Gmail/LinkedIn-style spam detection system combining content models, graph-based abuse signals, and velocity features. Covers adversarial adaptation, streaming detection, and class-specific action policies.
ML System Design: Video Recommendation System
The canonical ML system design problem. Design YouTube's video recommendation engine end-to-end — from billion-scale candidate retrieval to transformer-based multi-task ranking to A/B experimentation. Covers the exact architecture used in production, latency budget breakdowns, negative sampling, feedback loop pathologies, and what each level (mid/senior/staff) must cover to pass.
ML System Design: Visual Search at Billion Scale
Design Pinterest's visual search system end-to-end — from contrastive learning with hard negative mining to billion-scale ANN retrieval with HNSW, ScaNN, and DiskANN. Covers the multi-stage retrieval-ranking funnel, index update strategies for live catalogs, latency budget engineering, and the failure modes that production systems hit. Includes company comparisons across Pinterest, Google Lens, and Amazon.
How to Approach an ML System Design Interview
The mindset, signal management, and time strategy for ML system design interviews. Covers the offline-to-online metric translation, training-serving skew awareness, monitoring mindset, and the failure modes that cause strong ML engineers to underperform on production ML interview loops.
How to Design at MLSD: Blank Whiteboard to Production ML
The mechanical playbook for ML system design interview execution. Covers product-to-ML translation, candidate-ranking funnels, two-tower retrieval, feature store architecture, model serving (Triton, vLLM), monitoring (PSI, drift), and reference designs for feeds, fraud, and search.
ML System Design: 6-Step Framework
The definitive framework for ML system design interviews. Covers all 6 steps with exact timing, what interviewers look for at each step, and how to stand out from other candidates.
ML Fairness and Bias: Metrics, Trade-offs, and Mitigation Strategies
Fairness in ML systems is a first-class engineering problem, not just a policy concern. This guide covers the four main fairness definitions (demographic parity, equalized odds, calibration, individual fairness), their mathematical incompatibility, bias sources across the ML pipeline, and practical mitigation strategies — tested increasingly at Google, Meta, Microsoft, and AI-first companies in senior ML system design rounds.