

How to Design at MLSD: Blank Whiteboard to Production ML

The mechanical playbook for ml system design interview execution. Covers product-to-ML translation, candidate-ranking funnels, two-tower retrieval, feature store architecture, model serving (Triton, vLLM), monitoring (PSI, drift), and reference designs for feeds, fraud, and search.

45 min read · 3 sections · 1 interview question
MLSD Design · Two-Tower · FAISS · LightGBM · Feature Store · A/B Testing · MDE · PSI · Recommendation Systems · Triton Inference Server · vLLM · Negative Sampling · Cold Start · Point-in-Time Joins · DLRM

Design Is a Mechanical Skill — Treat It That Way

Strong MLSD candidates do not invent ML systems from scratch in 45 minutes. They run a playbook: a fixed sequence of patterns and templates that produce defensible designs reliably. The skill is not raw creativity — it is knowing which pattern fits which constraint, and assembling the patterns under time pressure.

This page is the mechanical playbook for ML system design execution. It assumes you've internalized the meta (covered on the companion page) and focuses on what to draw and why.

The rule that organizes everything: the product question forces an ML problem type, the ML problem type forces an objective, the objective forces a model class, the scale forces a serving architecture, and the labels force a monitoring strategy. Constraints chain forward. If you cannot tie a component to a specific constraint upstream, that component should not be in your design.

The second rule: business metric → ML objective → model → infrastructure. Most candidates draw an architecture diagram first and figure out what it optimizes later. The reverse is correct — define the business KPI (revenue per session), translate it to an ML objective (P(click) × P(buy|click) × price as expected revenue), pick the model class that natively optimizes it (multi-task LightGBM with calibrated outputs), then design the infrastructure that serves it at the required scale and latency. This ordering eliminates most of the inconsistencies that show up in sloppy MLSD answers.
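The KPI-to-objective translation above can be sketched in a few lines. This is an illustrative sketch, not a specific library's API: `expected_revenue` and the item fields are hypothetical names, and the probabilities stand in for calibrated model outputs.

```python
# Sketch: turning the business KPI (revenue per session) into a
# per-item ranking score. p_click and p_buy are assumed to be
# calibrated outputs of a multi-task model.
def expected_revenue(p_click: float, p_buy_given_click: float, price: float) -> float:
    """E[revenue] = P(click) * P(buy | click) * price."""
    return p_click * p_buy_given_click * price

items = [
    {"id": "a", "p_click": 0.10, "p_buy": 0.30, "price": 20.0},
    {"id": "b", "p_click": 0.25, "p_buy": 0.05, "price": 50.0},
]
ranked = sorted(
    items,
    key=lambda it: expected_revenue(it["p_click"], it["p_buy"], it["price"]),
    reverse=True,
)
```

Note that calibration matters here: the product of two miscalibrated probabilities compounds the error, which is why the model class must emit calibrated outputs rather than raw ranking scores.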

The 6-Phase MLSD Design Playbook


Phase 1 — Translate product into an ML problem type (5 min)

Pick exactly one: binary classification (fraud, click, conversion), multi-class classification (intent classification, content moderation labels), regression (price, watch time, time-to-event), learning-to-rank (search, recommendations), retrieval/embedding (semantic search, candidate generation), or sequence modeling (next-item, session-aware). Naming the type narrows the model space and signals you've shipped this before.


Phase 2 — Define offline + online metrics (3 min)

Offline: AUC/PR-AUC for classification, NDCG@K/MRR for ranking, RMSE/MAE for regression, recall@K for retrieval. Online: the business KPI (revenue, retention, fraud loss prevented) plus 2-3 guardrails (latency, fairness, content diversity). State the gap explicitly: 'Offline AUC up does not guarantee revenue up — the A/B test will validate.'
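Being able to define the offline metrics precisely is worth practicing. A minimal sketch of NDCG@K and recall@K (hand-rolled for clarity; production code would use a library implementation):

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@K for a ranked list of graded relevance labels.

    DCG discounts each item's relevance by log2(rank + 1);
    NDCG normalizes by the DCG of the ideal ordering.
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-K retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)
```

A perfectly ordered list scores NDCG@K of 1.0; recall@K is the natural metric for the candidate-generation stage, where the only job is to not lose relevant items before ranking.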


Phase 3 — Anchor scale and latency (3 min)

DAU, peak QPS, catalog size, latency budget for the ML slice (typical: 5-50ms), training data volume, label arrival latency, retraining cadence. These numbers force every downstream choice — e.g., 100K QPS with 10ms ML budget rules out CPU transformer inference.
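The "rules out CPU transformer inference" claim is just arithmetic, and doing it out loud is exactly what this phase looks like in the interview. A back-of-envelope sketch with illustrative numbers (the 5 ms/batch-of-100 figure is an assumption, not a benchmark):

```python
# Back-of-envelope: how many ranker inferences per second at peak?
peak_qps = 100_000               # requests per second
candidates_per_request = 500     # items scored by the ranker per request
scores_per_second = peak_qps * candidates_per_request  # 50M scores/s

# Assume a CPU transformer forward pass takes ~5 ms for a batch of 100
# items: one core scores ~20,000 items/s.
items_per_core_per_s = 100 / 0.005

# Thousands of cores for the ranker alone -- before the 10 ms latency
# budget is even considered. GBDT in-process scoring or GPU batching
# is the realistic option at this scale.
cores_needed = scores_per_second / items_per_core_per_s
```

Running the same arithmetic against the latency budget (a 5 ms forward pass consumes half of a 10 ms ML slice before any feature fetch) closes the argument.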


Phase 4 — Lay down the canonical funnel (5 min)

For most user-facing ML systems: candidate generation (10M → 500, recall-oriented, two-tower or heuristic) → ranking (500 → 20, precision-oriented, GBDT or DLRM) → post-processing (diversity, business rules, A/B variant routing). For fraud/moderation: feature extraction → primary model → human-in-the-loop fallback for low-confidence. For pricing/forecasting: batch model training → daily/hourly inference → write to serving DB.
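The user-facing funnel above reduces to a short control-flow sketch. `ann_index.search`, `ranker.predict`, and `rules.apply` are hypothetical stand-ins for a FAISS-style ANN index, a GBDT ranker, and a post-processing layer — the shape of the flow is the point, not the APIs.

```python
# Sketch of the canonical candidate-generation -> ranking -> post-processing
# funnel. All collaborator objects are assumed interfaces, not real libraries.
def recommend(user, ann_index, ranker, rules, n_candidates=500, n_results=20):
    # Stage 1: candidate generation -- recall-oriented, cheap per item.
    candidate_ids = ann_index.search(user.embedding, k=n_candidates)

    # Stage 2: ranking -- precision-oriented, richer features per item.
    scored = [(cid, ranker.predict(user, cid)) for cid in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    # Stage 3: post-processing -- diversity caps, business rules,
    # A/B variant routing would hook in here.
    return rules.apply([cid for cid, _ in scored])[:n_results]
```

The asymmetry is deliberate: stage 1 touches millions of items with a cheap score, stage 2 touches hundreds with an expensive one. Collapsing them into a single model either blows the latency budget or starves the ranker of features.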


Phase 5 — Feature store + training pipeline (10 min)

Feature store with offline (S3 Parquet for training) and online (Redis/DynamoDB for serving) tiers, both populated from shared feature definitions so training and serving never diverge. Point-in-time correct joins for training-set generation. Distributed training (Ray, Spark, Horovod) with experiment tracking (MLflow). Model registry as the gatekeeper between training and deployment.
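"Point-in-time correct" is worth being able to define concretely: for each training label, join the latest feature value computed strictly before the label's event time, never a later one (which would leak the future). A pure-Python sketch of the semantics — real feature stores do this over Parquet at scale, and the data layout here is assumed for illustration:

```python
import bisect

def point_in_time_join(labels, feature_history):
    """Point-in-time correct join.

    labels: list of (entity_id, event_ts, y) training examples.
    feature_history: {entity_id: [(ts, value), ...]} sorted by ts.
    """
    rows = []
    for entity_id, event_ts, y in labels:
        history = feature_history.get(entity_id, [])
        ts_list = [ts for ts, _ in history]
        # Index of the first feature snapshot at or after event_ts;
        # everything before it is safe to use.
        i = bisect.bisect_left(ts_list, event_ts)
        value = history[i - 1][1] if i > 0 else None
        rows.append((entity_id, event_ts, value, y))
    return rows
```

The `None` path matters too: an entity with no feature history before the event is exactly the cold-start case, and the training set should represent it the same way serving will.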


Phase 6 — Serving + monitoring + A/B test (10 min)

Serving: pick the model server (Triton for GPU, TorchServe / TFServing for CPU, vLLM for LLMs, in-process for GBDT). A/B test infrastructure: traffic router, deterministic hash assignment, metric pipeline, statistical test. Monitoring: PSI for feature drift, output distribution shift, delayed-label precision tracking, auto-rollback. Retraining: scheduled + threshold-based + event-based triggers.
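PSI is simple enough to write from memory, and interviewers sometimes ask. A minimal sketch (the 0.2 alert threshold is a common rule of thumb, not a universal constant; bin fractions are assumed to come from the training distribution):

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index over pre-binned distributions.

    PSI = sum over bins of (actual - expected) * ln(actual / expected).
    expected_fracs: bin fractions from the training/reference window.
    actual_fracs:   bin fractions from the live serving window.
    """
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions give PSI of 0; a commonly cited reading is <0.1 stable, 0.1-0.2 moderate shift, >0.2 significant drift worth an alert. Running this per feature on a schedule is the "PSI for feature drift" box in the monitoring design.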
