
ML System Design: Video Recommendation System

The canonical ML system design problem. Design YouTube's video recommendation engine end-to-end — from billion-scale candidate retrieval to transformer-based multi-task ranking to A/B experimentation. Covers the exact architecture used in production, latency budget breakdowns, negative sampling, feedback loop pathologies, and what each level (mid/senior/staff) must cover to pass.

65 min read · 5 sections · 1 interview question

Recommendation Systems · Two-Tower Networks · Multi-Task Learning · Candidate Generation · Feature Store · FAISS · GBDT · Transformer Ranking · Explore-Exploit · Feedback Loops · A/B Testing · Cold Start · Position Bias

Why Video Recommendation Is the Canonical MLSD Problem

Video recommendation sits at the intersection of every hard ML system design challenge simultaneously.

The scale is extreme: YouTube serves 2 billion logged-in users monthly, with 800 million videos in the catalog and over 500 hours of video uploaded every minute. The ML challenge is not just "pick a good model" — it's designing a system that can narrow 800 million candidates to 5–20 recommendations in under 200 milliseconds, while optimizing for a business objective that balances watch time, user satisfaction, creator sustainability, and platform health — objectives that frequently conflict.

This problem is asked at Google, Meta, TikTok/ByteDance, Netflix, Spotify, Amazon, Airbnb, LinkedIn, Pinterest, and Twitter. Variants include:

  • Homepage recommendations
  • "Up Next" while watching
  • Search ranking
  • Ads ranking
  • Feed ranking

The two-stage retrieval-and-ranking architecture covered here transfers directly to all of them.
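To make the shape of that architecture concrete, here is a minimal sketch of the retrieval-then-ranking flow. All of the function names and the toy in-memory stand-ins are illustrative assumptions, not a real API:

```python
# Minimal sketch of the two-stage (plus re-ranking) pipeline. Every name
# here is a toy stand-in; the point is the shape: cheap retrievers narrow
# the catalog, an expensive ranker scores only the survivors.

def retrieve_two_tower(user, k):
    # Stand-in for an ANN lookup over precomputed video embeddings.
    return [f"v{i}" for i in range(k)]

def retrieve_cowatch(current_video, k):
    # Stand-in for a "users who watched X also watched Y" lookup.
    return [f"v{i}" for i in range(k // 2, k // 2 + k)]

def rank(user, candidates):
    # Stand-in for a heavy multi-task ranking model; here a toy score.
    return sorted(candidates, key=lambda v: hash((user, v)) % 1000, reverse=True)

def recommend(user, current_video, top_n=20):
    # Stage 1: candidate generation — each retriever returns a few hundred
    # IDs from the full catalog; the union is the candidate pool.
    candidates = set(retrieve_two_tower(user, k=200))
    candidates |= set(retrieve_cowatch(current_video, k=200))
    # Stage 2: ranking over ~a few hundred candidates, not 800M.
    ranked = rank(user, list(candidates))
    # Stage 3: re-ranking (diversity, policy filters, dedup) would go here.
    return ranked[:top_n]

recs = recommend("user_42", "v17")
```

The same skeleton — multiple cheap retrievers feeding one expensive ranker feeding a business-logic re-ranker — is what transfers across homepage, Up Next, search, ads, and feed variants.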

TIP

What Interviewers Are Evaluating

Mid-level: Can you describe the multi-stage architecture and articulate why each stage exists? Do you know collaborative filtering vs content-based? Can you name standard evaluation metrics?

Senior-level: Can you design the data generation pipeline including negative sampling? Do you understand multi-task learning and why it outperforms single-task? Can you reason about the latency budget? Do you identify feedback loop risks proactively?

Staff-level: Do you treat training data quality as the real bottleneck (not model architecture)? Can you design the eval framework for retrieval quality separately from ranking quality? Do you design for system evolution — what does v1 look like vs v3? Do you reason about creator ecosystem health as a first-class system concern?

Clarifying Questions — Ask These First

01

What surface are we designing for?

Homepage recommendations (cold context) vs Up Next while watching (rich context: current video tells us a lot about intent) vs Search results (explicit query). The architecture differs significantly. For this problem: Up Next recommendations while watching a video.

02

What scale?

~1B DAU, ~800M videos, 200ms latency budget. These numbers constrain every architectural decision — they're the reason we need multi-stage retrieval.
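A quick back-of-envelope check (with an assumed, illustrative per-candidate ranking cost) shows why these numbers force a multi-stage design rather than scoring the full catalog:

```python
# Back-of-envelope: why the ranking model cannot touch the full catalog.
# The 50 µs/candidate figure is an assumption for illustration.

catalog = 800_000_000          # videos
budget_ms = 200                # end-to-end latency budget
ranker_us_per_item = 50        # assumed heavy-ranker cost per candidate

full_scan_ms = catalog * ranker_us_per_item / 1000
print(f"Full-catalog ranking: {full_scan_ms / 1000 / 60:.0f} minutes")
# ~667 minutes, i.e. roughly 11 hours — about five million times over budget.

# Multi-stage: ANN retrieval hands the ranker a small pool instead.
pool = 1_000
staged_ms = pool * ranker_us_per_item / 1000   # ranking cost: 50 ms
```

Even if the ranker were 100x cheaper, a full scan would still miss the budget by orders of magnitude, so retrieval must use sublinear structures (e.g. approximate nearest neighbor indexes) rather than the ranking model.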

03

What's the business model?

Ad-supported (YouTube, TikTok): watch time directly drives ad impressions and revenue. Subscription (Netflix): retention and return visits matter more than raw watch time. The business model shapes the ML objective.

04

Are there creator ecosystem constraints?

YouTube and TikTok must ensure new creators get discovery opportunities or the content supply dries up. This creates a diversity constraint that must be built into the system — not as an afterthought.

05

Any content policy or safety requirements?

Misinformation filtering, age-appropriate content, copyright — these affect the re-ranking layer. Ask if they're in scope.

From Business Objective to ML Objective — The Most Skipped Step

Most candidates jump straight to architecture. The framing is where you demonstrate senior-level thinking. The business objective (maximize YouTube revenue and user retention) must be translated into an ML objective carefully — the wrong translation produces systems that technically optimize well but harm the business.

Option A — Maximize CTR (click-through rate)

Easy to measure and optimize. Problem: drives clickbait. A thumbnail of a shocked person with a misleading title maximizes CTR. Users click, immediately leave (short watch time, low satisfaction), and long-term retention drops. YouTube A/B-tested pure CTR optimization and saw higher CTR but lower return visits. Verdict: bad.

Option B — Maximize total watch time

Better alignment with ad revenue. Problem: it biases toward longer videos regardless of quality, and it aggressively rewards "rabbit holes" — addictive but low-quality content (conspiracy theories, outrage videos) that increases watch time while eroding user satisfaction scores and, eventually, platform trust.

Option C — Maximize quality-adjusted watch time

Combine watch time with quality signals: completion rate, explicit satisfaction signals (survey ratings, likes), and absence of negative signals (immediate back-button, "not interested" clicks). Better aligned with sustainable engagement. This is the standard interview answer.
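One hedged way to operationalize this as a per-impression training label is sketched below. The specific weights and multipliers are illustrative assumptions, not YouTube's actual formula:

```python
# Illustrative quality-adjusted watch-time label. The 0.5/0.5 split, the
# 1.2 like multiplier, and the hard-zero for negative signals are all
# assumed knobs, chosen only to show the shape of the adjustment.

def quality_adjusted_watch_time(watch_seconds, video_seconds,
                                liked, marked_not_interested, quick_bounce):
    # Completion rate rewards finishing the video, not just its raw length.
    completion = min(watch_seconds / max(video_seconds, 1), 1.0)
    label = watch_seconds * (0.5 + 0.5 * completion)
    if liked:
        label *= 1.2            # explicit satisfaction boost
    if marked_not_interested or quick_bounce:
        label = 0.0             # engagement without satisfaction counts as zero
    return label
```

Scaling by completion rate is what counters Option B's long-video bias: a 10-minute video watched to the end now outscores a 2-hour video abandoned a quarter of the way in.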

Option D — Multi-objective (senior/staff level)

Balance watch time, user satisfaction, creator sustainability, and content diversity simultaneously. Implemented as:

  • A multi-task model predicting all four signals
  • A value model at re-ranking that combines predictions with tunable weights
  • Weights tuned offline to maximize long-term business metrics in simulation

This is what production YouTube actually uses.
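A minimal sketch of such a value model is below. The signal names and weights are assumptions for illustration; in practice the weights are the tunable knobs fit offline against long-term metrics:

```python
# Sketch of a re-ranking "value model": a weighted combination of the
# multi-task ranker's predicted signals. All names and weights are assumed.

WEIGHTS = {
    "p_watch_time": 1.0,      # expected (normalized) watch time
    "p_satisfaction": 0.7,    # predicted survey-rating / like probability
    "p_share": 0.3,           # creator-ecosystem proxy signal
    "diversity_bonus": 0.2,   # boost for under-exposed topics or creators
}

def value(predictions: dict) -> float:
    # Missing heads default to 0 so partial predictions still score.
    return sum(WEIGHTS[k] * predictions.get(k, 0.0) for k in WEIGHTS)

candidates = [
    {"p_watch_time": 0.9, "p_satisfaction": 0.2, "p_share": 0.1, "diversity_bonus": 0.0},
    {"p_watch_time": 0.6, "p_satisfaction": 0.8, "p_share": 0.3, "diversity_bonus": 0.5},
]
ranked = sorted(candidates, key=value, reverse=True)
```

Note what the weighting buys you: the second candidate wins despite lower predicted watch time, because satisfaction and diversity pull it ahead — exactly the trade-off a pure watch-time objective cannot express. Retuning means changing weights, not retraining the ranker.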
