ML System Design: Notification Ranking System
Design the notification ranking system used by LinkedIn, Instagram, and Reddit — end-to-end. Covers the three problems that make this uniquely hard: multi-objective optimization (engagement vs fatigue vs retention), user fatigue modeling with adaptive per-user budgets, and why the budget constraint is more important than the ranking model. Includes Instagram's diversity-aware demotion framework, LinkedIn's Decision Transformer for sequential notification policy, and the suppression feedback loop from sending too many notifications.
The Attention Economy Problem — Why Notifications Require Constrained Optimization
Notifications are the highest-bandwidth retention channel in any consumer application. They bypass the app entirely — they interrupt the user wherever they are. Losing this channel is permanent: once a user disables notifications, they almost never re-enable them. That's not a temporary engagement decline. It's the loss of your highest-reach channel for that user, forever.
This makes notification ranking fundamentally different from content recommendation. In video recommendations, the downside of a bad recommendation is a skipped video. In notifications, the downside of a bad sequence of sends is the user permanently closing the channel.
The naive ML answer — "train a CTR model and send notifications with high predicted click probability" — is exactly what creates notification fatigue. An engagement-maximizing model sends too frequently, users get annoyed, they disable notifications, and the platform loses the channel permanently.
Three problems that make this uniquely hard:
- Multi-objective with conflicting signals. Maximizing immediate clicks requires sending notifications the user will click. Maximizing long-term retention requires not sending so many that the user gets annoyed and disables the channel. These objectives conflict: the action that maximizes short-term CTR often damages long-term retention.
- User fatigue has temporal structure. Fatigue is not just "too many notifications." It builds within a session, within a day, and chronically over weeks. A user who received 10 notifications yesterday is more sensitive today. A user who dismissed the last 3 notifications is near the opt-out threshold. The system must model this state.
- The budget is a decision, not a constant. Most interview prep treats the budget as "max N notifications/day per user." Production systems at LinkedIn and Reddit compute a per-user, adaptive budget that varies by engagement history, user segment, and estimated churn risk. The budget model is itself an ML problem.
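To make the adaptive-budget idea concrete, here is a minimal heuristic sketch. Every signal name, multiplier, and default below is an illustrative assumption for discussion, not any platform's actual policy; in production this would be a learned model, not hand-set coefficients.

```python
from dataclasses import dataclass


@dataclass
class UserState:
    """Hypothetical per-user signals a budget model might consume."""
    clicks_last_7d: int   # notifications clicked in the past week
    sends_last_7d: int    # notifications sent in the past week
    dismiss_streak: int   # consecutive recent dismissals
    churn_risk: float     # predicted P(inactive in 30d), in [0, 1]


def daily_budget(u: UserState, base: int = 5, floor: int = 1, cap: int = 10) -> int:
    """Adapt the daily notification budget to observed behavior.

    Engaged users (high click-through on recent sends) earn a larger
    budget; fatigue signals (dismissal streaks) throttle it toward the
    floor. High churn risk modestly raises the budget, since notifications
    are a retention lever for at-risk users.
    """
    ctr = u.clicks_last_7d / max(u.sends_last_7d, 1)
    budget = base * (0.5 + ctr)          # scale with observed engagement
    budget *= 0.8 ** u.dismiss_streak    # exponential fatigue throttle
    budget *= 1.0 + 0.3 * u.churn_risk   # small boost for at-risk users
    return int(min(max(round(budget), floor), cap))
```

The key property to highlight in an interview is the shape, not the constants: the budget is monotone in engagement, decays multiplicatively with fatigue signals, and is clamped to hard floor/cap guardrails.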
This problem is asked at LinkedIn, Meta (Instagram, Facebook), Reddit, Twitter/X, Duolingo, Airbnb, Spotify, Snapchat, and any platform that sends push notifications at scale.
What Interviewers Are Evaluating
Mid-level: Knows that CTR-only optimization causes fatigue. Can describe basic two-tower retrieval for candidates. Knows multi-task learning for engagement. Understands a hard daily volume cap.
Senior-level: Designs per-user adaptive budget (not fixed cap). Models user fatigue state with explicit temporal features. Predicts opt-out probability as a first-class model head with highest loss weight. Applies diversity demotion (Instagram-style multiplicative penalty). Explains nearline architecture — why batching candidates is better than per-event evaluation. Can build a value model combining multiple objectives.
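The multiplicative diversity demotion mentioned above can be sketched in a few lines. The helper name, the use of notification type as the similarity key, and the 0.7 penalty are illustrative assumptions, not Instagram's actual implementation:

```python
from collections import Counter


def demote_for_diversity(candidates, recent_types, penalty=0.7):
    """Multiplicatively demote candidates whose type the user has
    already received recently.

    Each prior occurrence of the same type compounds the penalty, so
    the third "someone liked your post" notification scores far below
    the first.

    candidates:   list of (notification_type, base_score) tuples
    recent_types: types of notifications sent in the lookback window
    """
    seen = Counter(recent_types)
    return [(t, score * penalty ** seen[t]) for t, score in candidates]
```

The multiplicative form matters: a demoted candidate can still win if its base score is high enough, unlike a hard per-type cap, which would suppress it unconditionally.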
Staff-level: Frames as sequential decision problem and discusses RL/Decision Transformer approach. Designs budget model training using A/B experiments as ground truth. Proposes experiment design to isolate volume effects from content quality effects. Identifies the feedback loop in training data (sent notifications → labels → model learns to send more of same). Reasons about org ownership: who sets the opt-out guardrail threshold?
Clarifying Questions — Ask These First
What notification types are in scope?
Push notifications (phone lock screen), in-app badges, email, SMS. Each has different engagement patterns and different opt-out costs. For this design: push notifications — the highest-value, highest-risk channel.
What scale?
LinkedIn: 900M members, ~200M DAU, hundreds of notification candidates per user per day. Reddit: 50M DAU, millions of posts per day. Establish scale upfront — it drives whether nearline (batch-per-user) or real-time (per-event) is feasible.
What is the cost of notification opt-out in business terms?
Platforms must quantify: if a user disables notifications, what is the expected revenue loss? This number drives the opt-out guardrail threshold in A/B tests. If the answer is 'unknown,' that's a staff-level signal: you'd propose a holdout experiment to measure the causal impact of notification volume on 90-day retention.
What are the competing objectives?
Immediate engagement (click, upvote, reply)? Session generation (notification brings user back to app)? User satisfaction (user reports notification was useful)? Long-term retention (user still active 30 days later)? These must be explicitly ranked — they often conflict and the ranking determines value model weights.
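One common way to turn that explicit ranking of objectives into a value model is a weighted combination of the per-objective model heads, with the opt-out head carrying the largest (negative) weight. The weights below are illustrative placeholders; in practice they are tuned against A/B retention metrics:

```python
def notification_value(p_click, p_session, p_useful, p_optout,
                       w_click=1.0, w_session=2.0, w_useful=1.5,
                       w_optout=10.0):
    """Combine objective-specific model heads into one send-value score.

    The opt-out head dominates because losing the channel is the
    costliest outcome: even a modest opt-out probability should sink
    an otherwise clickable notification below the send threshold.
    """
    return (w_click * p_click
            + w_session * p_session
            + w_useful * p_useful
            - w_optout * p_optout)
```

A notification is then sent only if its value exceeds a threshold and the user's budget is not exhausted, which is how the value model and the budget constraint compose.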
Is there a legal constraint on notification frequency?
GDPR (EU), CAN-SPAM, and app store guidelines impose requirements on notification consent and frequency. Some platforms must honor 'quiet hours' per user preference. These become hard constraints in the serving layer.
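Quiet hours are typically enforced as a hard post-ranking gate in the serving layer, so a high-value notification is deferred rather than exempted. A minimal sketch, assuming the user's local time is available; the default 22:00-08:00 window is an illustrative placeholder for a per-user preference:

```python
from datetime import datetime, time


def passes_quiet_hours(user_local_time: datetime,
                       quiet_start: time = time(22, 0),
                       quiet_end: time = time(8, 0)) -> bool:
    """Return True if a push may be delivered now.

    Handles both a same-day quiet window (e.g. 13:00-14:00) and the
    common overnight window that wraps past midnight (e.g. 22:00-08:00).
    """
    t = user_local_time.time()
    if quiet_start <= quiet_end:
        in_quiet = quiet_start <= t < quiet_end   # window within one day
    else:
        in_quiet = t >= quiet_start or t < quiet_end  # wraps past midnight
    return not in_quiet
```

Because this is a legal/policy constraint rather than an objective, it belongs in the serving layer as a filter, never as a soft penalty inside the value model.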