Offline vs Online Evaluation: Why Metrics Disagree and What to Do About It
The most common ML interview trap: candidates optimize offline metrics but can't explain why they diverge from online results. Covers AUC vs CTR, NDCG vs session length, position bias, novelty effects, counterfactual evaluation, and the right metric for each stage of an ML system.
The Fundamental Gap Between Offline and Online Evaluation
The most common failure pattern in production ML: a model improves AUC by 0.02 in offline evaluation, gets deployed, and CTR drops 5%. Or the opposite: offline metrics are flat, but the A/B test shows a significant positive impact on revenue.
Why offline and online metrics diverge:
- Distribution shift: Offline evaluation is on historical data collected by the previous model. The new model will see different distributions because it serves different items. The historical data doesn't reflect what the new model will actually show.
- Position bias: Historical data records which items users clicked, but doesn't record that items in position 1 get ~3× more clicks than position 3, regardless of quality. An offline evaluation dataset contains this bias. A model that memorizes 'items the previous model showed in position 1 tend to get clicked' will score high offline but not improve the actual user experience.
- Feedback loops: The training data was collected by a model that already had its own biases — popular items were shown more, so they got more clicks, so they became stronger training signals. Offline evaluation on this biased data doesn't measure what users genuinely prefer.
- Novelty effect: When users see a new recommendation style, engagement spikes for 1–2 weeks (everything is fresh and novel) then returns to baseline or below. Offline metrics cannot capture this time-varying effect.
- Goodhart's Law: When a metric becomes a target, it ceases to be a good measure. Optimizing NDCG@10 hard enough will produce a model that ranks items with historically high CTR at the top, regardless of whether they satisfy the current user's intent.
Evaluation Strategy — Choosing the Right Metrics for Each System Stage
Identify your system's primary objective and translate it to a measurable metric
The objective must be measurable from logs. 'Maximize user satisfaction' is not measurable. 'Maximize session length without increasing opt-out rate' is measurable (session length from server logs; opt-out from user settings events). Write down the primary metric, the guardrail metrics (what must not regress), and the diagnostic metrics (the ones that help you understand why) before you write a single line of modeling code.
Choose the offline metric that best proxies the online metric
Calibrate your offline metric against historical A/B tests to confirm it's predictive. If your last 5 A/B tests showed that NDCG@10 improvements of >0.5% correlated with CTR improvements in 4 of 5 cases, NDCG@10 is a useful proxy. If NDCG@10 improvements showed no correlation to CTR, do not use it as a go/no-go gate. Use it as a diagnostic only. This calibration exercise is what separates mature ML organizations from those perpetually surprised by offline-online gaps.
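A minimal sketch of that calibration check, assuming a small table of past experiments with hypothetical column names (ndcg10_delta, ctr_delta) and illustrative numbers:

```python
# Does the offline metric's movement predict the online metric's movement across past A/B tests?
import pandas as pd
from scipy.stats import spearmanr

experiments = pd.DataFrame({
    "ndcg10_delta": [0.006, 0.012, -0.003, 0.008, 0.004],   # offline lift per past experiment
    "ctr_delta":    [0.010, 0.021, -0.001, 0.012, -0.002],  # online lift from the matching A/B test
})

rho, p_value = spearmanr(experiments["ndcg10_delta"], experiments["ctr_delta"])
sign_agreement = (experiments["ndcg10_delta"].gt(0) == experiments["ctr_delta"].gt(0)).mean()
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f}), sign agreement = {sign_agreement:.0%}")
```

If the sign agreement is close to a coin flip, treat the offline metric as a diagnostic, not a launch gate.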
Design your offline eval set to match production distribution
The eval set must be constructed from the same query distribution as production. For recommendation: sample eval queries from the same time period, device mix, and user-segment distribution as live traffic. Exclude queries that were over-represented in training due to data biases (e.g., popular items that dominate click logs). Use a temporal holdout (train on days 1–28, evaluate on days 29–30), not a random split — random splitting contaminates the eval set with items that were clicked both before and after the eval timestamp.
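A sketch of the temporal holdout under these constraints, assuming a pandas DataFrame of logged impressions with illustrative column names (ts, device, click):

```python
import pandas as pd

# Hypothetical impression log: one row per (user, item, device, click, ts)
logs = pd.read_parquet("interaction_logs.parquet")

# Temporal holdout: evaluate on the last 2 days, train on everything before the cutoff.
# A random split would leak items that were clicked both before and after the eval window.
cutoff = logs["ts"].max().normalize() - pd.Timedelta(days=2)
train = logs[logs["ts"] < cutoff]
eval_set = logs[logs["ts"] >= cutoff]

# Sanity-check that the eval slice matches the production device mix before trusting it
print(eval_set["device"].value_counts(normalize=True))
```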
Apply position debiasing before computing ranking metrics
Users click items at position 1 roughly three times more often than at position 3, regardless of quality. If your eval set was collected by a ranker that historically put certain items at position 1, those items look 'more relevant' in your eval data. Apply Inverse Propensity Scoring (IPS): weight each click by 1/P(click | position at which the item was shown). This debiases the eval set toward revealed preferences rather than positional artifacts.
Run A/B tests with sufficient duration to detect novelty and fatigue effects
Short A/B tests (1–2 weeks) capture the novelty effect — users engage more with anything new. Long tests (4+ weeks) show the steady-state impact. The rule of thumb: for user-facing ranking changes, run A/B tests for at least 2 full weeks, aligned Monday to Monday (this controls for weekly seasonality). For notification-frequency changes, run for 30 days minimum to capture fatigue effects. Auto-ship based on metrics only after the novelty-effect window has passed.
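Duration is also bounded below by statistical power. A rough sketch with statsmodels, using illustrative baseline CTR, detectable lift, and traffic numbers:

```python
# Minimum users per arm to detect a small relative CTR lift, translated into days of traffic
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05
relative_lift = 0.02                                   # want to detect a +2% relative CTR change
effect = proportion_effectsize(baseline_ctr * (1 + relative_lift), baseline_ctr)

users_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
daily_users_per_arm = 50_000                           # illustrative traffic assumption
print(f"{users_per_arm:,.0f} users per arm ≈ {users_per_arm / daily_users_per_arm:.1f} days")
```

Even when the power calculation says a few days suffice, the novelty and seasonality arguments above still set the minimum calendar duration.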
Offline Metrics — What They Measure and When They Fail
| Metric | Formula | What It Measures | When to Use | Failure Mode |
|---|---|---|---|---|
| AUC-ROC | Area under ROC curve | Ability to rank positives above negatives | Binary classification (fraud, churn) | Insensitive to calibration; only the ordering of scores matters, so absolute probabilities can be far off. Also looks optimistic on heavily imbalanced data |
| AUC-PR | Area under Precision-Recall curve | Performance on imbalanced datasets (positives are rare) | Fraud (1% positive rate), medical diagnosis | Dominated by precision at low recall; can miss recall improvements |
| NDCG@K | DCG@K divided by the ideal DCG@K | Ranking quality at top-K positions | Search, recommendation ranking | Position 1 gets discount 1/log2(2) = 1, position 5 gets 1/log2(6) ≈ 0.39 — heavily top-skewed; misses improvements below K |
| MAP | Mean Average Precision | Precision at all relevant recall points | Information retrieval, multi-label classification | Punishes any relevant item not retrieved; overly strict for recommendation |
| Precision@K | Relevant items in top K / K | Fraction of top-K that are relevant | Search result quality | Doesn't account for ordering within top-K |
| Recall@K | Relevant items in top K / total relevant | Coverage of relevant items | Retrieval stage evaluation | Ignores irrelevant items in the top K |
| Calibration (ECE) | Expected Calibration Error | Whether predicted probabilities match actual rates | When scores are used as probabilities (fraud p, bid price) | Not captured by ranking metrics; a perfectly ranked but miscalibrated model can have AUC = 1 and still have large ECE |
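A sketch of computing a few of these in Python with scikit-learn, plus a hand-rolled binned ECE (scikit-learn has no built-in ECE function); labels and scores are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, ndcg_score

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.3, 0.2, 0.6, 0.4, 0.7, 0.1, 0.35])

auc_roc = roc_auc_score(y_true, y_score)            # ordering quality
auc_pr = average_precision_score(y_true, y_score)   # better reflects rare-positive problems
ndcg_at_5 = ndcg_score([y_true], [y_score], k=5)    # graded top-K ranking quality for one query

def expected_calibration_error(y, p, n_bins=10):
    """Weighted average of |observed positive rate - mean predicted probability| per score bin."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return ece

print(auc_roc, auc_pr, ndcg_at_5, expected_calibration_error(y_true, y_score))
```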
Online Metrics — Business Outcomes and Their Proxies
| Metric | What It Measures | Aligned With | Failure Mode / Caveat |
|---|---|---|---|
| CTR | Click-through rate: clicks / impressions | Short-term engagement | Clickbait maximizes CTR while degrading satisfaction. Never optimize CTR alone. |
| Conversion rate | Purchases (or signups) / clicks | Revenue generation | Can be confounded by pricing changes, promotions. Use incrementality experiments. |
| Session depth | Items consumed per session | Content discovery | A/B test needed — easy to game by auto-playing videos (forced consumption). |
| Watch-through rate | % of video content watched | Content quality / satisfaction | Strong signal — hard to fake. Correlates with long-term retention. |
| 30-day retention | % users active 30 days after first exposure | Long-term product health | Delayed signal. A/B tests need 30+ days. Hard to use for fast iteration. |
| Revenue per session | Revenue generated in a user session | Business bottom line | Noisy for individual sessions; needs large samples. Use as guard-rail, not primary. |
| Reformulation rate | % queries where user modifies query after seeing results | Search result quality (lower = better) | Excellent search metric — directly measures dissatisfaction. |
Position Bias — The Offline Evaluation Contaminator
Position bias is why offline metrics systematically mislead you about ranking model quality.
The problem: in your training/evaluation data, items that were shown in position 1 received 3× more clicks than the same items shown in position 5 — not because they were more relevant, but because users are more likely to click whatever is at the top. A model trained on this data learns: 'whatever the previous model put in position 1 tends to get clicked.' This is positional knowledge, not relevance knowledge.
How to detect it: split your offline evaluation data by the position at which each item was shown. If items shown at position 1 carry consistently higher click labels than comparable items shown at position 5 (even when controlling for item quality), you have position bias in your training data.
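A quick version of that check, assuming the eval log records the served position and a position-independent item quality score (column names are hypothetical):

```python
import pandas as pd

eval_log = pd.read_parquet("eval_impressions.parquet")  # hypothetical: (query, item, position, click, item_quality_score)

# Raw click rate by served position: a steep drop-off is expected, but it also means
# the click labels in this data partly encode position, not relevance.
print(eval_log.groupby("position")["click"].mean())

# Control for item quality: within the same quality bucket, CTR would be flat across
# positions if the labels were bias-free. It won't be.
eval_log["quality_bucket"] = pd.qcut(eval_log["item_quality_score"], q=5, labels=False, duplicates="drop")
print(eval_log.groupby(["quality_bucket", "position"])["click"].mean().unstack())
```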
Inverse Propensity Scoring (IPS) — the standard debiasing fix:
For each (user, item, position) training example, weight the example by 1/P(click | position). Items shown at position 1 (high P(click | position)) are down-weighted. Items shown at position 5 (low P(click | position)) are up-weighted. This corrects for the measurement bias introduced by the position-dependent click probability.
# IPS debiasing in the training loss: weight each example by 1 / P(click | position)
import math
# Empirical click propensities by display position (estimated from historical logs)
click_probability_by_position = {1: 0.15, 2: 0.09, 3: 0.07, 4: 0.05, 5: 0.04}
def cross_entropy(p_click, label):  # binary cross-entropy for a predicted click probability
    return -(label * math.log(p_click) + (1 - label) * math.log(1 - p_click))
loss = 0.0
for position, label, model_score in training_examples:  # (shown position, click label, predicted P(click))
    propensity = click_probability_by_position[position]
    ips_weight = 1.0 / propensity  # inverse of propensity
    loss += ips_weight * cross_entropy(model_score, label)
Alternatively: add position as an explicit training feature (position embedding or position index as input). Zero it out at serving time. The model learns the positional click bias during training but serves position-agnostic predictions.
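A minimal sketch of that alternative, with illustrative feature names: position is a real input during training and pinned to a constant at serving time so it cannot affect relative ordering.

```python
# Position as an explicit feature: learned at training time, neutralized at serving time
def build_features(example, serving=False):
    return {
        "user_item_affinity": example["affinity"],   # illustrative relevance features
        "item_ctr_7d": example["item_ctr_7d"],
        # Every candidate gets the same position value at serving, so the positional
        # click bias the model absorbed during training no longer shifts the ranking.
        "position": 0 if serving else example["position"],
    }
```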
Counterfactual Evaluation — Measuring Offline What Only A/B Tests Could
Counterfactual evaluation answers: 'How would a new policy have performed on historical data, corrected for the exploration policy used to collect that data?'
The standard approach is Doubly Robust (DR) estimation or IPS-based off-policy evaluation (OPE):
DR estimator: V_DR = (1/N) Σ [ v̂(x) + (π_new(a|x) / π(a|x)) · (r(a, x) − r̂(a, x)) ], summed over logged impressions
Where:
- r(a, x): observed reward (click) for action a (item shown) in context x
- π(a|x): probability that the logging policy showed item a in context x (must be known)
- π_new(a|x): probability that the new policy being evaluated would show item a in context x (0 or 1 for a deterministic policy)
- r̂(a, x): reward model's prediction for this (action, context) pair
- v̂(x): direct model estimate of the new policy's expected reward in context x, i.e. Σ_a π_new(a|x) · r̂(a, x)
This is doubly robust — the estimate is unbiased if either the reward model r̂ OR the logging propensities π are correct.
Practical requirement: This only works if you log the propensity scores of the logging policy for every served item. If you don't log π(a|x), you cannot compute IPS or DR. This is why mature recommendation teams log propensities alongside every impression: it enables offline evaluation of new policies without running an A/B test.
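A compact sketch of the DR estimate over such logged impressions, assuming each record carries the logged propensity and that the new policy is deterministic (field and function names are illustrative):

```python
import numpy as np

def doubly_robust_value(logged, new_policy, reward_model):
    """logged: iterable of dicts with keys context, action, reward, propensity (logging policy's P(a|x))."""
    estimates = []
    for rec in logged:
        x, a, r, p_log = rec["context"], rec["action"], rec["reward"], rec["propensity"]
        a_new = new_policy(x)                  # action the new policy would have taken
        v_hat = reward_model(x, a_new)         # direct (model-based) estimate of its reward
        correction = 0.0
        if a_new == a:                         # importance weight 1{a_new == a} / p_log
            correction = (r - reward_model(x, a)) / p_log
        estimates.append(v_hat + correction)
    return float(np.mean(estimates))
```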
Interleaving — a faster alternative to full A/B tests for ranking: Instead of routing users to either model A or model B, interleave the results from both models into a single ranked list for each user. Track which model's items get more clicks. Statistically much more efficient than A/B — requires 10–100× fewer users to reach the same confidence. Used heavily at Netflix and LinkedIn for ranking evaluation.
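One common variant is team-draft interleaving; a sketch under the assumption that each ranker returns an ordered list of item ids:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two rankings into one list, recording which model ('A' or 'B') contributed each item."""
    interleaved, credit, used = [], {}, set()
    picks_a = picks_b = 0
    while len(interleaved) < k:
        # The team with fewer picks drafts next; ties are broken by a coin flip
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        source, team = (ranking_a, "A") if a_turn else (ranking_b, "B")
        item = next((i for i in source if i not in used), None)
        if item is None:                       # that ranking is exhausted; fine for a sketch
            break
        interleaved.append(item)
        used.add(item)
        credit[item] = team
        picks_a += team == "A"
        picks_b += team == "B"
    return interleaved, credit

# At analysis time, count clicks per team across sessions: the model whose items win more clicks wins.
```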
The Novelty Effect — Why 1-Week A/B Tests Mislead
When a new recommendation model shows different (and better-ranked) content, users engage more for the first 1–2 weeks purely because everything is new and fresh. This inflates CTR and session metrics during the A/B test window.
After the novelty effect fades (week 3–4), engagement may return to baseline or even drop if the new model's underlying quality isn't better. Companies that ship models based on 1-week A/B test results regularly end up reverting 4 weeks later.
Fix: run A/B tests for at least 3–4 weeks before declaring significance. For major recommendation overhauls (new model architecture, not just a feature addition): run for 6–8 weeks. Look at the trend of metrics over the test period — is CTR still rising week 3? Or did it peak week 1 and come back down? A flattening or recovering metric after week 2 suggests a novelty effect, not a genuine quality improvement.
Metric Selection Framework for Interviews
In every ML system design interview, you'll need to specify both offline and online metrics. Use this structure:
Offline metrics (tell interviewers 'what I'm optimizing in training'):
- Binary classification (fraud, spam, churn): AUC-PR (not AUC-ROC — data is imbalanced)
- Retrieval: Recall@K
- Ranking: NDCG@10 for search, Precision@K for recommendation
- Regression (ETA prediction, bid price): MAE, MAPE, RMSE (choose based on tail sensitivity)
- Calibration: add ECE when predicted scores are used directly (not just for ranking)
Online metrics (tell interviewers 'how I measure success in production'):
- Primary metric: directly tied to user value and business outcome
- Guard-rail metrics (must NOT regress): revenue, latency P99, error rate
- Secondary metrics: leading indicators that move before primary metric
Always mention: 'I'll optimize the offline metric but validate that improvements transfer to online metrics in A/B tests, since offline and online metrics frequently diverge due to position bias, feedback loops, and novelty effects.'