Offline vs Online Evaluation: Why Metrics Disagree and What to Do About It
The most common ML interview trap: candidates optimize offline metrics but can't explain why they diverge from online results. Covers AUC vs CTR, NDCG vs session length, position bias, novelty effects, counterfactual evaluation, and the right metric for each stage of an ML system.
The Fundamental Gap Between Offline and Online Evaluation
The most common failure pattern in production ML: a model improves AUC by 0.02 in offline evaluation, gets deployed, and CTR drops 5%. Or the opposite: offline metrics are flat, but the A/B test shows a significant positive impact on revenue.
Why offline and online metrics diverge:
- Distribution shift: Offline evaluation is on historical data collected by the previous model. The new model will see different distributions because it serves different items. The historical data doesn't reflect what the new model will actually show.
- Position bias: Historical data records which items users clicked, but doesn't record that items in position 1 get ~3× more clicks than position 3, regardless of quality. An offline evaluation dataset contains this bias. A model that memorizes 'items the previous model showed in position 1 tend to get clicked' will score high offline but not improve the actual user experience.
- Feedback loops: The training data was collected by a model that already had its own biases — popular items were shown more, so they got more clicks, so they became stronger training signals. Offline evaluation on this biased data doesn't measure what users genuinely prefer.
- Novelty effect: When users see a new recommendation style, engagement spikes for 1–2 weeks (everything is fresh and novel) then returns to baseline or below. Offline metrics cannot capture this time-varying effect.
- Goodhart's Law: When a metric becomes a target, it ceases to be a good measure. Optimizing NDCG@10 hard enough will produce a model that ranks items with historically high CTR at the top, regardless of whether they satisfy the current user's intent.
Evaluation Strategy — Choosing the Right Metrics for Each System Stage
Identify your system's primary objective and translate it to a measurable metric
The objective must be measurable from logs. 'Maximize user satisfaction' is not measurable. 'Maximize session length without increasing opt-out rate' is measurable (session length from server logs; opt-out from user settings events). Write down the primary metric, the guardrail metrics (what must not regress), and the diagnostic metrics (the ones that help you understand why) before you write a single line of modeling code.
Choose the offline metric that best proxies the online metric
Calibrate your offline metric against historical A/B tests to confirm it's predictive. If your last 5 A/B tests showed that NDCG@10 improvements of >0.5% correlated with CTR improvements in 4 of 5 cases, NDCG@10 is a useful proxy. If NDCG@10 improvements showed no correlation to CTR, do not use it as a go/no-go gate. Use it as a diagnostic only. This calibration exercise is what separates mature ML organizations from those perpetually surprised by offline-online gaps.
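A minimal sketch of that calibration check, assuming a small table of past experiments with hypothetical column names (ndcg10_delta, ctr_delta) and illustrative numbers:

```python
# Does the offline metric's movement predict the online metric's movement across past A/B tests?
import pandas as pd
from scipy.stats import spearmanr

experiments = pd.DataFrame({
    "ndcg10_delta": [0.006, 0.012, -0.003, 0.008, 0.004],   # offline lift per past experiment
    "ctr_delta":    [0.010, 0.021, -0.001, 0.012, -0.002],  # online lift from the matching A/B test
})

rho, p_value = spearmanr(experiments["ndcg10_delta"], experiments["ctr_delta"])
sign_agreement = (experiments["ndcg10_delta"].gt(0) == experiments["ctr_delta"].gt(0)).mean()
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f}), sign agreement = {sign_agreement:.0%}")
```

If the sign agreement is close to a coin flip, treat the offline metric as a diagnostic, not a launch gate.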
Design your offline eval set to match production distribution
The eval set must be constructed from the same query distribution as production. For recommendation: sample eval queries from the same time period, device mix, and user-segment distribution as live traffic. Exclude queries that were over-represented in training due to data biases (e.g., popular items that dominate click logs). Use a temporal holdout (train on days 1–28, evaluate on days 29–30), not a random split — random splitting contaminates the eval set with items that were clicked both before and after the eval timestamp.
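A sketch of the temporal holdout under these constraints, assuming a pandas DataFrame of logged impressions with illustrative column names (ts, device, click):

```python
import pandas as pd

# Hypothetical impression log: one row per (user, item, device, click, ts)
logs = pd.read_parquet("interaction_logs.parquet")

# Temporal holdout: evaluate on the last 2 days, train on everything before the cutoff.
# A random split would leak items that were clicked both before and after the eval window.
cutoff = logs["ts"].max().normalize() - pd.Timedelta(days=2)
train = logs[logs["ts"] < cutoff]
eval_set = logs[logs["ts"] >= cutoff]

# Sanity-check that the eval slice matches the production device mix before trusting it
print(eval_set["device"].value_counts(normalize=True))
```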
Apply position debiasing before computing ranking metrics
Users click items at position 1 roughly three times more often than at position 3, regardless of quality. If your eval set was collected by a ranker that historically put certain items at position 1, those items look 'more relevant' in your eval data. Apply Inverse Propensity Scoring (IPS): weight each click by 1/P(click | position at which the item was shown). This debiases the eval set toward revealed preferences rather than positional artifacts.
Run A/B tests with sufficient duration to detect novelty and fatigue effects
Short A/B tests (1–2 weeks) capture the novelty effect — users engage more with anything new. Long tests (4+ weeks) show the steady-state impact. The rule of thumb: for user-facing ranking changes, run A/B tests for at least 2 full weeks, aligned Monday to Monday (this controls for weekly seasonality). For notification-frequency changes, run for 30 days minimum to capture fatigue effects. Auto-ship based on metrics only after the novelty-effect window has passed.
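Duration is also bounded below by statistical power. A rough sketch with statsmodels, using illustrative baseline CTR, detectable lift, and traffic numbers:

```python
# Minimum users per arm to detect a small relative CTR lift, translated into days of traffic
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05
relative_lift = 0.02                                   # want to detect a +2% relative CTR change
effect = proportion_effectsize(baseline_ctr * (1 + relative_lift), baseline_ctr)

users_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
daily_users_per_arm = 50_000                           # illustrative traffic assumption
print(f"{users_per_arm:,.0f} users per arm ≈ {users_per_arm / daily_users_per_arm:.1f} days")
```

Even when the power calculation says a few days suffice, the novelty and seasonality arguments above still set the minimum calendar duration.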
Offline Metrics — What They Measure and When They Fail
| Metric | Formula | What It Measures | When to Use | Failure Mode |
|---|---|---|---|---|
| AUC-ROC | Area under ROC curve | Ability to rank positives above negatives | Binary classification (fraud, churn) | Insensitive to calibration; only the ordering of scores matters, so absolute probabilities can be far off. Also looks optimistic on heavily imbalanced data |
| AUC-PR | Area under Precision-Recall curve | Performance on imbalanced datasets (positives are rare) | Fraud (1% positive rate), medical diagnosis | Dominated by precision at low recall; can miss recall improvements |
| NDCG@K | DCG@K divided by the ideal DCG@K | Ranking quality at top-K positions | Search, recommendation ranking | Position 1 gets discount 1/log2(2) = 1, position 5 gets 1/log2(6) ≈ 0.39 — heavily top-skewed; misses improvements below K |
| MAP | Mean Average Precision | Precision at all relevant recall points | Information retrieval, multi-label classification | Punishes any relevant item not retrieved; overly strict for recommendation |
| Precision@K | Relevant items in top K / K | Fraction of top-K that are relevant | Search result quality | Doesn't account for ordering within top-K |
| Recall@K | Relevant items in top K / total relevant | Coverage of relevant items | Retrieval stage evaluation | Ignores irrelevant items in the top K |
| Calibration (ECE) | Expected Calibration Error | Whether predicted probabilities match actual rates | When scores are used as probabilities (fraud p, bid price) | Not captured by ranking metrics; a perfectly ranked but miscalibrated model can have AUC = 1 and still have large ECE |
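A sketch of computing a few of these in Python with scikit-learn, plus a hand-rolled binned ECE (scikit-learn has no built-in ECE function); labels and scores are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, ndcg_score

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.3, 0.2, 0.6, 0.4, 0.7, 0.1, 0.35])

auc_roc = roc_auc_score(y_true, y_score)            # ordering quality
auc_pr = average_precision_score(y_true, y_score)   # better reflects rare-positive problems
ndcg_at_5 = ndcg_score([y_true], [y_score], k=5)    # graded top-K ranking quality for one query

def expected_calibration_error(y, p, n_bins=10):
    """Weighted average of |observed positive rate - mean predicted probability| per score bin."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return ece

print(auc_roc, auc_pr, ndcg_at_5, expected_calibration_error(y_true, y_score))
```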
Online Metrics — Business Outcomes and Their Proxies
| Metric | What It Measures | Aligned With | Failure Mode / Caveat |
|---|---|---|---|
| CTR | Click-through rate: clicks / impressions | Short-term engagement | Clickbait maximizes CTR while degrading satisfaction. Never optimize CTR alone. |
| Conversion rate | Purchases (or signups) / clicks | Revenue generation | Can be confounded by pricing changes, promotions. Use incrementality experiments. |
| Session depth | Items consumed per session | Content discovery | A/B test needed — easy to game by auto-playing videos (forced consumption). |
| Watch-through rate | % of video content watched | Content quality / satisfaction | Strong signal — hard to fake. Correlates with long-term retention. |
| 30-day retention | % users active 30 days after first exposure | Long-term product health | Delayed signal. A/B tests need 30+ days. Hard to use for fast iteration. |
| Revenue per session | Revenue generated in a user session | Business bottom line | Noisy for individual sessions; needs large samples. Use as guard-rail, not primary. |
| Reformulation rate | % queries where user modifies query after seeing results | Search result quality (lower = better) | Excellent search metric — directly measures dissatisfaction. |
Position Bias — The Offline Evaluation Contaminator
Position bias is why offline metrics systematically mislead you about ranking model quality.
The problem: in your training/evaluation data, items that were shown in position 1 received 3× more clicks than the same items shown in position 5 — not because they were more relevant, but because users are more likely to click whatever is at the top. A model trained on this data learns: 'whatever the previous model put in position 1 tends to get clicked.' This is positional knowledge, not relevance knowledge.
How to detect it: split your offline evaluation data by the position at which each item was shown. If items shown at position 1 carry consistently higher click labels than comparable items shown at position 5 (even when controlling for item quality), you have position bias in your training data.
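A quick version of that check, assuming the eval log records the served position and a position-independent item quality score (column names are hypothetical):

```python
import pandas as pd

eval_log = pd.read_parquet("eval_impressions.parquet")  # hypothetical: (query, item, position, click, item_quality_score)

# Raw click rate by served position: a steep drop-off is expected, but it also means
# the click labels in this data partly encode position, not relevance.
print(eval_log.groupby("position")["click"].mean())

# Control for item quality: within the same quality bucket, CTR would be flat across
# positions if the labels were bias-free. It won't be.
eval_log["quality_bucket"] = pd.qcut(eval_log["item_quality_score"], q=5, labels=False, duplicates="drop")
print(eval_log.groupby(["quality_bucket", "position"])["click"].mean().unstack())
```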
Inverse Propensity Scoring (IPS) — the standard debiasing fix:
For each (user, item, position) training example, weight the example by 1/P(click | position). Items shown at position 1 (high P(click | position)) are down-weighted. Items shown at position 5 (low P(click | position)) are up-weighted. This corrects for the measurement bias introduced by the position-dependent click probability.
# IPS debiasing in the training loss: weight each example by 1 / P(click | position)
import math
# Empirical click propensities by display position (estimated from historical logs)
click_probability_by_position = {1: 0.15, 2: 0.09, 3: 0.07, 4: 0.05, 5: 0.04}
def cross_entropy(p_click, label):  # binary cross-entropy for a predicted click probability
    return -(label * math.log(p_click) + (1 - label) * math.log(1 - p_click))
loss = 0.0
for position, label, model_score in training_examples:  # (shown position, click label, predicted P(click))
    propensity = click_probability_by_position[position]
    ips_weight = 1.0 / propensity  # inverse of propensity
    loss += ips_weight * cross_entropy(model_score, label)
Alternatively: add position as an explicit training feature (position embedding or position index as input). Zero it out at serving time. The model learns the positional click bias during training but serves position-agnostic predictions.
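A minimal sketch of that alternative, with illustrative feature names: position is a real input during training and pinned to a constant at serving time so it cannot affect relative ordering.

```python
# Position as an explicit feature: learned at training time, neutralized at serving time
def build_features(example, serving=False):
    return {
        "user_item_affinity": example["affinity"],   # illustrative relevance features
        "item_ctr_7d": example["item_ctr_7d"],
        # Every candidate gets the same position value at serving, so the positional
        # click bias the model absorbed during training no longer shifts the ranking.
        "position": 0 if serving else example["position"],
    }
```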
Counterfactual Evaluation — Measuring Offline What Only A/B Tests Could
Counterfactual evaluation answers: 'How would a new policy have performed on historical data, corrected for the exploration policy used to collect that data?'
The standard approach is Doubly Robust (DR) estimation or IPS-based off-policy evaluation (OPE):
DR estimator: V_DR = (1/N) Σ [ v̂(x) + (π_new(a|x) / π(a|x)) · (r(a, x) − r̂(a, x)) ], summed over logged impressions
Where:
- r(a, x): observed reward (click) for action a (item shown) in context x
- π(a|x): probability that the logging policy showed item a in context x (must be known)
- π_new(a|x): probability that the new policy being evaluated would show item a in context x (0 or 1 for a deterministic policy)
- r̂(a, x): reward model's prediction for this (action, context) pair
- v̂(x): direct model estimate of the new policy's expected reward in context x, i.e. Σ_a π_new(a|x) · r̂(a, x)
This is doubly robust — the estimate is unbiased if either the reward model r̂ OR the logging propensities π are correct.
Practical requirement: This only works if you log the propensity scores of the logging policy for every served item. If you don't log π(a|x), you cannot compute IPS or DR. This is why mature recommendation teams log propensities alongside every impression: it enables offline evaluation of new policies without running an A/B test.
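A compact sketch of the DR estimate over such logged impressions, assuming each record carries the logged propensity and that the new policy is deterministic (field and function names are illustrative):

```python
import numpy as np

def doubly_robust_value(logged, new_policy, reward_model):
    """logged: iterable of dicts with keys context, action, reward, propensity (logging policy's P(a|x))."""
    estimates = []
    for rec in logged:
        x, a, r, p_log = rec["context"], rec["action"], rec["reward"], rec["propensity"]
        a_new = new_policy(x)                  # action the new policy would have taken
        v_hat = reward_model(x, a_new)         # direct (model-based) estimate of its reward
        correction = 0.0
        if a_new == a:                         # importance weight 1{a_new == a} / p_log
            correction = (r - reward_model(x, a)) / p_log
        estimates.append(v_hat + correction)
    return float(np.mean(estimates))
```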
Interleaving — a faster alternative to full A/B tests for ranking: Instead of routing users to either model A or model B, interleave the results from both models into a single ranked list for each user. Track which model's items get more clicks. Statistically much more efficient than A/B — requires 10–100× fewer users to reach the same confidence. Used heavily at Netflix and LinkedIn for ranking evaluation.
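One common variant is team-draft interleaving; a sketch under the assumption that each ranker returns an ordered list of item ids:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two rankings into one list, recording which model ('A' or 'B') contributed each item."""
    interleaved, credit, used = [], {}, set()
    picks_a = picks_b = 0
    while len(interleaved) < k:
        # The team with fewer picks drafts next; ties are broken by a coin flip
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        source, team = (ranking_a, "A") if a_turn else (ranking_b, "B")
        item = next((i for i in source if i not in used), None)
        if item is None:                       # that ranking is exhausted; fine for a sketch
            break
        interleaved.append(item)
        used.add(item)
        credit[item] = team
        picks_a += team == "A"
        picks_b += team == "B"
    return interleaved, credit

# At analysis time, count clicks per team across sessions: the model whose items win more clicks wins.
```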
The Novelty Effect — Why 1-Week A/B Tests Mislead
When a new recommendation model shows different (and better-ranked) content, users engage more for the first 1–2 weeks purely because everything is new and fresh. This inflates CTR and session metrics during the A/B test window.
After the novelty effect fades (week 3–4), engagement may return to baseline or even drop if the new model's underlying quality isn't better. Companies that ship models based on 1-week A/B test results regularly end up reverting 4 weeks later.
Fix: run A/B tests for at least 3–4 weeks before declaring significance. For major recommendation overhauls (new model architecture, not just a feature addition): run for 6–8 weeks. Look at the trend of metrics over the test period — is CTR still rising week 3? Or did it peak week 1 and come back down? A flattening or recovering metric after week 2 suggests a novelty effect, not a genuine quality improvement.
Metric Selection Framework for Interviews
In every ML system design interview, you'll need to specify both offline and online metrics. Use this structure:
Offline metrics (tell interviewers 'what I'm optimizing in training'):
- Binary classification (fraud, spam, churn): AUC-PR (not AUC-ROC — data is imbalanced)
- Retrieval: Recall@K
- Ranking: NDCG@10 for search, Precision@K for recommendation
- Regression (ETA prediction, bid price): MAE, MAPE, RMSE (choose based on tail sensitivity)
- Calibration: add ECE when predicted scores are used directly (not just for ranking)
Online metrics (tell interviewers 'how I measure success in production'):
- Primary metric: directly tied to user value and business outcome
- Guard-rail metrics (must NOT regress): revenue, latency P99, error rate
- Secondary metrics: leading indicators that move before primary metric
Always mention: 'I'll optimize the offline metric but validate that improvements transfer to online metrics in A/B tests, since offline and online metrics frequently diverge due to position bias, feedback loops, and novelty effects.'