Counterfactual Evaluation: IPS, SNIPS, and Doubly Robust Estimators
High-stakes production ML policy changes can't wait for a 2-week A/B test. This guide covers IPS, SNIPS, and Doubly Robust estimators used at Netflix, Pinterest, and Spotify — with propensity clipping, ESS diagnostics, and an OPE calibration framework.
The Core Problem: You Can't A/B Test Everything
The standard mental model for evaluating ML policy changes is A/B testing: split traffic, run both policies in parallel, measure outcomes, pick a winner. It's clean, unconfounded, and widely understood. But A/B testing has three failure modes that force production teams to use counterfactual evaluation instead.
Failure mode 1 — Bandwidth constraints: a mature recommendation system runs 50–200 concurrent A/B experiments. Each experiment needs dedicated traffic to reach statistical power. You simply cannot A/B test every candidate policy change — there isn't enough traffic. Pinterest's recommendation team, for instance, described in a 2022 engineering blog that they use off-policy evaluation to pre-filter thousands of policy candidates before running A/B tests on the top 5–10%.
Failure mode 2 — Irreversible or high-stakes decisions: changing a fraud detection policy, an ad pricing model, or a content moderation threshold can have immediate revenue and regulatory consequences. Running a live experiment that could expose 5 million users to a harmful policy change for 2 weeks is not acceptable. The evaluation needs to happen on historical data.
Failure mode 3 — Slow feedback cycles: some outcomes (long-term retention, 30-day LTV, subscription conversion) take weeks or months to materialize. An A/B test measuring 30-day retention would take months. Counterfactual evaluation lets you estimate the long-term value of a policy change from existing logged data without waiting.
The setting: you have a logging policy π_0 (your current production system) that collected a dataset of (context, action, reward, propensity) tuples, where the propensity is π_0(a|x), the probability the logging policy assigned to the action it actually took. You want to estimate the expected reward of a new policy π_1 without deploying it. This is the off-policy evaluation (OPE) problem.
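To make the estimators concrete, here is a minimal NumPy sketch of IPS and SNIPS on such logs. The function name and array names are illustrative, not from any particular production codebase; the only assumption is that each logged sample carries the observed reward, the logged propensity π_0(a|x), and the target policy's probability π_1(a|x) for that same logged action.

```python
import numpy as np

def ips_and_snips(rewards, logged_propensities, target_propensities):
    """Estimate the value of a target policy pi_1 from logs collected under pi_0.

    rewards[i]             : observed reward for the logged action a_i
    logged_propensities[i] : pi_0(a_i | x_i), recorded at serving time
    target_propensities[i] : pi_1(a_i | x_i), computed offline for the same logged action
    """
    w = target_propensities / logged_propensities  # importance weights

    # IPS: unbiased under the overlap assumption, but variance blows up
    # when a few samples carry very large weights.
    ips = np.mean(w * rewards)

    # SNIPS: normalize by the sum of weights -- introduces mild bias,
    # usually in exchange for a large variance reduction.
    snips = np.sum(w * rewards) / np.sum(w)

    return ips, snips

# Toy usage (the arrays would normally come from your logging pipeline):
rewards = np.array([1.0, 0.0, 1.0, 0.0])
logged_props = np.array([0.5, 0.2, 0.1, 0.4])
target_props = np.array([0.9, 0.1, 0.3, 0.4])
print(ips_and_snips(rewards, logged_props, target_props))
```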
What Staff-Level Interviews Test on Counterfactual Evaluation
At L5 (senior), interviewers expect you to know what counterfactual evaluation is and when A/B testing is insufficient. At L6+ (staff), they expect you to compare estimators by bias-variance tradeoff, know the propensity overlap assumption by name, and describe what propensity clipping does to bias vs. variance.
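As a minimal sketch of what propensity clipping and an effective-sample-size (ESS) diagnostic look like in practice, assuming the same per-sample arrays as above; the cap value and function name are illustrative choices, not a fixed recipe.

```python
import numpy as np

def clipped_ips_with_ess(rewards, logged_propensities, target_propensities, cap=10.0):
    """Clipped IPS plus an effective-sample-size (ESS) diagnostic.

    Clipping the importance weights at `cap` limits the variance contributed by
    rarely-logged actions, at the cost of a downward bias whenever the clipped
    weights carried real reward signal. A larger cap means less bias, more variance.
    """
    w = target_propensities / logged_propensities
    w_clipped = np.minimum(w, cap)

    estimate = np.mean(w_clipped * rewards)

    # ESS = (sum w)^2 / sum(w^2): roughly how many logged samples the estimate
    # effectively rests on. An ESS that is a tiny fraction of len(w) signals that
    # pi_1 is too far from pi_0 for the OPE estimate to be trustworthy.
    ess = w_clipped.sum() ** 2 / np.square(w_clipped).sum()

    return estimate, ess
```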
Signals that impress an L6 panel:
- Naming the overlap assumption (also called positivity): the new policy must not assign positive probability to actions that the logging policy never took
- Explaining why IPS is unbiased but high-variance, and why SNIPS reduces variance at the cost of mild bias
- Knowing the Doubly Robust estimator is unbiased if either the direct reward model or the propensity model is correct (the double robustness property; see the sketch after this list)
- Citing a concrete production example: Netflix uses DR estimators for offline evaluation of ranking policies; Spotify described IPS-based evaluation in their KDD 2020 paper on podcast recommendations
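A minimal sketch of the Doubly Robust estimator under the same assumptions as the earlier snippets. The `q_logged` and `v_target` inputs stand for a fitted reward model (the "direct" model) evaluated on the logged action and averaged under π_1 respectively; the names are illustrative.

```python
import numpy as np

def doubly_robust(rewards, logged_propensities, target_propensities, q_logged, v_target):
    """Doubly Robust estimate of the target policy's value.

    q_logged[i] : reward-model prediction q_hat(x_i, a_i) for the logged action
    v_target[i] : reward model averaged under the target policy,
                  sum over a of pi_1(a | x_i) * q_hat(x_i, a)

    DR = direct-method baseline + importance-weighted correction of the model's residual.
    """
    w = target_propensities / logged_propensities
    return np.mean(v_target + w * (rewards - q_logged))
```

When the reward model fits well, the residual `rewards - q_logged` is small, so the noisy importance-weighted term contributes little; when it fits poorly, the weighted correction still recovers an unbiased estimate provided the logged propensities are correct. That combination is what the double robustness property refers to.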
The one answer that fails instantly: "Off-policy evaluation is when you evaluate the model offline on a held-out test set." That's just offline evaluation. Off-policy evaluation specifically refers to correcting for the selection bias introduced by the logging policy.