Counterfactual Evaluation: IPS, SNIPS, and Doubly Robust Estimators
High-stakes production ML policy changes can't wait for a 2-week A/B test. This guide covers IPS, SNIPS, and Doubly Robust estimators used at Netflix, Pinterest, and Spotify — with propensity clipping, ESS diagnostics, and an OPE calibration framework.
The Core Problem: You Can't A/B Test Everything
The standard mental model for evaluating ML policy changes is A/B testing: split traffic, run both policies in parallel, measure outcomes, pick a winner. It's clean, unconfounded, and widely understood. But A/B testing has three failure modes that force production teams to use counterfactual evaluation instead.
Failure mode 1 — Bandwidth constraints: a mature recommendation system runs 50–200 concurrent A/B experiments. Each experiment needs dedicated traffic to reach statistical power. You simply cannot A/B test every candidate policy change — there isn't enough traffic. Pinterest's recommendation team, for instance, described in a 2022 engineering blog that they use off-policy evaluation to pre-filter thousands of policy candidates before running A/B tests on the top 5–10%.
Failure mode 2 — Irreversible or high-stakes decisions: changing a fraud detection policy, an ad pricing model, or a content moderation threshold can have immediate revenue and regulatory consequences. Running a live experiment that could expose 5 million users to a harmful policy change for 2 weeks is not acceptable. The evaluation needs to happen on historical data.
Failure mode 3 — Slow feedback cycles: some outcomes (long-term retention, 30-day LTV, subscription conversion) take weeks or months to materialize. An A/B test measuring 30-day retention would take months. Counterfactual evaluation lets you estimate the long-term value of a policy change from existing logged data without waiting.
The setting: you have a logging policy π_0 (your current production system) that collected a dataset of (context, action, reward, propensity) tuples, where the propensity is π_0(a|x), the probability the logging policy assigned to the action it actually took. You want to estimate the expected reward of a new policy π_1 without deploying it. This is the off-policy evaluation (OPE) problem.
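To make the estimators concrete, here is a minimal NumPy sketch of IPS and SNIPS on such logs. The function name and array names are illustrative, not from any particular production codebase; the only assumption is that each logged sample carries the observed reward, the logged propensity π_0(a|x), and the target policy's probability π_1(a|x) for that same logged action.

```python
import numpy as np

def ips_and_snips(rewards, logged_propensities, target_propensities):
    """Estimate the value of a target policy pi_1 from logs collected under pi_0.

    rewards[i]             : observed reward for the logged action a_i
    logged_propensities[i] : pi_0(a_i | x_i), recorded at serving time
    target_propensities[i] : pi_1(a_i | x_i), computed offline for the same logged action
    """
    w = target_propensities / logged_propensities  # importance weights

    # IPS: unbiased under the overlap assumption, but variance blows up
    # when a few samples carry very large weights.
    ips = np.mean(w * rewards)

    # SNIPS: normalize by the sum of weights -- introduces mild bias,
    # usually in exchange for a large variance reduction.
    snips = np.sum(w * rewards) / np.sum(w)

    return ips, snips

# Toy usage (the arrays would normally come from your logging pipeline):
rewards = np.array([1.0, 0.0, 1.0, 0.0])
logged_props = np.array([0.5, 0.2, 0.1, 0.4])
target_props = np.array([0.9, 0.1, 0.3, 0.4])
print(ips_and_snips(rewards, logged_props, target_props))
```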
What Staff-Level Interviews Test on Counterfactual Evaluation
At L5 (senior), interviewers expect you to know what counterfactual evaluation is and when A/B testing is insufficient. At L6+ (staff), they expect you to compare estimators by bias-variance tradeoff, know the propensity overlap assumption by name, and describe what propensity clipping does to bias vs. variance.
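As a minimal sketch of what propensity clipping and an effective-sample-size (ESS) diagnostic look like in practice, assuming the same per-sample arrays as above; the cap value and function name are illustrative choices, not a fixed recipe.

```python
import numpy as np

def clipped_ips_with_ess(rewards, logged_propensities, target_propensities, cap=10.0):
    """Clipped IPS plus an effective-sample-size (ESS) diagnostic.

    Clipping the importance weights at `cap` limits the variance contributed by
    rarely-logged actions, at the cost of a downward bias whenever the clipped
    weights carried real reward signal. A larger cap means less bias, more variance.
    """
    w = target_propensities / logged_propensities
    w_clipped = np.minimum(w, cap)

    estimate = np.mean(w_clipped * rewards)

    # ESS = (sum w)^2 / sum(w^2): roughly how many logged samples the estimate
    # effectively rests on. An ESS that is a tiny fraction of len(w) signals that
    # pi_1 is too far from pi_0 for the OPE estimate to be trustworthy.
    ess = w_clipped.sum() ** 2 / np.square(w_clipped).sum()

    return estimate, ess
```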
Signals that impress an L6 panel:
- Naming the overlap assumption (also called positivity): the new policy must not assign positive probability to actions that the logging policy never took
- Explaining why IPS is unbiased but high-variance, and why SNIPS reduces variance at the cost of mild bias
- Knowing the Doubly Robust estimator is unbiased if either the direct reward model or the propensity model is correct (the double robustness property; see the sketch after this list)
- Citing a concrete production example: Netflix uses DR estimators for offline evaluation of ranking policies; Spotify described IPS-based evaluation in their KDD 2020 paper on podcast recommendations
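A minimal sketch of the Doubly Robust estimator under the same assumptions as the earlier snippets. The `q_logged` and `v_target` inputs stand for a fitted reward model (the "direct" model) evaluated on the logged action and averaged under π_1 respectively; the names are illustrative.

```python
import numpy as np

def doubly_robust(rewards, logged_propensities, target_propensities, q_logged, v_target):
    """Doubly Robust estimate of the target policy's value.

    q_logged[i] : reward-model prediction q_hat(x_i, a_i) for the logged action
    v_target[i] : reward model averaged under the target policy,
                  sum over a of pi_1(a | x_i) * q_hat(x_i, a)

    DR = direct-method baseline + importance-weighted correction of the model's residual.
    """
    w = target_propensities / logged_propensities
    return np.mean(v_target + w * (rewards - q_logged))
```

When the reward model fits well, the residual `rewards - q_logged` is small, so the noisy importance-weighted term contributes little; when it fits poorly, the weighted correction still recovers an unbiased estimate provided the logged propensities are correct. That combination is what the double robustness property refers to.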
The one answer that fails instantly: "Off-policy evaluation is when you evaluate the model offline on a held-out test set." That's just offline evaluation. Off-policy evaluation specifically refers to correcting for the selection bias introduced by the logging policy.