
A/B Testing for ML Systems: Design, Statistical Rigor & Production Pitfalls

How top ML teams run experiments that actually produce trustworthy conclusions — sample size calculation, randomization units, guard rails, CUPED variance reduction, network effects, and the organizational mistakes that make most A/B tests misleading.


What Most Teams Get Wrong About A/B Testing

A/B testing for ML models is not the same as testing a UI change. ML model A/B tests have unique failure modes that quietly invalidate conclusions, even at mature ML teams.

The four ways ML A/B tests mislead:

  1. Wrong randomization unit: Randomizing by session when measuring by user creates Simpson's Paradox. Randomizing by user when effects are social/collaborative creates contamination. The randomization unit must match the unit of analysis AND avoid interference between groups.

  2. Novelty effect: New model shows different content → users click more in weeks 1–2 simply because content is fresh. The underlying quality hasn't improved. Insufficient test duration (< 3 weeks) ships models based on novelty, not quality.

  3. Multiple comparisons: A team running 20 metrics with α=0.05 has a 64% chance of at least one false positive even if the model does nothing (see the quick check after this list). Without correction, teams cherry-pick the one significant metric.

  4. Insufficient sample size: Pre-computing sample size is skipped. Test runs for 'a few days' until it 'looks good.' This is p-hacking — stopping when you see significance inflates false positive rate dramatically.
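
A quick check of the multiple-comparisons arithmetic, as a minimal sketch (the 20 metrics and α = 0.05 are the numbers from the example above):

  # Family-wise error rate: P(at least one false positive) across m
  # independent tests when every null hypothesis is true.
  alpha = 0.05
  for m in (1, 5, 10, 20):
      print(f"{m:>2} metrics -> P(>=1 false positive) = {1 - (1 - alpha) ** m:.0%}")
  # 20 metrics -> 64%, the figure quoted above.
  # Bonferroni's fix is to test each metric at alpha / m instead.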

Knowing these failure modes and how to prevent them is what separates candidates who've worked with real experimental infrastructure from those who've only read about A/B testing.

The 5-Step ML A/B Test Design Process


Step 1: Define the experiment before starting

  • Primary metric: the ONE metric that determines whether you ship. It must be causally linked to business value (not just a proxy).
  • Guard-rail metrics: metrics that MUST NOT degrade (revenue, P99 latency, error rate).
  • Secondary metrics: directional signals worth monitoring.
  • Minimum Detectable Effect (MDE): the smallest improvement that's business-meaningful. If a 0.1% CTR lift is not worth the engineering effort to ship, don't stop the test at a 0.1% lift even if it's significant.
  • Significance level α = 0.05 and power 1-β = 0.80 (industry standard).
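
One way to make the pre-registration concrete is to freeze the design as a config object before launch. A minimal sketch; all field names and thresholds are illustrative:

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class ExperimentDesign:
      # Pre-registered before launch; never edited after results are visible.
      name: str
      primary_metric: str                   # the single ship/no-ship metric
      guardrail_metrics: dict[str, float]   # metric -> worst tolerated change
      secondary_metrics: list[str]
      mde: float                            # minimum detectable effect, absolute
      alpha: float = 0.05
      power: float = 0.80

  design = ExperimentDesign(
      name="ranker_v2_vs_v1",
      primary_metric="ctr",
      guardrail_metrics={"revenue_per_session": -0.005,   # max -0.5% (relative)
                         "p99_latency_ms": 50.0,          # max +50 ms
                         "error_rate": 0.001},            # max +0.1% (absolute)
      secondary_metrics=["dwell_time", "saves_per_session"],
      mde=0.005,  # only a >= 0.5% absolute CTR lift justifies shipping
  )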


Step 2: Calculate sample size before running

Two-sample z-test formula for proportions: n = 2 × p(1-p) × (z_α/2 + z_β)² / δ², where p = baseline CTR, δ = MDE (absolute). Example: p=0.05 (5% CTR), MDE=0.5% absolute lift, α=0.05, power=80%: n = 2 × 0.05×0.95 × (1.96+0.84)² / 0.005² ≈ 30,000 users per variant. With 100K DAU and 5% of traffic allocated to each variant (~5K users/day per variant), that's roughly 6 days, assuming mostly distinct users each day. With a 0.1% MDE: n ≈ 745K users per variant — a 25× increase, because sample size scales with 1/δ² — so even a full 50/50 split of 100K DAU takes about 15 days. Never start a test without this calculation.
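
That calculation in code, as a minimal sketch (scipy supplies the z-quantiles; the example numbers match the ones above):

  from scipy.stats import norm

  def sample_size_proportion(p: float, mde: float,
                             alpha: float = 0.05, power: float = 0.80) -> int:
      """Per-variant sample size for a two-sample test of proportions.

      p   : baseline rate (e.g. 0.05 for a 5% CTR)
      mde : minimum detectable effect, absolute (e.g. 0.005 for +0.5%)
      """
      z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
      z_beta = norm.ppf(power)            # 0.84 for 80% power
      return round(2 * p * (1 - p) * (z_alpha + z_beta) ** 2 / mde ** 2)

  print(sample_size_proportion(0.05, 0.005))  # ~29,800 per variant
  print(sample_size_proportion(0.05, 0.001))  # ~745,600 per variant (25x more)
  print(sample_size_proportion(0.02, 0.001))  # ~307,700 per variant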


Step 3: Choose the randomization unit carefully

Randomization unit = the entity assigned to treatment or control. Randomize by user_id for user-level effects. Use deterministic hash: variant = hash(user_id + experiment_id + salt) % 100. The same user must ALWAYS see the same variant — sticky assignment. Randomizing by session creates contamination (same user sees both variants across sessions). For marketplace systems (Uber, Airbnb): randomizing by user creates network effects because supply is shared. Use geo-based or time-based randomization instead.
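
A sketch of sticky assignment. One real-world detail: Python's built-in hash() is salted per process, so a stable digest such as MD5 has to be used instead. The 5%/5% split and the salt are illustrative:

  import hashlib

  def assign_variant(user_id: str, experiment_id: str, salt: str = "2024-q3") -> str:
      """Deterministic, sticky bucketing: same inputs always give the same variant."""
      key = f"{user_id}:{experiment_id}:{salt}".encode()
      # Built-in hash() is randomized per process; use a stable digest instead.
      bucket = int(hashlib.md5(key).hexdigest(), 16) % 100
      if bucket < 5:
          return "treatment"        # 5% of users
      if bucket < 10:
          return "control"          # matching 5%
      return "not_in_experiment"    # remaining 90%

  # Sticky: repeated calls for the same user never flip the variant.
  assert assign_variant("u123", "ranker_v2") == assign_variant("u123", "ranker_v2")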


Step 4: Set guard rails and define stopping rules in advance

Guard rails: 'Ship only if revenue/session doesn't decrease by > 0.5%, P99 latency doesn't increase by > 50ms, error rate doesn't increase by > 0.1%.' Pre-register these thresholds before looking at results. Define when you will stop early for harm: if any guard rail metric degrades past a threshold with p < 0.01, stop immediately. Do NOT stop early for success (peeking inflates false positive rate by 5-10×).
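
A sketch of the pre-registered harm check (thresholds are illustrative; the p < 0.01 early-stop rule is the one stated above):

  GUARDRAILS = {                      # pre-registered, same thresholds as above
      "revenue_per_session": -0.005,  # harmed if relative change < -0.5%
      "p99_latency_ms": 50.0,         # harmed if change > +50 ms
      "error_rate": 0.001,            # harmed if change > +0.1%
  }
  HARM_P = 0.01                       # stop early for harm only, never for success

  def should_stop_early(observed: dict[str, tuple[float, float]]) -> bool:
      """observed maps metric -> (change vs control, p-value of that change)."""
      for metric, limit in GUARDRAILS.items():
          change, p_value = observed[metric]
          harmed = change < limit if limit < 0 else change > limit
          if harmed and p_value < HARM_P:
              return True
      return False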


Step 5: Analyze results correctly

Apply Bonferroni correction or Benjamini-Hochberg FDR correction if testing multiple metrics (divide α by the number of comparisons, or control the false discovery rate). Report effect size (relative lift %) alongside the p-value — statistically significant doesn't mean practically significant. Run a sanity check: does an A/A test (both variants identical) show a balanced split and no metric difference? Use two-sided tests unless you have a strong prior that the effect is positive. Practical significance threshold: the effect must exceed the MDE AND p < α.
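
A sketch of the Benjamini-Hochberg step-up procedure (in practice, statsmodels' multipletests(pvals, method="fdr_bh") does the same thing):

  def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
      """Return a reject/keep decision per p-value, controlling FDR at level q."""
      m = len(p_values)
      order = sorted(range(m), key=lambda i: p_values[i])
      # Find the largest rank k with p_(k) <= (k/m) * q; reject hypotheses 1..k.
      k_max = 0
      for rank, i in enumerate(order, start=1):
          if p_values[i] <= rank / m * q:
              k_max = rank
      reject = [False] * m
      for rank, i in enumerate(order, start=1):
          if rank <= k_max:
              reject[i] = True
      return reject

  # 20 metrics, one genuinely strong effect among noise:
  pvals = [0.001, 0.04] + [0.2] * 18
  print(benjamini_hochberg(pvals))  # only the strongest effect survives correction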

Sample Size Calculation — The Numbers That Matter

Sample size calculation is the step that most candidates skip and most practitioners get wrong. Here's how to do it correctly for common ML metrics:

For proportions (CTR, conversion rate): n = (z_α/2 + z_β)² × 2p(1-p) / δ²

Where δ = absolute MDE. For α=0.05, power=80%: (z_α/2 + z_β)² = (1.96 + 0.84)² = 7.84.

Practical examples:

  • Baseline CTR 5%, MDE 0.5% absolute: n ≈ 30,000 per variant
  • Baseline CTR 5%, MDE 0.1% absolute: n ≈ 745,000 per variant
  • Baseline conversion rate 2%, MDE 0.1% absolute: n ≈ 308,000 per variant

For continuous metrics (revenue, session length): n = 2σ²(z_α/2 + z_β)² / δ²

Where σ is the standard deviation of the metric. Revenue is typically highly skewed (long tail from high-value users) → high σ → very large n required. CUPED (see below) reduces σ and thus reduces required n.
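
The same calculation in code (a sketch; the revenue σ values are invented to show how skew inflates n, and the CUPED line anticipates the next section):

  from scipy.stats import norm

  def sample_size_continuous(sigma: float, mde: float,
                             alpha: float = 0.05, power: float = 0.80) -> int:
      """Per-variant n for a continuous metric with standard deviation sigma."""
      z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
      return round(2 * sigma ** 2 * z ** 2 / mde ** 2)

  # Hypothetical skewed revenue metric: sigma = $8.00, detect a +$0.05 lift:
  print(sample_size_continuous(8.0, 0.05))   # ~400K users per variant
  # After CUPED with rho = 0.8: sigma_adj = 8 * sqrt(1 - 0.64) = 4.8
  print(sample_size_continuous(4.8, 0.05))   # ~145K users per variant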

Key insight: MDE drives sample size quadratically. Halving MDE (detecting 0.5% lift instead of 1%) requires 4× more users. This is the fundamental tension between experiment sensitivity and speed.

CUPED — Variance Reduction for Faster, Smaller Experiments

CUPED (Controlled-experiment Using Pre-Experiment Data) is Microsoft's technique for dramatically reducing the variance of metric estimates without changing sample size — effectively making experiments 2–4× more sensitive (Deng, Xu, Kohavi, Walker, 2013 — https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf).

The insight: Much of the variance in a metric like revenue/session comes from pre-existing user heterogeneity — some users are always high spenders, others always low. This variance is not related to the treatment effect. We can 'subtract out' this pre-existing variance using pre-experiment covariate data.

CUPED adjustment: Y_CUPED = Y - θ × X_pre

Where:

  • Y = observed metric during experiment (revenue)
  • X_pre = same user's metric in a control period before the experiment
  • θ = Cov(Y, X_pre) / Var(X_pre) (regression coefficient)

Variance of Y_CUPED = Var(Y) × (1 - ρ²) where ρ is the correlation between Y and X_pre.

In practice: if a user's past-7-day revenue correlates with their in-experiment revenue at ρ=0.8, CUPED cuts variance to 1 - 0.8² = 36% of its original value, a 64% reduction — the same sensitivity as running the experiment with ~2.8× more users. At scale, CUPED lets teams run experiments substantially faster or detect smaller effects.

Implementation: compute X_pre for every user (their metric value in the 7 days before the experiment starts). Compute θ using the holdout set. Apply adjustment per user. Run standard two-sample t-test on Y_CUPED values.
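
A numpy sketch of that recipe on simulated data (the gamma and noise parameters are invented; θ is estimated on pooled data in this sketch, which doesn't bias the estimate because X_pre predates assignment):

  import numpy as np
  from scipy import stats

  def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
      """theta = Cov(Y, X_pre) / Var(X_pre); centering X_pre preserves Y's mean."""
      theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
      return y - theta * (x_pre - x_pre.mean())

  rng = np.random.default_rng(0)
  n = 5_000                                       # users per variant
  x = rng.gamma(2.0, 5.0, 2 * n)                  # pre-period revenue (skewed)
  treat = np.arange(2 * n) < n                    # first half = treatment
  y = 0.8 * x + rng.normal(0.0, 3.0, 2 * n) + 0.2 * treat   # true lift = +0.2

  y_adj = cuped_adjust(y, x)
  print(stats.ttest_ind(y[treat], y[~treat]).pvalue)          # raw: usually misses
  print(stats.ttest_ind(y_adj[treat], y_adj[~treat]).pvalue)  # CUPED: far smaller p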

CUPED is used in production at Microsoft, Airbnb (alongside difference-in-differences), Netflix, and LinkedIn. It's a credible signal in interviews that you've worked with serious experimentation infrastructure.

⚠ WARNING

Network Effects — When A/B Tests Are Fundamentally Invalid

For two-sided marketplaces (Uber, Airbnb, LinkedIn Jobs) and social networks, standard user-level A/B tests are invalid because treatment and control groups are not independent.

Example: Uber tests a new pricing algorithm. Treatment users get lower prices. They take more rides. But the driver supply is shared with control users — so drivers in treatment areas are busy, leading to higher wait times for control users. Control users see a degraded experience because of the treatment. The ATE (average treatment effect) estimate is biased.

Geo-based experiments: Randomly assign geographic areas (cities, zip codes) to treatment or control. Supply and demand are more isolated geographically. Used by Uber, Lyft, Airbnb, DoorDash. Downside: fewer experimental units (cities vs users), so harder to achieve statistical power.

Switchback (time-based) experiments: Alternate between treatment and control over time periods within the same geographic area. Treatment window: 30 minutes. Control window: 30 minutes. Randomize which starts first. Reduces supply/demand contamination. Used by DoorDash, Instacart. Requires careful modeling of carryover effects.
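
A sketch of switchback assignment; here each (region, 30-minute window) cell is hashed to a variant, a common alternative to strict alternation (names are illustrative):

  import hashlib
  from datetime import datetime, timezone

  WINDOW_MIN = 30  # 30-minute switchback windows, as above

  def switchback_variant(region: str, ts: datetime, experiment_id: str) -> str:
      """Assign a (region, time-window) cell to treatment or control.

      Hashing the window index gives a deterministic, pseudo-random schedule;
      every request in the same region and window sees the same variant."""
      window_idx = int(ts.timestamp() // (WINDOW_MIN * 60))
      key = f"{region}:{window_idx}:{experiment_id}".encode()
      bucket = int(hashlib.md5(key).hexdigest(), 16) % 2
      return "treatment" if bucket == 0 else "control"

  print(switchback_variant("sf_bay", datetime.now(timezone.utc), "pricing_v3"))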

For recommendation systems with social features: If treatment users see content from creators they're more likely to follow, and following generates content for all users → use holdout groups (a permanent 1–5% holdout never exposed to any experiment) to estimate the long-term total treatment effect.

A/B Test Infrastructure for ML Models

