
A/B Testing & Experimentation at Scale

End-to-end A/B testing framework used at top tech companies — from experiment design and sample size calculation to statistical analysis, multiple comparisons, novelty effects, and causal inference when randomization isn't possible.


A/B Test Lifecycle — The 6-Step Framework


The Experiment Lifecycle

Every A/B test at a tech company follows the same 6-step lifecycle. Skipping any step is a common source of invalid results.

  1. Hypothesis → 2. Power Analysis (sample size) → 3. Randomization design → 4. Instrumentation → 5. Statistical analysis → 6. Decision

Step 1 — Define What You're Testing

1. Primary Metric: The single metric that determines success. Choose one that is directly linked to business value and sensitive enough to detect the change. Avoid vanity metrics (page views) and composite metrics.

2. Guardrail Metrics: Metrics that must not regress, e.g., latency p99, crash rate, subscription cancellation. A positive primary metric does not override a guardrail regression.

3. Secondary Metrics: Supporting evidence. Track them to understand the mechanism, not to make the shipping decision; keeping them out of the decision reduces the multiple-comparisons risk.

4. Hypothesis Statement: State it as a sentence: "Treatment X will increase primary metric Y by at least Z% among population P." The Z% is your Minimum Detectable Effect (MDE), the smallest effect that matters to the business.

Step 2 — Sample Size Calculation

Calculate the required sample size BEFORE running the experiment. Running until p < 0.05 is p-hacking.

Formula (per variant):

n = 2 × (z_α/2 + z_β)² × σ² / δ²

Parameters:

- α = 0.05 (significance level)
- β = 0.20 (20% false-negative rate → 80% power)
- σ = standard deviation of the metric
- δ = MDE (the minimum effect you care about detecting)

Example: conversion-rate baseline 10%, MDE = 1% absolute, σ = √(0.1 × 0.9) = 0.3 → n ≈ 14,100 per variant → 28,200 total users.

CUPED variance reduction: if you have pre-experiment metric data, you can reduce effective variance by 20-50% using CUPED, proportionally shortening the experiment duration.
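Here is a minimal sketch of this calculation in Python (the function name is illustrative; it assumes scipy is available):

```python
from scipy.stats import norm

def sample_size_per_variant(sigma: float, mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2, per variant."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    return round(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

# Worked example from above: baseline conversion 10%, MDE 1% absolute
sigma = (0.10 * 0.90) ** 0.5                     # ~0.3
print(sample_size_per_variant(sigma, mde=0.01))  # ~14,100 per variant
```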

Unit of Randomization Trade-offs

| Unit | Best For | Risk |
| --- | --- | --- |
| User ID | Feature changes and UI tests | Long-term carry-over between experiments |
| Session ID | Session-level features (search / ads per session) | Same user can see both variants across sessions |
| Request ID | Infrastructure A/B tests | Very noisy — variance is high |
| Device ID | Mobile experiments | Cross-device users appear in both variants |
| Cluster (city/region) | Features with network effects | Much larger sample sizes needed |
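Whatever the unit, assignment is usually implemented as deterministic salted hashing so the same ID always lands in the same variant. A minimal sketch (the function and salt scheme are illustrative, not any specific company's implementation):

```python
import hashlib

def assign_variant(unit_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically bucket a unit: the same ID always gets the same variant."""
    # Salt with the experiment name so bucketing is independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "treatment" if bucket < treatment_pct else "control"

print(assign_variant("user_12345", "checkout_redesign_v2"))
```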
⚠ WARNING

Sample Ratio Mismatch (SRM)

Before analyzing results, verify the actual control/treatment split matches the intended split using a chi-square test. If you assigned 50/50 but got 48/52, something is wrong with the randomization or logging — results are invalid. Common causes: client-side filtering (mobile users exit before seeing treatment), bot traffic handled differently, logging bugs that drop some impressions.
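A minimal version of this check in Python (helper name is illustrative):

```python
from scipy.stats import chisquare

def srm_check(control_n: int, treatment_n: int, expected_split=(0.5, 0.5)):
    """Chi-square goodness-of-fit test of observed counts vs. the intended split."""
    total = control_n + treatment_n
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p = chisquare([control_n, treatment_n], f_exp=expected)
    return p  # p < 0.05 -> sample ratio mismatch; investigate before analyzing

# Intended 50/50, observed 48/52 on 100k users -> clear SRM
print(f"p = {srm_check(48_000, 52_000):.2e}")
```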

Step 5 — Statistical Analysis

1. Check SRM first: Run a chi-square test on user counts. If p < 0.05, stop and investigate the imbalance before interpreting results.

2. Compute the test statistic: For proportions (CTR), use a two-proportion z-test; for means (revenue), Welch's t-test; for counts, a Poisson rate test. (Steps 2-4 are sketched in code after this list.)

3. Compute the p-value and confidence interval: Report the CI, not just the p-value. The CI gives the range of plausible effect sizes, which is essential for judging practical significance.

4. Apply a multiple-comparison correction: If you test k metrics, use Bonferroni (α/k) or Benjamini-Hochberg FDR correction. Guardrail metrics count toward k as well.

5. Check practical significance: Is the effect size large enough to justify the engineering and product cost of shipping? A 0.001% lift is rarely worth shipping.
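A sketch of steps 2-4 for a proportion metric, with a Bonferroni-adjusted confidence interval (the counts below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conv_c, n_c, conv_t, n_t, n_metrics=1, alpha=0.05):
    """Two-proportion z-test plus a Bonferroni-adjusted CI for the absolute lift."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Pooled SE under the null of equal rates, for the test statistic
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pool
    p_value = 2 * (1 - norm.cdf(abs(z)))
    # Unpooled SE for the CI, widened by the Bonferroni correction alpha/k
    se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = norm.ppf(1 - (alpha / n_metrics) / 2)
    return z, p_value, (p_t - p_c - z_crit * se, p_t - p_c + z_crit * se)

# Illustrative counts at the sample size computed earlier, with k = 3 metrics
z, p, ci = two_proportion_ztest(1_450, 14_100, 1_580, 14_100, n_metrics=3)
print(f"z = {z:.2f}, p = {p:.4f}, adjusted CI for lift: ({ci[0]:.4f}, {ci[1]:.4f})")
```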

Common A/B Testing Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Peeking / optional stopping | Appears significant but later regresses | Pre-commit to sample size; don't check daily |
| Novelty effect | Metrics spike and decay over 1-2 weeks | Run minimum 2 weeks; analyze week 2 separately |
| Network effects | Control contaminated by treatment | Cluster randomization by social graph or geography |
| Survivorship bias | Only analyzing engaged users | Include all users assigned to the experiment |
| Simpson's Paradox | Aggregate result opposite of subgroup results | Stratify analysis by major segments |
| Multiple metrics | Some metrics significant by chance | Bonferroni correction; pre-specify primary metric |
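The peeking pitfall is easy to demonstrate with a quick A/A simulation (parameters are made up; both arms share the same true rate, so every "significant" result is a false positive):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_users, n_looks = 1_000, 10_000, 20
false_positives = 0

for _ in range(n_sims):
    a = rng.binomial(1, 0.10, n_users)  # control, true rate 10%
    b = rng.binomial(1, 0.10, n_users)  # "treatment" with no real effect
    for n in np.linspace(n_users // n_looks, n_users, n_looks, dtype=int):
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if abs(a[:n].mean() - b[:n].mean()) / se > 1.96:
            false_positives += 1  # "significant" at this peek -> stop and ship
            break

print(f"false-positive rate with {n_looks} peeks: {false_positives / n_sims:.1%}")
# A single pre-committed analysis would give ~5%; peeking inflates it severalfold.
```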


Advanced: CUPED (Variance Reduction)

CUPED uses pre-experiment data to reduce the variance of the metric estimate, enabling shorter experiments.

Y_cuped = Y_post − θ × (X_pre − E[X_pre]),  where θ = Cov(Y_post, X_pre) / Var(X_pre)

and X_pre is the pre-experiment value of the same metric. Typical variance reduction: 20-50%. If the original experiment needs 14,000 users, CUPED might reduce this to 7,000-11,000, cutting the experiment runtime by up to half.

Requirement: you need pre-experiment measurements of the same metric. CUPED works best for engagement metrics (session time, revenue) where historical user behavior is predictive.
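A compact sketch of the adjustment on simulated data (all numbers are synthetic; in a real experiment θ is typically estimated on pooled data from both arms):

```python
import numpy as np

def cuped_adjust(y_post: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Y_cuped = Y_post - theta * (X_pre - mean(X_pre)),
    with theta = Cov(Y_post, X_pre) / Var(X_pre)."""
    theta = np.cov(y_post, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y_post - theta * (x_pre - x_pre.mean())

# Simulated engagement metric where pre-period behavior is predictive
rng = np.random.default_rng(42)
x_pre = rng.gamma(2.0, 5.0, 10_000)              # pre-experiment activity per user
y_post = 0.5 * x_pre + rng.normal(0, 4, 10_000)  # correlated in-experiment metric
y_cuped = cuped_adjust(y_post, x_pre)
print(f"variance reduction: {1 - y_cuped.var() / y_post.var():.0%}")  # ~40-45%
```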

When You Can't Randomize — Quasi-Experiments

Sometimes A/B tests are impossible: the feature launched to all users, a policy change affects everyone, or the analysis is retroactive. Quasi-experimental methods fill the gap:

- Difference-in-Differences (DiD): Compare treated vs. control groups across pre/post periods. Requires the "parallel trends" assumption: the pre-period trends of the two groups were similar. (A numeric sketch follows this list.)
- Regression Discontinuity (RDD): Treatment is assigned by a threshold (e.g., users with score ≥ 700 get the feature). Compare users just above and just below the threshold; they are virtually identical.
- Synthetic Control: Construct a weighted combination of control units that matches the treated unit's pre-period trajectory, then measure the divergence after treatment.
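A toy DiD calculation with made-up numbers, just to make the arithmetic concrete:

```python
import numpy as np

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """DiD estimate: (treated post-pre change) minus (control post-pre change).
    Only valid under the parallel-trends assumption."""
    return ((np.mean(treated_post) - np.mean(treated_pre))
            - (np.mean(control_post) - np.mean(control_pre)))

# Both groups trend upward; treatment adds ~2 on top of the shared trend
effect = diff_in_diff(treated_pre=[10.1, 9.8, 10.3],
                      treated_post=[13.9, 14.2, 14.1],
                      control_pre=[8.0, 8.2, 7.9],
                      control_post=[10.1, 9.9, 10.2])
print(f"DiD effect estimate: {effect:.2f}")  # ~1.97
```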
