A/B Testing & Experimentation at Scale
End-to-end A/B testing framework used at top tech companies — from experiment design and sample size calculation to statistical analysis, multiple comparisons, novelty effects, and causal inference when randomization isn't possible.
A/B Test Lifecycle — The 6-Step Framework
The Experiment Lifecycle
Every A/B test at a tech company follows the same 6-step lifecycle. Skipping any step is a common source of invalid results.
1. Hypothesis → 2. Power Analysis (sample size) → 3. Randomization design → 4. Instrumentation → 5. Statistical analysis → 6. Decision
Step 1 — Define What You're Testing
Primary Metric
The single metric that determines success. Choose one that's directly linked to business value and sensitive enough to detect the change. Avoid vanity metrics (page views) and composite metrics.
Guardrail Metrics
Metrics that must not regress. E.g., latency p99, crash rate, subscription cancellation. A positive primary metric doesn't override a guardrail regression.
Secondary Metrics
Supporting evidence. Track these to understand the mechanism, not to make the shipping decision; keeping them explicitly secondary limits the multiple-comparisons burden on the primary metric.
Hypothesis Statement
State it as a sentence: "Treatment X will increase primary metric Y by at least Z% among population P." The Z% is your Minimum Detectable Effect (MDE) — the smallest effect that matters to the business.
Step 2 — Sample Size Calculation
Calculate the required sample size BEFORE running the experiment. Running until p < 0.05 is p-hacking.

Formula: n = 2 × (z_α/2 + z_β)² × σ² / δ²

Parameters: α = 0.05 (significance level, z_α/2 ≈ 1.96), β = 0.20 (20% false-negative rate → 80% power, z_β ≈ 0.84), σ = metric standard deviation, δ = MDE (the minimum effect you care about detecting).

Example: conversion-rate baseline 10%, MDE = 1% absolute, σ = √(0.1 × 0.9) = 0.3 → n ≈ 14,100 per variant → ≈ 28,200 total users.

CUPED variance reduction: if you have pre-experiment metric data, CUPED can cut effective variance by 20-50%, proportionally shortening experiment duration (see the dedicated section below).
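As a sketch, the calculation is a few lines of Python with scipy (the function name sample_size_per_variant is illustrative):

```python
import math
from scipy import stats

def sample_size_per_variant(baseline_rate: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2 per variant."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)         # 1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)                  # 0.84 for 80% power
    sigma_sq = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma_sq / mde_abs ** 2)

# Example from above: 10% baseline conversion, 1% absolute MDE
print(sample_size_per_variant(0.10, 0.01))  # 14128 per variant (~14,100)
```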
Unit of Randomization Trade-offs
| Unit | Best For | Risk |
|---|---|---|
| User ID | Feature changes and UI tests | Long-term carry-over between experiments |
| Session ID | Session-level features (search / ads per session) | Same user can see both variants across sessions |
| Request ID | Infrastructure A/B tests | Very noisy — variance is high |
| Device ID | Mobile experiments | Cross-device users appear in both variants |
| Cluster (city/region) | Features with network effects | Much larger sample sizes needed |
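In practice, user-level randomization is usually implemented as deterministic hashing so the same user always sees the same variant. A minimal sketch, assuming a salted-hash scheme (the function name and salt format are illustrative, not a standard API):

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministic bucketing: the same unit always gets the same variant.

    Salting the hash with the experiment name decorrelates assignments
    across experiments, which limits carry-over between them.
    """
    digest = hashlib.md5(f"{experiment}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user_42", "checkout_redesign_v2"))  # stable across calls
```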
Sample Ratio Mismatch (SRM)
Before analyzing results, verify the actual control/treatment split matches the intended split using a chi-square test. If you assigned 50/50 but got 48/52, something is wrong with the randomization or logging — results are invalid. Common causes: client-side filtering (mobile users exit before seeing treatment), bot traffic handled differently, logging bugs that drop some impressions.
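A minimal SRM check with scipy, using the 48/52 split from the example above:

```python
from scipy.stats import chisquare

observed = [48_000, 52_000]         # actual control/treatment counts
expected = [sum(observed) / 2] * 2  # intended 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")
if p_value < 0.05:
    print("SRM detected: investigate randomization/logging before analysis")
```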
Step 5 — Statistical Analysis
Check SRM first
Run a chi-square test on assignment counts. If p < 0.05, stop and investigate the imbalance before interpreting any results.
Compute test statistic
For proportions (CTR): two-proportion z-test. For means (revenue): Welch's t-test. For counts: Poisson rate test.
Compute p-value and confidence interval
Report CI, not just p-value. CI gives the range of plausible effect sizes, essential for practical significance judgment.
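A sketch of the proportions case that returns the z statistic, p-value, and confidence interval together (the function name and traffic numbers are illustrative):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int,
                         alpha: float = 0.05):
    """Two-sided z-test plus confidence interval for the difference p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under H0 for the test statistic
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - norm.cdf(abs(z)))
    # Unpooled SE for the CI on the difference
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half = norm.ppf(1 - alpha / 2) * se
    return z, p_value, (p_b - p_a - half, p_b - p_a + half)

z, p, ci = two_proportion_ztest(1_400, 14_000, 1_550, 14_000)
print(f"z = {z:.2f}, p = {p:.4f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```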
Apply multiple comparison correction
If testing k metrics, use Bonferroni (α/k) or Benjamini-Hochberg FDR correction. Guardrail metrics also count.
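A self-contained sketch of both corrections on made-up p-values; in practice, statsmodels' multipletests implements these as well:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask: which hypotheses are rejected at FDR level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest k with p_(k) <= (k/m) * alpha; reject everything up to k
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

p_vals = [0.003, 0.012, 0.030, 0.040, 0.250]    # illustrative only
print(benjamini_hochberg(p_vals))               # [T T T T F] under BH
print(np.asarray(p_vals) < 0.05 / len(p_vals))  # [T F F F F] under Bonferroni
```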
Check practical significance
Is the effect size large enough to justify the engineering and product cost of shipping? A 0.001% lift is rarely worth shipping.
Common A/B Testing Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Peeking / optional stopping | Appears significant but later regresses | Pre-commit to sample size; don't check daily |
| Novelty effect | Metrics spike and decay over 1-2 weeks | Run minimum 2 weeks; analyze week 2 separately |
| Network effects | Control contaminated by treatment | Cluster randomization by social graph or geography |
| Survivorship bias | Only analyzing engaged users | Include all users assigned to experiment |
| Simpson's Paradox | Aggregate result opposite of subgroup results | Stratify analysis by major segments |
| Multiple metrics | Some metrics significant by chance | Bonferroni correction; pre-specify primary metric |
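The peeking pitfall is easy to demonstrate with an A/A simulation: both arms draw from the same distribution, so any "significant" result is a false positive. The simulation parameters below are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, per_day, days = 1000, 500, 14
peek_fp = fixed_fp = 0

for _ in range(n_sims):
    # A/A test: no true effect exists between the two arms
    a = rng.normal(size=(days, per_day))
    b = rng.normal(size=(days, per_day))
    hit = False
    for d in range(1, days + 1):
        xa, xb = a[:d].ravel(), b[:d].ravel()
        se = np.sqrt(xa.var(ddof=1) / xa.size + xb.var(ddof=1) / xb.size)
        p = 2 * (1 - norm.cdf(abs((xb.mean() - xa.mean()) / se)))
        hit = hit or p < 0.05
    peek_fp += hit        # "significant on any day" = peeking decision rule
    fixed_fp += p < 0.05  # decision only at the pre-committed horizon

print(f"peeking daily: {peek_fp / n_sims:.0%} false positives")   # well above 5%
print(f"fixed horizon: {fixed_fp / n_sims:.0%} false positives")  # ~5%, as designed
```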
Advanced: CUPED (Variance Reduction)
CUPED uses pre-experiment data to reduce the variance of the metric estimate, enabling shorter experiments.

Y_cuped = Y_post − θ × (X_pre − E[X_pre]), where θ = Cov(Y_post, X_pre) / Var(X_pre) and X_pre is the pre-experiment value of the same metric.

Typical variance reduction: 20-50%. If the original experiment needs 14,000 users, CUPED might reduce this to 7,000-11,000, cutting experiment runtime by up to half.

Requirement: you need pre-experiment measurements of the same metric. CUPED works best for engagement metrics (session time, revenue) where historical user behavior is predictive.
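A sketch of the adjustment on simulated data (the data-generating parameters are invented; real-world variance reduction depends on how predictive X_pre actually is):

```python
import numpy as np

def cuped_adjust(y_post: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Y_cuped = Y_post - theta * (X_pre - mean(X_pre)),
    with theta = Cov(Y_post, X_pre) / Var(X_pre) estimated from the data."""
    theta = np.cov(y_post, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y_post - theta * (x_pre - x_pre.mean())

# Simulated example: pre-period behavior partly predicts the post-period metric
rng = np.random.default_rng(0)
x_pre = rng.normal(10.0, 3.0, size=10_000)
y_post = 0.8 * x_pre + rng.normal(0.0, 2.0, size=10_000)

y_cuped = cuped_adjust(y_post, x_pre)
print(f"variance reduction: {1 - y_cuped.var() / y_post.var():.0%}")
```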
When You Can't Randomize — Quasi-Experiments
Sometimes A/B tests are impossible: the feature launched to all users, a policy change affected everyone, or the analysis is retroactive. The main quasi-experimental methods:

- Difference-in-Differences (DiD): compare treated vs. control groups across pre/post periods. Requires the "parallel trends" assumption: pre-period trends were similar.
- Regression Discontinuity (RDD): treatment is assigned by a threshold (e.g., users with score ≥ 700 get the feature). Compare users just above and below the threshold; they are virtually identical.
- Synthetic Control: construct a weighted combination of control units that matches the treated unit's pre-period trajectory, then measure divergence after treatment.
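A sketch of the DiD point estimate on made-up numbers; real analyses typically estimate it via regression with group and period terms to get standard errors:

```python
import numpy as np

def did_estimate(y_treat_pre, y_treat_post, y_ctrl_pre, y_ctrl_post) -> float:
    """(treated post - treated pre) - (control post - control pre).
    Only valid if pre-period trends were parallel."""
    return (np.mean(y_treat_post) - np.mean(y_treat_pre)) \
         - (np.mean(y_ctrl_post) - np.mean(y_ctrl_pre))

# Both groups trend upward; treatment adds extra lift on top of the shared trend
effect = did_estimate(y_treat_pre=[10, 11, 9], y_treat_post=[15, 16, 14],
                      y_ctrl_pre=[10, 10, 11], y_ctrl_post=[13, 12, 13])
print(round(effect, 2))  # 2.67
```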