A/B Testing & Experimentation at Scale
End-to-end A/B testing framework used at top tech companies — from experiment design and sample size calculation to statistical analysis, multiple comparisons, novelty effects, and causal inference when randomization isn't possible.
A/B Test Lifecycle — The 6-Step Framework
The Experiment Lifecycle
Every A/B test at a tech company follows the same 6-step lifecycle. Skipping any step is a common source of invalid results.
1. Hypothesis → 2. Power Analysis (sample size) → 3. Randomization design → 4. Instrumentation → 5. Statistical analysis → 6. Decision
Step 1 — Define What You're Testing
Primary Metric
The single metric that determines success. Choose one that's directly linked to business value and sensitive enough to detect the change. Avoid vanity metrics (page views) and composite metrics.
Guardrail Metrics
Metrics that must not regress. E.g., latency p99, crash rate, subscription cancellation. A positive primary metric doesn't override a guardrail regression.
Secondary Metrics
Supporting evidence. Track these to understand the mechanism, not to make the shipping decision; keeping them explicitly secondary limits the multiple-comparisons burden on the primary metric.
Hypothesis Statement
State it as a sentence: "Treatment X will increase primary metric Y by at least Z% among population P." The Z% is your Minimum Detectable Effect (MDE) — the smallest effect that matters to the business.
Step 2 — Sample Size Calculation
Calculate the required sample size BEFORE running the experiment. Running until p < 0.05 is p-hacking.

Formula: n = 2 × (z_α/2 + z_β)² × σ² / δ²

Parameters: α = 0.05 (significance level, z_α/2 ≈ 1.96), β = 0.20 (20% false-negative rate → 80% power, z_β ≈ 0.84), σ = metric standard deviation, δ = MDE (the minimum effect you care about detecting).

Example: conversion-rate baseline 10%, MDE = 1% absolute, σ = √(0.1 × 0.9) = 0.3 → n ≈ 14,100 per variant → ≈ 28,200 total users.

CUPED variance reduction: if you have pre-experiment metric data, CUPED can cut effective variance by 20-50%, proportionally shortening experiment duration (see the dedicated section below).
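As a sketch, the calculation is a few lines of Python with scipy (the function name sample_size_per_variant is illustrative):

```python
import math
from scipy import stats

def sample_size_per_variant(baseline_rate: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2 per variant."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)         # 1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)                  # 0.84 for 80% power
    sigma_sq = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma_sq / mde_abs ** 2)

# Example from above: 10% baseline conversion, 1% absolute MDE
print(sample_size_per_variant(0.10, 0.01))  # 14128 per variant (~14,100)
```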
Unit of Randomization Trade-offs
| Unit | Best For | Risk |
|---|---|---|
| User ID | Feature changes and UI tests | Long-term carry-over between experiments |
| Session ID | Session-level features (search / ads per session) | Same user can see both variants across sessions |
| Request ID | Infrastructure A/B tests | Very noisy — variance is high |
| Device ID | Mobile experiments | Cross-device users appear in both variants |
| Cluster (city/region) | Features with network effects | Much larger sample sizes needed |
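In practice, user-level randomization is usually implemented as deterministic hashing so the same user always sees the same variant. A minimal sketch, assuming a salted-hash scheme (the function name and salt format are illustrative, not a standard API):

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministic bucketing: the same unit always gets the same variant.

    Salting the hash with the experiment name decorrelates assignments
    across experiments, which limits carry-over between them.
    """
    digest = hashlib.md5(f"{experiment}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user_42", "checkout_redesign_v2"))  # stable across calls
```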
Sample Ratio Mismatch (SRM)
Before analyzing results, verify the actual control/treatment split matches the intended split using a chi-square test. If you assigned 50/50 but got 48/52, something is wrong with the randomization or logging — results are invalid. Common causes: client-side filtering (mobile users exit before seeing treatment), bot traffic handled differently, logging bugs that drop some impressions.
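A minimal SRM check with scipy, using the 48/52 split from the example above:

```python
from scipy.stats import chisquare

observed = [48_000, 52_000]         # actual control/treatment counts
expected = [sum(observed) / 2] * 2  # intended 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")
if p_value < 0.05:
    print("SRM detected: investigate randomization/logging before analysis")
```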
Step 5 — Statistical Analysis
Check SRM first
Run a chi-square test on assignment counts. If p < 0.05, stop and investigate the imbalance before interpreting any results.
Compute test statistic
For proportions (CTR): two-proportion z-test. For means (revenue): Welch's t-test. For counts: Poisson rate test.
Compute p-value and confidence interval
Report CI, not just p-value. CI gives the range of plausible effect sizes, essential for practical significance judgment.
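A sketch of the proportions case that returns the z statistic, p-value, and confidence interval together (the function name and traffic numbers are illustrative):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int,
                         alpha: float = 0.05):
    """Two-sided z-test plus confidence interval for the difference p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under H0 for the test statistic
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - norm.cdf(abs(z)))
    # Unpooled SE for the CI on the difference
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half = norm.ppf(1 - alpha / 2) * se
    return z, p_value, (p_b - p_a - half, p_b - p_a + half)

z, p, ci = two_proportion_ztest(1_400, 14_000, 1_550, 14_000)
print(f"z = {z:.2f}, p = {p:.4f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```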
Apply multiple comparison correction
If testing k metrics, use Bonferroni (α/k) or Benjamini-Hochberg FDR correction. Guardrail metrics also count.
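A self-contained sketch of both corrections on made-up p-values; in practice, statsmodels' multipletests implements these as well:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask: which hypotheses are rejected at FDR level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest k with p_(k) <= (k/m) * alpha; reject everything up to k
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

p_vals = [0.003, 0.012, 0.030, 0.040, 0.250]    # illustrative only
print(benjamini_hochberg(p_vals))               # [T T T T F] under BH
print(np.asarray(p_vals) < 0.05 / len(p_vals))  # [T F F F F] under Bonferroni
```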
Check practical significance
Is the effect size large enough to justify the engineering and product cost of shipping? A 0.001% lift is rarely worth shipping.
Common A/B Testing Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Peeking / optional stopping | Appears significant but later regresses | Pre-commit to sample size; don't check daily |
| Novelty effect | Metrics spike and decay over 1-2 weeks | Run minimum 2 weeks; analyze week 2 separately |
| Network effects | Control contaminated by treatment | Cluster randomization by social graph or geography |
| Survivorship bias | Only analyzing engaged users | Include all users assigned to experiment |
| Simpson's Paradox | Aggregate result opposite of subgroup results | Stratify analysis by major segments |
| Multiple metrics | Some metrics significant by chance | Bonferroni correction; pre-specify primary metric |
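The peeking pitfall is easy to demonstrate with an A/A simulation: both arms draw from the same distribution, so any "significant" result is a false positive. The simulation parameters below are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, per_day, days = 1000, 500, 14
peek_fp = fixed_fp = 0

for _ in range(n_sims):
    # A/A test: no true effect exists between the two arms
    a = rng.normal(size=(days, per_day))
    b = rng.normal(size=(days, per_day))
    hit = False
    for d in range(1, days + 1):
        xa, xb = a[:d].ravel(), b[:d].ravel()
        se = np.sqrt(xa.var(ddof=1) / xa.size + xb.var(ddof=1) / xb.size)
        p = 2 * (1 - norm.cdf(abs((xb.mean() - xa.mean()) / se)))
        hit = hit or p < 0.05
    peek_fp += hit        # "significant on any day" = peeking decision rule
    fixed_fp += p < 0.05  # decision only at the pre-committed horizon

print(f"peeking daily: {peek_fp / n_sims:.0%} false positives")   # well above 5%
print(f"fixed horizon: {fixed_fp / n_sims:.0%} false positives")  # ~5%, as designed
```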
Advanced: CUPED (Variance Reduction)
CUPED uses pre-experiment data to reduce the variance of the metric estimate, enabling shorter experiments.

Y_cuped = Y_post − θ × (X_pre − E[X_pre]), where θ = Cov(Y_post, X_pre) / Var(X_pre) and X_pre is the pre-experiment value of the same metric.

Typical variance reduction: 20-50%. If the original experiment needs 14,000 users, CUPED might reduce this to 7,000-11,000, cutting experiment runtime by up to half.

Requirement: you need pre-experiment measurements of the same metric. CUPED works best for engagement metrics (session time, revenue) where historical user behavior is predictive.
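A sketch of the adjustment on simulated data (the data-generating parameters are invented; real-world variance reduction depends on how predictive X_pre actually is):

```python
import numpy as np

def cuped_adjust(y_post: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Y_cuped = Y_post - theta * (X_pre - mean(X_pre)),
    with theta = Cov(Y_post, X_pre) / Var(X_pre) estimated from the data."""
    theta = np.cov(y_post, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y_post - theta * (x_pre - x_pre.mean())

# Simulated example: pre-period behavior partly predicts the post-period metric
rng = np.random.default_rng(0)
x_pre = rng.normal(10.0, 3.0, size=10_000)
y_post = 0.8 * x_pre + rng.normal(0.0, 2.0, size=10_000)

y_cuped = cuped_adjust(y_post, x_pre)
print(f"variance reduction: {1 - y_cuped.var() / y_post.var():.0%}")
```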
When You Can't Randomize — Quasi-Experiments
Sometimes A/B tests are impossible: the feature launched to all users, a policy change affected everyone, or the analysis is retroactive. The main quasi-experimental methods:

- Difference-in-Differences (DiD): compare treated vs. control groups across pre/post periods. Requires the "parallel trends" assumption: pre-period trends were similar.
- Regression Discontinuity (RDD): treatment is assigned by a threshold (e.g., users with score ≥ 700 get the feature). Compare users just above and below the threshold; they are virtually identical.
- Synthetic Control: construct a weighted combination of control units that matches the treated unit's pre-period trajectory, then measure divergence after treatment.
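A sketch of the DiD point estimate on made-up numbers; real analyses typically estimate it via regression with group and period terms to get standard errors:

```python
import numpy as np

def did_estimate(y_treat_pre, y_treat_post, y_ctrl_pre, y_ctrl_post) -> float:
    """(treated post - treated pre) - (control post - control pre).
    Only valid if pre-period trends were parallel."""
    return (np.mean(y_treat_post) - np.mean(y_treat_pre)) \
         - (np.mean(y_ctrl_post) - np.mean(y_ctrl_pre))

# Both groups trend upward; treatment adds extra lift on top of the shared trend
effect = did_estimate(y_treat_pre=[10, 11, 9], y_treat_post=[15, 16, 14],
                      y_ctrl_pre=[10, 10, 11], y_ctrl_post=[13, 12, 13])
print(round(effect, 2))  # 2.67
```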