The math every ML engineer and data scientist must know to design experiments that actually detect real effects. Covers Type I/II errors, statistical power, the sample size formula and why MDE matters quadratically, CUPED variance reduction (used at Netflix, Booking.com, Airbnb), multiple comparisons corrections, sequential testing for early stopping, and the most common experiment design mistakes.



The Four Quantities That Govern Every Experiment

Every hypothesis test involves four quantities in a fixed mathematical relationship. Fixing any three determines the fourth.

1. Significance level α (Type I error rate): The probability of rejecting H₀ when it is true (a false positive). Standard: α = 0.05. If you run 100 A/A tests (no real effect), about 5 will appear 'significant' on average. This is the cost of exploration.

2. Statistical power (1-β): The probability of detecting a real effect when it exists. Standard: 0.80 (80% power). This means that if the true effect is exactly at the MDE, you have an 80% chance of detecting it; the other 20% of the time you'll miss it (Type II error, false negative). Higher power means fewer missed effects, but it requires larger samples.

3. Effect size (Minimum Detectable Effect, MDE): The smallest effect you care about detecting reliably. This is a business decision, not a statistical one: what lift is meaningful enough to justify shipping the change? MDE is not the same as the expected effect; it is the threshold below which you're indifferent. A 0.1% lift on $10B revenue is worth roughly $10M and is meaningful; on $100K revenue it is worth roughly $100 and is negligible.

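4. Sample size n (per group): Derived from the other three: n = (Z_α/2 + Z_β)² × (σ₁² + σ₂²) / δ², where σ₁² and σ₂² are the outcome variances in the control and treatment groups and δ is the MDE expressed as an absolute difference in means.

For proportions (conversion rates): n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)², where p₁ is the baseline rate and p₂ is the rate under the smallest lift you want to detect.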

With α=0.05 (two-sided), power=80%: Z_α/2 = 1.96, Z_β = 0.842. The sum squared ≈ 7.85.
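The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production calculator: the function names `z_multiplier` and `sample_size_proportions` are made up for this example, and the 10% baseline with an 11% target is an assumed scenario rather than a figure from this guide. It uses only the standard library's `statistics.NormalDist` for the normal quantiles.

```python
import math
from statistics import NormalDist


def z_multiplier(alpha=0.05, power=0.80):
    """Return (Z_alpha/2 + Z_beta)^2 for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for 80% power
    return (z_alpha + z_beta) ** 2


def sample_size_proportions(p1, p2, alpha=0.05, power=0.80):
    """Users needed per group to detect a change from rate p1 to rate p2."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = z_multiplier(alpha, power) * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


# Assumed scenario: 10% baseline conversion, detect an absolute lift to 11%.
multiplier = z_multiplier()
n_per_group = sample_size_proportions(0.10, 0.11)
print(f"multiplier = {multiplier:.2f}, n per group = {n_per_group}")
```

Under these assumptions the multiplier comes out near 7.85 and the requirement is on the order of 14,700 users per group, so roughly twice that across both arms.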

The MDE quadrupling rule (most important practical fact): Because δ appears squared in the denominator, halving the MDE quadruples the required sample size. A test designed to detect a 1% lift needs ~4× as many users as a test designed to detect a 2% lift. This is why teams argue about MDE — it determines how long your test runs.
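The quadrupling rule can be checked numerically. The sketch below uses the continuous-metric formula with equal variances in both groups (an assumption made here so that the ratio is exact); the helper name `sample_size_means` and the values σ = 1.0, δ = 0.02 and 0.01 are illustrative only. For proportions the ratio is close to 4 rather than exactly 4, because the variance term p₂(1-p₂) shifts slightly with the target rate.

```python
from statistics import NormalDist


def sample_size_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a difference in means, assuming equal variance sigma^2."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # sigma_1^2 + sigma_2^2 collapses to 2 * sigma^2 under the equal-variance assumption.
    return z_sum ** 2 * 2 * sigma ** 2 / delta ** 2


n_large_mde = sample_size_means(sigma=1.0, delta=0.02)
n_small_mde = sample_size_means(sigma=1.0, delta=0.01)
ratio = n_small_mde / n_large_mde
print(f"halving the MDE multiplies n by {ratio:.1f}")
```

With fixed daily traffic, the same factor applies to test duration: a test that would take two weeks at the larger MDE would take about eight weeks at half that MDE.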
