Hypothesis Testing for Data Scientists: p-values, Type I/II, Multiple Testing
The complete hypothesis testing framework for machine learning interviews — p-value interpretation, Type I/II error trade-offs, when to use z-tests vs t-tests vs chi-square, Bonferroni and BH-FDR corrections, and effect size. Covers the most common interview traps that candidates fall into without noticing.
The Null Hypothesis Framework: What You're Actually Testing
Every hypothesis test answers one question: is the data surprising enough, assuming nothing interesting is happening?
The framework has four components:
H0 (Null hypothesis): The default assumption — no effect, no difference, no relationship. You never "prove" H0; you either reject it or fail to reject it. "The new recommendation algorithm has no effect on CTR" is H0.
H1 (Alternative hypothesis): What you would conclude if H0 is rejected. Can be directional (one-tailed: "CTR increases") or non-directional (two-tailed: "CTR changes"). Always specify H1 BEFORE seeing data.
Test statistic: A number computed from your sample that summarizes how far the data falls from what H0 predicts. For a mean comparison: t = (x1_bar - x2_bar) / SE. The further from zero, the more inconsistent with H0.
p-value: The probability of observing a test statistic as extreme as (or more extreme than) the one computed, ASSUMING H0 is true. If p = 0.03, it means: if H0 were true, you would see data this extreme only 3% of the time by chance.
The decision rule: Reject H0 if p < alpha (your pre-specified significance threshold, typically 0.05). This means you accept a 5% chance of wrongly rejecting a true null hypothesis.
"Fail to reject H0" means: not that H0 is true — absence of evidence is not evidence of absence. It means the data is consistent with H0 given your sample size and effect size. An underpowered study will always fail to reject H0 regardless of whether an effect exists.
The #1 Interview Trap: What a p-value Is NOT
Candidates fail this question constantly. Interviewers ask it specifically because 80%+ of data scientists get it wrong.
p-value IS: P(data this extreme | H0 is true) — the probability of the observed data (or more extreme) under the null hypothesis.
p-value IS NOT:
- The probability that H0 is true given your data. That is the Bayesian posterior P(H0|data) — requires a prior and is a completely different quantity.
- The probability that your result is due to chance. This confuses the direction of conditioning.
- The probability that the effect is real or meaningful. p < 0.05 with n=10,000,000 can detect a 0.001% difference that nobody cares about.
- 1 minus the probability that H1 is true.
The correct statement: "If the null hypothesis were true, we would see a result this extreme about p*100% of the time by chance. Since p < 0.05, this is surprising enough that we reject H0 at the 5% significance level."
The effect size trap: p < 0.05 says nothing about whether the effect is large or practically important. With around 10 million users in an A/B test, you will detect statistically significant differences that are too small to matter to the business. Always report effect size alongside p-value.
Hypothesis Testing Decision Tree
Choosing the Right Test — Quick Reference
| Scenario | Test | Null Hypothesis | When to Use |
|---|---|---|---|
| Sample mean vs known mu | One-sample t-test | H0: mu = mu0 | sigma unknown (classic case: n < 30); safe at any n because t converges to z |
| Two independent group means | Welch's t-test | H0: mu1 = mu2 | Default for A/B on continuous metrics; does not assume equal variance |
| Two proportions (CTR, conversion) | Two-proportion z-test | H0: p1 = p2 | Large n (n*p >= 5 and n*(1-p) >= 5 in both groups); standard for A/B on binary outcomes |
| Paired measurements (before/after same users) | Paired t-test | H0: mu_diff = 0 | Same users measured twice; reduces variance from individual differences |
| Categorical distribution vs expected | Chi-square goodness of fit | H0: dist matches expected | Testing if counts follow a theoretical distribution; also used for SRM check |
| Association between two categorical variables | Chi-square independence | H0: no association | Contingency tables; A/B on segmented categorical outcomes |
| More than 2 group means (omnibus) | One-way ANOVA | H0: all means equal | Testing multiple variants; follow with Tukey HSD for pairwise comparisons |
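The scipy reference later in this guide covers most rows of this table; the last row (multi-variant omnibus testing) is sketched below with invented data. The pairwise follow-up uses scipy.stats.tukey_hsd, which is available in recent SciPy versions:
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Three variants of the same feature, one continuous metric; data invented for illustration.
variant_a = rng.normal(5.0, 1.0, 500)
variant_b = rng.normal(5.0, 1.0, 500)
variant_c = rng.normal(5.3, 1.0, 500)

# Omnibus test: H0 = all group means are equal.
f_stat, p_anova = stats.f_oneway(variant_a, variant_b, variant_c)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

# Only if the omnibus test rejects, run pairwise comparisons that control for multiplicity.
if p_anova < 0.05:
    tukey = stats.tukey_hsd(variant_a, variant_b, variant_c)
    print(tukey.pvalue)  # matrix of pairwise p-values, adjusted for the three comparisons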
Type I and Type II Errors: The Trade-off You Control
Every hypothesis test risks one of two errors:
Type I Error (alpha, false positive): Rejecting H0 when it is actually true. "Declaring a winner when there is no real difference." Controlled directly by your choice of alpha. At alpha = 0.05, you accept a 5% Type I error rate.
Type II Error (beta, false negative): Failing to reject H0 when H1 is true. "Missing a real effect." Statistical power = 1 - beta = probability of correctly detecting a true effect when one exists. Standard target: power = 0.80.
The trade-off: For fixed sample size, decreasing alpha (stricter threshold, fewer false positives) directly increases beta (more missed real effects). You cannot simultaneously reduce both without increasing n.
What controls each:
- alpha: You set this before the experiment. Determines how much false positive risk you accept.
- beta: Determined by n, sigma, and the true effect delta. To reduce beta, you need larger n, smaller sigma (via CUPED or better targeting), or a larger true effect.
Practical guidance: Use alpha = 0.05 as the default. Use alpha = 0.01 when false positives are very costly (ranking changes, payments, medical contexts). Use alpha = 0.10 for early-stage exploration with low launch costs. Never change alpha after seeing data.
The confusion matrix framing:
| Decision | H0 true | H1 true (real effect) |
|---|---|---|
| Reject H0 | Type I error (alpha) | Correct detection (power = 1 - beta) |
| Fail to reject H0 | Correct | Type II error (beta) |
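A minimal sketch of the alpha–beta trade-off using the normal-approximation power formula, assuming a standardized true effect of 0.05 and 5,000 users per group (both numbers invented for illustration):
import numpy as np
from scipy import stats

def power_two_sample(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z-test for a standardized effect."""
    se = np.sqrt(2 / n_per_group)                     # SE of the standardized mean difference
    z_crit = stats.norm.ppf(1 - alpha / 2)            # stricter alpha -> larger critical value
    return stats.norm.cdf(effect_size / se - z_crit)  # upper-tail power; the lower tail is negligible

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}: power = {power_two_sample(0.05, 5000, alpha):.2f}")
# Same n, same true effect: tightening alpha from 0.10 to 0.01 drops power, i.e. raises beta.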
Multiple Comparisons: Where Most Analyses Go Wrong
If you test k independent hypotheses, each at alpha = 0.05, the probability of at least one false positive is:
FWER = 1 - (1 - alpha)^k
For k = 10: FWER = 1 - 0.95^10 = 40%. You will almost certainly find a "significant" result even if all nulls are true.
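A few lines of plain arithmetic make the inflation concrete:
alpha = 0.05
for k in (1, 5, 10, 15, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"k={k:2d} independent tests: P(at least one false positive) = {fwer:.0%}")
# k=10 gives ~40% and k=15 gives ~54%, matching the scenarios below.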
Three scenarios where this bites you in practice:
- Testing 15 metrics in an experiment at alpha = 0.05 each: 54% chance of at least one spurious significant metric.
- Checking results at days 3, 5, 7, 10 of an experiment (4 peeks): effective alpha inflates from 0.05 to approximately 0.14 (simulated in the sketch after this list).
- Post-hoc subgroup analysis: running across 20 countries to find the "surprisingly significant" submarket.
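The peeking scenario can be checked empirically. A sketch of an A/A simulation with four interim looks (sample sizes and simulation count are arbitrary choices):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_look, n_looks = 2000, 1000, 4
false_positives = 0
for _ in range(n_sims):
    # Both arms drawn from the same distribution, so H0 is true by construction.
    a = rng.normal(0.0, 1.0, n_per_look * n_looks)
    b = rng.normal(0.0, 1.0, n_per_look * n_looks)
    for look in range(1, n_looks + 1):
        n = look * n_per_look
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:
            false_positives += 1
            break  # in practice the experiment would be stopped and "shipped" here
print(f"False positive rate with {n_looks} peeks: {false_positives / n_sims:.3f}")
# A single look at the end would give ~0.05; peeking four times pushes it toward ~0.12-0.14.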
Bonferroni Correction (family-wise error rate control): Adjust the threshold: alpha_bonferroni = alpha / k. For 10 tests: 0.05/10 = 0.005. Valid under any dependence structure but conservative — it sacrifices power, especially when tests are positively correlated. Use when any false positive is very costly.
Benjamini-Hochberg FDR (1995): Controls the False Discovery Rate — expected fraction of rejected nulls that are false positives. Less conservative than Bonferroni. Procedure: Sort p-values p1 <= p2 <= ... <= pm. Find largest k such that pk <= (k/m) * q, where q = desired FDR (typically 0.10). Reject all hypotheses 1 through k. BH-FDR is appropriate for exploratory multi-metric analysis. Bonferroni is appropriate when a single specific metric must be significant for a ship decision.
Primary metric doctrine: Pre-specify ONE primary metric before seeing data. Multiple metrics are exploratory secondary analysis. The primary metric drives ship/no-ship. This eliminates the multiple comparisons problem for the decision.
Hypothesis Tests in Python — scipy Reference
import numpy as np
from scipy import stats
# Two-sample Welch's t-test (default for A/B on continuous metrics)
control = np.array([4.2, 3.8, 5.1, 4.7, 3.9, 4.5, 5.0, 4.3])
treatment = np.array([5.1, 4.9, 5.8, 5.3, 5.0, 5.5, 4.8, 5.2])
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
# Two-proportion z-test (CTR, conversion rates)
def two_proportion_ztest(n1, x1, n2, x2):
    """Returns (z_stat, p_value, 95% CI) for two-tailed test."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    z = (p1 - p2) / se
    p_val = 2 * (1 - stats.norm.cdf(abs(z)))
    ci_95 = (p1 - p2) + np.array([-1, 1]) * 1.96 * np.sqrt(
        p1*(1-p1)/n1 + p2*(1-p2)/n2
    )
    return z, p_val, ci_95
z, p, ci = two_proportion_ztest(10000, 310, 10000, 350)
print(f"z={z:.3f}, p={p:.4f}, 95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
# Control ~3.1%, Treatment ~3.5%: is the 0.4pp lift significant?
# Chi-square test (SRM detection, categorical)
observed = np.array([4823, 5177]) # actual 50/50 assignment logged
expected = np.array([5000, 5000]) # intended 50/50 split
chi2, p_srm = stats.chisquare(observed, f_exp=expected)
print(f"SRM check: chi2={chi2:.3f}, p={p_srm:.4f}")
if p_srm < 0.05:
    print("WARNING: Sample Ratio Mismatch — do not interpret results!")
# Paired t-test (same users measured before and after)
before = np.array([3.2, 2.8, 4.1, 3.7, 2.9])
after = np.array([3.8, 3.5, 4.6, 4.2, 3.4])
t_paired, p_paired = stats.ttest_rel(before, after)
print(f"Paired: t={t_paired:.3f}, p={p_paired:.4f}")
Multiple Testing Correction and Effect Size — Python Reference
import numpy as np
from scipy import stats
def bonferroni_correction(pvalues, alpha=0.05):
    """FWER control. Use when any false positive is costly."""
    k = len(pvalues)
    adjusted_alpha = alpha / k
    rejected = [p < adjusted_alpha for p in pvalues]
    return {
        "adjusted_alpha": adjusted_alpha,
        "rejected": rejected,
        "significant_count": sum(rejected),
    }
def benjamini_hochberg(pvalues, fdr_level=0.10):
    """BH FDR control. Less conservative; use for exploratory analysis."""
    m = len(pvalues)
    sorted_idx = np.argsort(pvalues)
    sorted_pvals = np.array(pvalues)[sorted_idx]
    thresholds = [(i + 1) / m * fdr_level for i in range(m)]
    rejected_mask = sorted_pvals <= thresholds
    if rejected_mask.any():
        cutoff = np.where(rejected_mask)[0].max()
        final_rejected = np.zeros(m, dtype=bool)
        final_rejected[:cutoff + 1] = True
    else:
        final_rejected = np.zeros(m, dtype=bool)
    result = np.zeros(m, dtype=bool)
    result[sorted_idx] = final_rejected
    return {"rejected": result.tolist(), "significant_count": int(result.sum())}
pvalues = [0.001, 0.04, 0.08, 0.02, 0.15, 0.003, 0.09, 0.12, 0.05, 0.21]
bonf = bonferroni_correction(pvalues, alpha=0.05)
bh = benjamini_hochberg(pvalues, fdr_level=0.10)
print(f"Naive (alpha=0.05): {sum(p < 0.05 for p in pvalues)} significant")
print(f"Bonferroni (alpha/10=0.005): {bonf['significant_count']} significant")
print(f"BH-FDR (q=0.10): {bh['significant_count']} significant")
def cohen_d(group1, group2):
    """Pooled-SD Cohen's d. Benchmarks: |d| ~ 0.2 small, ~0.5 medium, ~0.8 large."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group2) - np.mean(group1)) / pooled_std
# With large n, even tiny Cohen's d becomes "significant"
rng = np.random.default_rng(42)
ctrl = rng.normal(0, 1, size=100_000)
trt = rng.normal(0.02, 1, size=100_000) # true d = 0.02 (trivially small)
t, p = stats.ttest_ind(ctrl, trt)
d = cohen_d(ctrl, trt)
print(f"n=100K: p={p:.4f} (significant!), Cohen's d={d:.3f} (negligible)")
# p < 0.001 but d ~ 0.02: statistically significant, practically worthless
Practical vs Statistical Significance: The Trap That Kills Credibility
With large enough n, even a trivially small real difference will reject H0. At n = 1 million per group, a conversion rate difference well under a tenth of a percentage point can come back at p < 0.001. That difference is real. It is also irrelevant to any business decision.
Cohen's d benchmarks (for standardized effect size):
- Small: d = 0.2 (often too small to justify engineering cost)
- Medium: d = 0.5 (usually meaningful for product features)
- Large: d = 0.8 (very meaningful; rarely seen in mature products)
For proportions, always report absolute percentage point lift AND relative lift:
- "CTR increased from a typical 3.00% to 3.04% (p < 0.001)" — statistically significant, practically irrelevant
- "CTR increased from a typical 3.00% to 3.30% (p < 0.001, +10% relative)" — worth shipping
Always report both: p-value (statistical significance) AND confidence interval or effect size (practical significance). A CI of approximately [0.002%, 0.006%] tells you the effect is real but tiny. A CI of approximately [0.4%, 0.6%] tells you the effect is real and meaningful.
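A sketch of that reporting habit, using a Wald confidence interval for the absolute lift (the sample sizes and counts are invented to roughly reproduce the two examples above):
import numpy as np
from scipy import stats

def lift_report(n1, x1, n2, x2):
    """p-value plus 95% Wald CI for the absolute lift p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    z = (p2 - p1) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    p_val = 2 * (1 - stats.norm.cdf(abs(z)))
    half_width = 1.96 * np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p_val, (p2 - p1 - half_width, p2 - p1 + half_width)

p_tiny, ci_tiny = lift_report(5_000_000, 150_000, 5_000_000, 152_000)  # 3.00% -> 3.04%
p_real, ci_real = lift_report(200_000, 6_000, 200_000, 6_600)          # 3.00% -> 3.30%
print(f"tiny lift: p={p_tiny:.1e}, CI=[{ci_tiny[0]:+.3%}, {ci_tiny[1]:+.3%}]")  # real but irrelevant
print(f"real lift: p={p_real:.1e}, CI=[{ci_real[0]:+.3%}, {ci_real[1]:+.3%}]")  # real and meaningful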
Interview trap: The interviewer will show results with p=0.001 and ask "should we ship?" The right answer always checks the confidence interval for practical significance — not just the p-value.
One-Tailed vs Two-Tailed Tests: When Each Applies
Two-tailed test (default): H1: mu1 != mu2. Splits alpha across both tails (alpha/2 each). Critical value for alpha=0.05: z = plus or minus 1.96.
Use when: the effect could go in either direction. This is almost always the right choice. Even if you expect a positive effect, a negative effect (regression) is critical to detect.
One-tailed test: H1: mu1 > mu2 or mu1 < mu2. Concentrates all alpha in one tail. Critical value for alpha=0.05: z = 1.645. Slightly more power to detect effects in the specified direction.
Use ONLY when: (1) you pre-specify direction before seeing ANY data, AND (2) an effect in the opposite direction would have identical business implications as no effect — meaning you would take the same action regardless of which of those two outcomes occurs.
Why product teams almost always should use two-tailed: If your new checkout flow somehow increases cart abandonment, that is NOT equivalent to no effect — it is a regression that needs detection. Using one-tailed to boost power at the cost of missing regressions causes production incidents.
The p-hacking trap: Switching from two-tailed to one-tailed after seeing the data trending positive halves the p-value for the same data. This is p-hacking regardless of what story you tell yourself. Always decide the tail before running the test.
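A sketch of why switching tails after the fact is p-hacking: the same z-statistic yields half the p-value under a one-tailed alternative (the z value here is hypothetical):
from scipy import stats

z = 1.80  # hypothetical observed z-statistic from an A/B test
p_two_tailed = 2 * (1 - stats.norm.cdf(abs(z)))  # H1: means differ (either direction)
p_one_tailed = 1 - stats.norm.cdf(z)             # H1: treatment > control
print(f"two-tailed p = {p_two_tailed:.3f}")  # ~0.072 -> not significant at alpha = 0.05
print(f"one-tailed p = {p_one_tailed:.3f}")  # ~0.036 -> "significant" only if the tail was chosen pre-data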