Bootstrap & Resampling — Uncertainty for Arbitrary Statistics
Master Efron's bootstrap, BCa confidence intervals, permutation tests, block bootstrap for time series, and jackknife. Covers when bootstrap fails (extremes, dependent data), production use at Netflix/Stripe, and the bagging connection to ML.
Why Bootstrap Exists — The Statistic Without a Formula
Classical statistics hands you a formula for the standard error of the sample mean (SE = σ/√n) and the sample proportion, but the moment you ask about the median, the 90th percentile, an NDCG score, a Gini coefficient, or the ratio of two means, the closed-form derivations collapse. Before 1979, applied statisticians would either invoke asymptotic approximations that held only for huge n, or they would punt.
Efron's bootstrap (1979) fixed this in one move: if you cannot derive the sampling distribution analytically, simulate it by resampling the data you already have. The key conceptual leap is the plug-in principle — treat the empirical distribution F̂ (the histogram of your sample) as a stand-in for the true distribution F. Draw B new samples of size n from F̂ with replacement, compute your statistic θ̂* on each, and the resulting B values approximate the true sampling distribution of θ̂.
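A minimal sketch of that resampling loop (toy data; `np.median` stands in for any statistic without a closed-form SE):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=10.0, size=200)   # a skewed toy sample
B = 2000

# Draw B resamples of size n from F-hat: sample indices with replacement
idx = rng.integers(0, len(data), size=(B, len(data)))
theta_star = np.median(data[idx], axis=1)      # the statistic on each resample

boot_se = theta_star.std(ddof=1)               # bootstrap standard error of the median
```

The same three lines (resample indices, recompute statistic, take the spread) work unchanged for ratios, quantiles, or ranking metrics.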
The misconception that sinks candidates: "Bootstrap gives you more data." It does not. You have the same n observations — the bootstrap just extracts every drop of information about variability that those n points can give you. The uncertainty you estimate is a lower bound on your actual uncertainty, not a substitute for collecting more data.
What Interviewers Actually Test
A 6/10 answer says "bootstrap resamples with replacement to get confidence intervals." A 9/10 answer names the four common CI variants (percentile, basic, studentized, BCa), picks BCa as the default citing Efron 1987, and knows at least two cases where bootstrap fails (the maximum of a sample; dependent time-series data without block structure).
Staff-level signals: mentioning that bootstrap is a consistency result requiring regularity conditions (Bickel-Freedman 1981), distinguishing bootstrap from permutation tests for hypothesis testing, and citing block-length selection procedures (Politis-Romano). If asked about ML, connect bootstrap to bagging (Breiman 1996), out-of-bag error, and the .632+ estimator (Efron-Tibshirani 1997).
Clarifying Questions Before Reaching for the Bootstrap
What statistic do you need uncertainty for?
Mean, median, quantile, ratio, AUC, NDCG? If it is the mean with large n, use the CLT and avoid bootstrap. Bootstrap pays for itself when the statistic is non-smooth, a ratio, a quantile, or a metric with no closed-form SE.
Is the data i.i.d.?
Time series? Clustered (users with multiple sessions)? Hierarchical? Plain i.i.d. bootstrap breaks for dependent data. Switch to block bootstrap (Kunsch 1989) for time series, cluster bootstrap for grouped data, or stationary bootstrap (Politis-Romano 1994).
What sample size and tail behavior?
n < 10 is too small — bootstrap variance estimates are unreliable. Heavy-tailed distributions (revenue, LTV) with possibly infinite variance violate bootstrap consistency. Consider subsampling (Politis-Romano-Wolf 1999) or a log-transform.
Do you want a CI or a hypothesis test?
CI → bootstrap (prefer BCa). Hypothesis test of 'are these two groups the same?' → permutation test. Bootstrap CIs inverted into tests have wrong Type I error under H0.
Is the statistic a function of extremes?
Max, min, top-k, range? Bootstrap fails — the bootstrap mass never exceeds the sample max, underestimating variance. Use extreme value theory or parametric methods.
What is your compute budget?
B = 200 for SEs, B = 1000–10000 for CIs. BCa and percentile are O(B·cost(θ̂)). Studentized bootstrap is O(B²) because it requires a nested bootstrap to estimate SE at each replicate. For expensive statistics (training a model), consider subsampling with smaller n*.
The Bootstrap Principle — Plug-in Estimation
The mathematical core: we want the sampling distribution of θ̂ = T(X₁,...,Xₙ) where Xᵢ ~ F. We do not know F, but we have X₁,...,Xₙ. The empirical distribution F̂ₙ places mass 1/n on each observed point. By the Glivenko-Cantelli theorem, F̂ₙ → F uniformly as n → ∞, so for smooth functionals T, T(F̂ₙ) → T(F).
The bootstrap step: draw X₁*,...,Xₙ* i.i.d. from F̂ₙ (equivalent to sampling with replacement from the original data). Compute θ̂* = T(X₁*,...,Xₙ*). Repeat B times. The empirical distribution of {θ̂₁*,...,θ̂_B*} approximates the sampling distribution of θ̂, except that it is centered at θ̂ rather than at the true θ.
The bootstrap principle: the distribution of θ̂* − θ̂ (the bootstrapped statistic minus the plug-in estimator) approximates the distribution of θ̂ − θ (the estimator minus the true parameter). This is what makes bootstrap CIs valid — we use the bootstrap distribution of deviations, not the raw bootstrap distribution, to reason about where the true θ lies.
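The principle can be checked numerically. In the sketch below we simulate the true deviation distribution θ̂ − θ (possible only because we chose F ourselves) and compare its spread to the bootstrap deviations θ̂* − θ̂ computed from a single observed sample:

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta = 200, 1.0                      # true exponential mean

# True sampling deviations theta_hat - theta (requires knowing F; simulation only)
true_dev = rng.exponential(theta, size=(5000, n)).mean(axis=1) - theta

# Bootstrap approximation from ONE observed sample: theta*_b - theta_hat
sample = rng.exponential(theta, size=n)
theta_hat = sample.mean()
idx = rng.integers(0, n, size=(5000, n))
boot_dev = sample[idx].mean(axis=1) - theta_hat

# The two deviation distributions should have similar spread (near theta/sqrt(n))
print(true_dev.std(), boot_dev.std())
```

Both standard deviations come out close to θ/√n ≈ 0.071, which is exactly the substitution the bootstrap principle licenses.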
Bickel-Freedman (1981) proved consistency for the mean and many smooth functionals. The conditions that matter: finite variance, smoothness of T as a functional of F, and i.i.d. sampling. Violate any of these and the bootstrap can silently produce wrong CIs.
Bootstrap Estimator and Standard Error
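In the notation above, the bootstrap standard error is simply the standard deviation of the B replicates:

```latex
\widehat{\mathrm{SE}}_{\mathrm{boot}}
= \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\bigl(\hat\theta^{*}_{b}-\bar\theta^{*}\bigr)^{2}},
\qquad
\bar\theta^{*}=\frac{1}{B}\sum_{b=1}^{B}\hat\theta^{*}_{b}
```

For SE estimation alone, B ≈ 200 is typically enough; the larger B values quoted later are needed only for tail quantiles in CIs.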
Non-Parametric Bootstrap — The End-to-End Flow
Non-Parametric vs Parametric Bootstrap
Non-parametric bootstrap (the default) resamples directly from F̂ₙ. It makes no assumptions about the functional form of F. This is what 95% of production use cases need — ratio metrics, quantile metrics, ranking metrics.
Parametric bootstrap fits a parametric model (say, Normal(μ̂, σ̂²)) to the data, then draws B samples from the fitted model rather than resampling the raw observations. Use it when: (1) sample size is small (n < 30) and you have strong prior belief in the distributional family; (2) you need to bootstrap a statistic that depends on tail behavior the raw sample cannot express (e.g., the 99.9th percentile from n=500); (3) you want smoother bootstrap distributions for pivotal statistics.
Trade-off: parametric bootstrap is more efficient when the model is correct but badly misleading when the model is wrong — model misspecification propagates directly into the CI. Non-parametric is safer by default. A common hybrid at Stripe and Netflix: non-parametric bootstrap for business metrics, parametric for small-cell analyses where non-parametric has too few unique resamples.
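A sketch of the parametric variant under case (2) above, assuming a lognormal family (the family choice is an assumption, and it is the assumption that drives everything here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.lognormal(3.0, 1.2, size=60)          # small sample
B = 5000
z999 = stats.norm.ppf(0.999)

# Fit the assumed lognormal family by MLE on the log scale
mu_hat, sigma_hat = np.log(data).mean(), np.log(data).std(ddof=1)
theta_hat = np.exp(mu_hat + sigma_hat * z999)    # model-implied 99.9th percentile

# Parametric bootstrap: draw fresh samples from the FITTED model,
# refit on each, and recompute the model-implied quantile
log_samples = rng.normal(mu_hat, sigma_hat, size=(B, len(data)))
theta_star = np.exp(log_samples.mean(axis=1)
                    + log_samples.std(axis=1, ddof=1) * z999)
ci = tuple(np.quantile(theta_star, [0.025, 0.975]))
```

The non-parametric bootstrap cannot produce this interval at all: no resample of 60 points contains information beyond the observed maximum, while the fitted model extrapolates the tail for you (correctly only if the lognormal assumption holds).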
Bootstrap CI Methods — Which One to Use
| Method | Formula (for 95% CI) | Assumes | Accuracy | When to Use |
|---|---|---|---|---|
| Percentile | [θ̂*_(0.025), θ̂*_(0.975)] | Bootstrap distribution is centered and symmetric around θ | First-order | Symmetric, unbiased statistic; quick-and-dirty |
| Basic (Reverse Percentile) | [2θ̂ − θ̂*_(0.975), 2θ̂ − θ̂*_(0.025)] | Bootstrap distribution of θ̂*−θ̂ approximates θ̂−θ | First-order | Biased statistics where percentile misbehaves |
| Studentized (boot-t) | [θ̂ − t*_(0.975)·ŜE, θ̂ − t*_(0.025)·ŜE] | Can estimate SE of θ̂ for each replicate | Second-order (most accurate) | High-accuracy work; cost is O(B²) due to nested bootstrap |
| BCa (Bias-Corrected + Accelerated) | Adjusted percentiles using bias z₀ and acceleration a | Smooth, monotone transformation makes distribution normal | Second-order, transformation-invariant | Production default — what R's `boot` package uses |
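To make the first two table rows concrete, here is a sketch computing percentile and basic intervals from the same replicates; on skewed statistics the two visibly disagree:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.lognormal(3.0, 1.2, size=300)
theta_hat = np.median(data)

# One set of replicates, two interval constructions
idx = rng.integers(0, len(data), size=(5000, len(data)))
theta_star = np.median(data[idx], axis=1)
q_lo, q_hi = np.quantile(theta_star, [0.025, 0.975])

percentile_ci = (q_lo, q_hi)                             # read quantiles directly
basic_ci = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)  # reflect quantiles around theta_hat
```

When the bootstrap distribution is right-skewed, the percentile CI stretches right and the basic CI stretches left; BCa splits the difference by moving the quantile levels instead of the endpoints.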
BCa Confidence Interval Formulas
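In the notation of the preceding sections, BCa reads the bootstrap quantiles at adjusted levels α₁, α₂ built from a bias-correction term z₀ and a jackknife-based acceleration a (Efron 1987):

```latex
z_0 = \Phi^{-1}\!\left(\tfrac{1}{B}\,\#\{\hat\theta^{*}_{b} < \hat\theta\}\right),
\qquad
a = \frac{\sum_{i=1}^{n}\bigl(\bar\theta_{(\cdot)} - \hat\theta_{(i)}\bigr)^{3}}
         {6\Bigl[\sum_{i=1}^{n}\bigl(\bar\theta_{(\cdot)} - \hat\theta_{(i)}\bigr)^{2}\Bigr]^{3/2}}

\alpha_1 = \Phi\!\left(z_0 + \frac{z_0 + z_{\alpha/2}}{1 - a\,(z_0 + z_{\alpha/2})}\right),
\qquad
\alpha_2 = \Phi\!\left(z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a\,(z_0 + z_{1-\alpha/2})}\right)
```

Here θ̂₍ᵢ₎ is the leave-one-out (jackknife) replicate and θ̄₍·₎ their mean; the interval is [θ̂*₍α₁₎, θ̂*₍α₂₎]. With z₀ = 0 and a = 0 this collapses to the plain percentile interval.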
Non-Parametric Bootstrap with BCa Confidence Interval
```python
import numpy as np
from scipy import stats


def bootstrap_bca(data: np.ndarray, statistic, B: int = 10000,
                  alpha: float = 0.05, rng_seed: int = 42):
    """
    BCa (Bias-Corrected and Accelerated) bootstrap CI.

    Efron (1987). Production default — transformation invariant and
    second-order accurate. Used in R's `boot::boot.ci(type='bca')`.
    """
    rng = np.random.default_rng(rng_seed)
    n = len(data)
    theta_hat = statistic(data)

    # 1. Bootstrap replicates
    indices = rng.integers(0, n, size=(B, n))
    theta_star = np.array([statistic(data[idx]) for idx in indices])

    # 2. Bias correction z0: share of bootstrap replicates below theta_hat
    prop_below = np.mean(theta_star < theta_hat)
    # Clip to avoid infinite z0 when prop_below is 0 or 1
    prop_below = np.clip(prop_below, 1 / B, 1 - 1 / B)
    z0 = stats.norm.ppf(prop_below)

    # 3. Acceleration a: via jackknife (leave-one-out)
    jack = np.array([statistic(np.delete(data, i)) for i in range(n)])
    jack_mean = jack.mean()
    num = np.sum((jack_mean - jack) ** 3)
    den = 6.0 * (np.sum((jack_mean - jack) ** 2) ** 1.5)
    a = num / den if den > 0 else 0.0

    # 4. Adjusted percentiles
    z_lo, z_hi = stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2)
    alpha1 = stats.norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
    alpha2 = stats.norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
    ci_lo = np.quantile(theta_star, alpha1)
    ci_hi = np.quantile(theta_star, alpha2)

    return {
        "estimate": theta_hat,
        "se": theta_star.std(ddof=1),
        "ci_bca": (ci_lo, ci_hi),
        "bias_correction_z0": z0,
        "acceleration_a": a,
    }


# Example: 95% CI for the median of skewed revenue data
np.random.seed(0)
revenue = np.random.lognormal(mean=3.0, sigma=1.2, size=500)
result = bootstrap_bca(revenue, statistic=np.median, B=10000)
# -> estimate ~20.1, BCa CI ~(18.0, 22.3)
```
When the Bootstrap Fails — Four Traps
Bootstrap is not universal. The four failure modes below are common interview traps.
1. Extremes and order statistics. If θ = max(X), the bootstrap estimate max(X*) can never exceed max(X) — every resample is drawn from the original sample. The bootstrap distribution has a point mass at the true max with high probability, grossly underestimating variance. Frangos-Schucany (1990) formalized this. Fix: extreme value theory (GEV distributions) or subsampling.
2. Dependent data. I.i.d. bootstrap destroys autocorrelation. Apply it to time-series and your SEs will be dramatically too small. Fix: moving block bootstrap (Kunsch 1989) resamples contiguous blocks of length ℓ, preserving local dependence. Stationary bootstrap (Politis-Romano 1994) uses random block lengths from a geometric distribution. Block length ℓ ≈ n^(1/3) is a starting heuristic; Politis-Romano provides a data-driven procedure.
3. Very small samples (n < 10). With n = 5, there are only C(9,5) = 126 distinct bootstrap samples. The bootstrap distribution is discrete and coarse. Use exact methods (permutation, Fisher's exact) or parametric bootstrap.
4. Heavy tails with infinite variance. For Pareto-like distributions with shape α < 2, the variance is infinite and the CLT does not apply. Bootstrap CIs for the mean have incorrect coverage. Fix: log-transform, trimmed means, or m-out-of-n subsampling.
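Failure mode 1 is easy to demonstrate numerically. In the toy sketch below, the bootstrap maximum can never exceed the sample maximum, and a large share of replicates sit exactly at it:

```python
import numpy as np

rng = np.random.default_rng(5)
n, B = 100, 5000
data = rng.uniform(0, 1, size=n)
sample_max = data.max()

idx = rng.integers(0, n, size=(B, n))
boot_max = data[idx].max(axis=1)

# Every resample is drawn from the observed points, so boot_max <= sample_max always.
# Each resample includes the observed maximum with prob 1 - (1 - 1/n)^n -> 1 - 1/e
share_at_max = np.mean(boot_max == sample_max)
```

Roughly 63% of replicates equal the sample max exactly, so the bootstrap "distribution" of the maximum is a spike, not an approximation of its true sampling distribution.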
Block Bootstrap for Time Series — Preserving Dependence
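A sketch of the moving block bootstrap described above (`moving_block_bootstrap` is an illustrative helper, not a library function), applied to an AR(1) series where i.i.d. resampling would understate the SE of the mean:

```python
import numpy as np

def moving_block_bootstrap(x: np.ndarray, statistic, block_len: int,
                           B: int = 2000, seed: int = 0) -> np.ndarray:
    """Kunsch (1989): resample overlapping contiguous blocks of length
    block_len with replacement, preserving local dependence."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    starts_max = n - block_len + 1            # overlapping block start positions
    reps = np.empty(B)
    for b in range(B):
        starts = rng.integers(0, starts_max, size=n_blocks)
        sample = np.concatenate([x[s:s + block_len] for s in starts])[:n]
        reps[b] = statistic(sample)
    return reps

# AR(1) series with phi = 0.7 (strong positive autocorrelation)
rng = np.random.default_rng(6)
e = rng.normal(size=1000)
x = np.empty(1000)
x[0] = e[0]
for t in range(1, 1000):
    x[t] = 0.7 * x[t - 1] + e[t]

block_reps = moving_block_bootstrap(x, np.mean, block_len=round(1000 ** (1 / 3)))
iid_reps = moving_block_bootstrap(x, np.mean, block_len=1)   # degenerates to i.i.d.
```

With block length ℓ ≈ n^(1/3) = 10, the block-bootstrap SE of the mean comes out substantially larger than the i.i.d.-bootstrap SE, which is the whole point: the i.i.d. version silently throws away the autocorrelation.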
Permutation Tests — The Right Tool for Hypothesis Testing
Bootstrap CIs and permutation tests are often confused. They answer different questions.
Bootstrap estimates the sampling distribution of a statistic given the data. It is about uncertainty of the estimate and natively produces CIs.
Permutation test directly tests a null hypothesis of exchangeability — under H₀, the group labels are irrelevant, so any permutation of labels is equally likely. Shuffle the labels, recompute the statistic, and the fraction of permutations with statistic ≥ observed gives the p-value. This is exact: no asymptotics, no distributional assumptions, valid at any sample size.
When to use permutation over bootstrap for testing: essentially always — for two-sample mean differences, correlations, regression coefficients, and any two-group comparison. Bootstrap "tests" (e.g., checking whether 0 falls in the bootstrap CI of θ̂₁ − θ̂₂) have subtly wrong Type I error under H₀ because the bootstrap distribution is centered around the observed difference, not around 0.
Limitation of permutation tests: they require exchangeability. If treatment has heterogeneous variance (e.g., unequal-variance Welch case), the vanilla permutation test loses validity — use studentized statistics or the Janssen-Pauls modification.
Two-Sample Permutation Test for Mean Difference
```python
import numpy as np


def permutation_test_two_sample(x: np.ndarray, y: np.ndarray,
                                n_perm: int = 10000,
                                rng_seed: int = 42) -> dict:
    """
    Permutation test for H0: x and y come from the same distribution.

    Tests the difference in means; swap the mean-difference line for
    other statistics (ratios, quantiles). Used at Netflix/Stripe when
    A/B test metrics have no closed-form test.
    """
    rng = np.random.default_rng(rng_seed)
    observed = x.mean() - y.mean()
    combined = np.concatenate([x, y])
    n_x = len(x)
    n_total = len(combined)

    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle labels by reindexing the pooled sample
        idx = rng.permutation(n_total)
        perm_x = combined[idx[:n_x]]
        perm_y = combined[idx[n_x:]]
        perm_stats[i] = perm_x.mean() - perm_y.mean()

    # Two-sided p-value
    p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
    # Add-one correction for the Monte Carlo null (Phipson-Smyth 2010)
    p_value_corrected = (np.sum(np.abs(perm_stats) >= np.abs(observed)) + 1) / (n_perm + 1)

    return {
        "observed_diff": observed,
        "p_value": p_value,
        "p_value_corrected": p_value_corrected,
        "null_distribution_mean": perm_stats.mean(),  # ~0 under H0
        "null_distribution_sd": perm_stats.std(ddof=1),
    }


# Example: A/B test on skewed revenue data — permutation vs t-test
rng = np.random.default_rng(0)
control = rng.lognormal(3.0, 1.2, 500)
treatment = rng.lognormal(3.05, 1.2, 500)  # 5% lift in log scale
out = permutation_test_two_sample(control, treatment, n_perm=10000)
# p_value ~ 0.04 — permutation captures heavy-tail behavior t-test misses
```
Bootstrap in Machine Learning — Bagging, OOB, Deep Ensembles
The bootstrap is not just a statistical tool — it is the foundation of major ML methods.
Bagging (Breiman 1996) — "bootstrap aggregating" — trains B models on B bootstrap samples of the training data and averages predictions. Each bootstrap sample contains ~63.2% of the training data, leaving ~36.8% as out-of-bag (OOB) samples for free, unbiased error estimation without a separate validation set. Random forests are bagging + random feature subsets at each split. For B = 500 trees, OOB error closely approximates 10-fold CV error at a fraction of the cost.
.632+ estimator (Efron-Tibshirani 1997) generalizes OOB. Naive OOB is slightly pessimistic (each model trains on only ~63.2% of the data); .632+ combines training error and OOB error with data-dependent weights to de-bias, and is the theoretically preferred small-n alternative to CV.
Uncertainty quantification for ML. Bootstrap the training set → train B models → get a distribution over predictions at each test point. Practical for linear models, trees, and small neural nets. For deep learning, the bootstrap is prohibitively expensive, but deep ensembles (Lakshminarayanan et al., 2017) — training B networks from different random initializations on the same data — emerged as the DL analog and remain the strongest baseline for uncertainty estimation, outperforming MC-Dropout and variational methods in most benchmarks.
Bootstrap CIs for ML metrics (AUC, precision@k, NDCG@10) are the standard approach — no closed-form SE exists for most ranking metrics, so bootstrap the test set and recompute.
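A sketch of that pattern for AUC, resampling test-set rows and recomputing the metric on each resample. To keep it self-contained we compute AUC via the Mann-Whitney identity in NumPy (`auc_score` is an illustrative helper, not a library function):

```python
import numpy as np

def auc_score(y_true: np.ndarray, scores: np.ndarray) -> float:
    """AUC via the Mann-Whitney identity: P(score_pos > score_neg),
    counting ties as 1/2."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(7)
n = 400
y = rng.integers(0, 2, size=n)                 # test-set labels
scores = rng.normal(loc=y * 1.0, scale=1.0)    # informative classifier scores

# Bootstrap the TEST SET rows; recompute AUC on each resample
B = 2000
idx = rng.integers(0, n, size=(B, n))
auc_star = np.array([auc_score(y[i], scores[i]) for i in idx])
ci = tuple(np.quantile(auc_star, [0.025, 0.975]))
```

The same loop works unchanged for precision@k or NDCG@10: only the metric function changes, which is exactly why bootstrap is the default for ranking-metric uncertainty.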
Bootstrap vs Permutation vs Parametric CI — Decision Matrix
| Method | Goal | Key Assumption | Accuracy | Cost | When It Fails |
|---|---|---|---|---|---|
| Parametric CI (t, z) | CI for mean/proportion | Normality or CLT applicable; known SE formula | Exact if assumption holds | O(1) | Non-normal, heavy-tailed, quantiles, ratios |
| Non-parametric Bootstrap | CI for any smooth statistic | i.i.d. data, finite variance, smooth functional | First-to-second order | O(B·cost(θ̂)) | Extremes, dependence, n<10, infinite variance |
| Block Bootstrap | CI for time-series statistic | Stationarity, mixing | First-order | O(B·cost(θ̂)) | Non-stationary, structural breaks |
| Permutation Test | P-value for H0: exchangeability | Exchangeability under H0 | Exact | O(P·cost(θ̂)) | Unequal variances (use studentized), paired data |
| Jackknife | Bias + SE for smooth statistic | Smooth functional, i.i.d. | First-order, less efficient than bootstrap | O(n·cost(θ̂)) | Non-smooth statistics (e.g., the median) |
Jackknife — The Cheaper Cousin
Jackknife (Quenouille 1949, Tukey) predates the bootstrap by 30 years. For each i ∈ {1,...,n}, compute θ̂₍ᵢ₎ = T(X₋ᵢ) — the statistic with the i-th observation removed. The jackknife SE and bias estimators use these n leave-one-out replicates.
Advantages: only n computations (cheaper than B=10,000 bootstrap). Produces the acceleration parameter a in BCa via the third-moment formula. Historically the go-to bias-correction method.
Disadvantages: less efficient than bootstrap, and it fails catastrophically for non-smooth statistics like the median — the leave-one-out replicates jump discontinuously as points are removed, so the jackknife variance estimator is inconsistent. For any modern application, prefer the bootstrap; jackknife's main role today is computing BCa's acceleration parameter and variance estimates for bagged predictors (the jackknife-after-bootstrap idea).
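A sketch of the jackknife SE (`jackknife_se` is an illustrative helper). A useful sanity check: for the sample mean it reproduces the classical s/√n exactly:

```python
import numpy as np

def jackknife_se(data: np.ndarray, statistic) -> float:
    """Tukey's jackknife SE from n leave-one-out replicates:
    SE = sqrt((n-1)/n * sum((theta_(i) - theta_bar)^2))."""
    n = len(data)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2)))

rng = np.random.default_rng(8)
data = rng.normal(10, 2, size=100)
se_jack = jackknife_se(data, np.mean)
se_classical = data.std(ddof=1) / np.sqrt(len(data))  # agrees exactly for the mean
```

Swap `np.mean` for `np.median` and the estimate becomes unreliable, which is the non-smoothness failure described above.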
Common Bootstrap Pitfalls in Production
1. Ignoring pairing. For paired data (before/after per user), you must resample pairs, not individual observations — otherwise you break the dependence that the paired design is exploiting, inflating variance.
2. Ignoring clustering. If users appear multiple times (multiple sessions, multiple purchases), resample at the user level (cluster bootstrap), not the row level. Row-level bootstrap undercounts within-user correlation and produces CIs that are too narrow — the #1 source of "significant" results that do not replicate.
3. Bootstrap for testing instead of permutation. If the question is "is A different from B under H₀," use a permutation test. Bootstrap CIs inverted into tests have the wrong reference distribution.
4. Too few bootstrap samples. B = 100 is insufficient for any CI work — tail quantiles are noisy. Use B ≥ 1,000 for 95% CIs and B ≥ 10,000 if you need 99% CIs or BCa acceleration estimates to be stable.
5. Using percentile CI when statistic is biased. If θ̂ is biased (e.g., the sample variance with the 1/n denominator), percentile CIs are wrong. BCa corrects for bias; basic CI does as well but loses the transformation-invariance of BCa.
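Pitfall 2 in code: a sketch of a user-level (cluster) bootstrap against the naive row-level SE (`cluster_bootstrap_mean` is an illustrative helper; the simulated user effect stands in for within-user correlation):

```python
import numpy as np

def cluster_bootstrap_mean(values: np.ndarray, user_ids: np.ndarray,
                           B: int = 2000, seed: int = 0) -> np.ndarray:
    """Resample whole users with replacement, keeping every row
    belonging to each sampled user. Never resample individual rows
    when rows within a user are correlated."""
    rng = np.random.default_rng(seed)
    users = np.unique(user_ids)
    rows_by_user = {u: np.flatnonzero(user_ids == u) for u in users}
    reps = np.empty(B)
    for b in range(B):
        sampled = rng.choice(users, size=len(users), replace=True)
        rows = np.concatenate([rows_by_user[u] for u in sampled])
        reps[b] = values[rows].mean()
    return reps

# Simulated sessions: 5 rows per user, strong within-user correlation
rng = np.random.default_rng(9)
n_users, sessions = 200, 5
user_effect = rng.normal(0, 1, n_users)
user_ids = np.repeat(np.arange(n_users), sessions)
values = user_effect[user_ids] + rng.normal(0, 0.3, n_users * sessions)

cluster_se = cluster_bootstrap_mean(values, user_ids).std(ddof=1)
row_se = values.std(ddof=1) / np.sqrt(len(values))   # naive row-level SE, too narrow
```

With a strong user effect the cluster-bootstrap SE comes out roughly twice the row-level SE here, which is exactly the gap that produces "significant" results that fail to replicate.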
Production Use at Netflix, Stripe, and Beyond
Netflix uses bootstrap for CIs on ranking metrics (NDCG, MRR) in offline experimentation, and for revenue-per-user CIs where the long-tail of subscription revenue violates t-test assumptions. Block bootstrap handles autocorrelated engagement metrics.
Stripe uses bootstrap for A/B test CIs on payment success rates by merchant segment (heavy-tailed transaction volume) and for ratio metrics like authorization rate per attempt, where the numerator and denominator are correlated and no closed-form SE exists.
Uber, DoorDash, LinkedIn apply cluster bootstrap at the user or city level to avoid the within-user correlation trap for metrics like bookings-per-session or jobs-viewed-per-visit.
General patterns: (1) bootstrap for all ratio metrics; (2) cluster bootstrap when the randomization unit is not the measurement unit; (3) block bootstrap for intraday seasonality or autocorrelated metrics; (4) permutation tests for hypothesis testing on any custom metric without a standard test; (5) B = 10,000 as the default for production CIs, computed asynchronously via Spark or Ray if the statistic is expensive.
What to Say in the Interview Tomorrow
"Bootstrap is Efron's 1979 plug-in method for simulating the sampling distribution of any statistic by resampling with replacement from the empirical distribution. It does not create new data — it quantifies what variability the existing n points imply about the estimand.
For production, I default to BCa CIs (Efron 1987) with B=10,000 because BCa corrects for bias and skewness and is transformation-invariant. I use percentile CIs only as a quick sanity check.
I know bootstrap fails for extremes (max, min), dependent data (use block bootstrap — Kunsch 1989), n<10, and infinite variance distributions. For hypothesis testing I reach for permutation tests, not bootstrap CIs — the reference distribution is correct under H₀.
In ML, bootstrap is the foundation of bagging (Breiman 1996) — each tree in a random forest is trained on a bootstrap sample, and the ~36.8% out-of-bag samples give free unbiased error estimates. For deep learning uncertainty, deep ensembles (Lakshminarayanan 2017) are the practical analog since bootstrapping full training runs is too expensive.
The trap I avoid: bootstrapping rows when data is clustered by user. That undercounts within-user correlation and produces CIs that are systematically too narrow — use cluster bootstrap at the randomization unit."