Bootstrap & Resampling — Uncertainty for Arbitrary Statistics
Master Efron's bootstrap, BCa confidence intervals, permutation tests, block bootstrap for time series, and jackknife. Covers when bootstrap fails (extremes, dependent data), production use at Netflix/Stripe, and the bagging connection to ML.
Why Bootstrap Exists — The Statistic Without a Formula
Classical statistics hands you a formula for the standard error of the sample mean (SE = σ/√n) and the sample proportion, but the moment you ask about the median, the 90th percentile, an NDCG score, a Gini coefficient, or the ratio of two means, the closed-form derivations collapse. Before 1979, applied statisticians would either invoke asymptotic approximations that held only for huge n, or they would punt.
Efron's bootstrap (1979) fixed this in one move: if you cannot derive the sampling distribution analytically, simulate it by resampling the data you already have. The key conceptual leap is the plug-in principle — treat the empirical distribution F̂ (the histogram of your sample) as a stand-in for the true distribution F. Draw B new samples of size n from F̂ with replacement, compute your statistic θ̂* on each, and the resulting B values approximate the true sampling distribution of θ̂.
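A minimal sketch of that resampling loop (toy data; `np.median` stands in for any statistic without a closed-form SE):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=10.0, size=200)   # a skewed toy sample
B = 2000

# Draw B resamples of size n from F-hat: sample indices with replacement
idx = rng.integers(0, len(data), size=(B, len(data)))
theta_star = np.median(data[idx], axis=1)      # the statistic on each resample

boot_se = theta_star.std(ddof=1)               # bootstrap standard error of the median
```

The same three lines (resample indices, recompute statistic, take the spread) work unchanged for ratios, quantiles, or ranking metrics.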
The misconception that sinks candidates: "Bootstrap gives you more data." It does not. You have the same n observations — the bootstrap just extracts every drop of information about variability that those n points can give you. The uncertainty you estimate is a lower bound on your actual uncertainty, not a substitute for collecting more data.
What Interviewers Actually Test
A 6/10 answer says "bootstrap resamples with replacement to get confidence intervals." A 9/10 answer names the four common CI variants (percentile, basic, studentized, BCa), picks BCa as the default citing Efron 1987, and knows at least two cases where bootstrap fails (the maximum of a sample; dependent time-series data without block structure).
Staff-level signals: mentioning that bootstrap is a consistency result requiring regularity conditions (Bickel-Freedman 1981), distinguishing bootstrap from permutation tests for hypothesis testing, and citing block-length selection procedures (Politis-Romano). If asked about ML, connect bootstrap to bagging (Breiman 1996), out-of-bag error, and the .632+ estimator (Efron-Tibshirani 1997).
Clarifying Questions Before Reaching for the Bootstrap
What statistic do you need uncertainty for?
Mean, median, quantile, ratio, AUC, NDCG? If it is the mean with large n, use the CLT and avoid bootstrap. Bootstrap pays for itself when the statistic is non-smooth, a ratio, a quantile, or a metric with no closed-form SE.
Is the data i.i.d.?
Time series? Clustered (users with multiple sessions)? Hierarchical? Plain i.i.d. bootstrap breaks for dependent data. Switch to block bootstrap (Kunsch 1989) for time series, cluster bootstrap for grouped data, or stationary bootstrap (Politis-Romano 1994).
What sample size and tail behavior?
n < 10 is too small — bootstrap variance estimates are unreliable. Heavy-tailed distributions (revenue, LTV) with possibly infinite variance violate bootstrap consistency. Consider subsampling (Politis-Romano-Wolf 1999) or a log-transform.
Do you want a CI or a hypothesis test?
CI → bootstrap (prefer BCa). Hypothesis test of 'are these two groups the same?' → permutation test. Bootstrap CIs inverted into tests have wrong Type I error under H0.
Is the statistic a function of extremes?
Max, min, top-k, range? Bootstrap fails — the bootstrap mass never exceeds the sample max, underestimating variance. Use extreme value theory or parametric methods.
What is your compute budget?
B = 200 for SEs, B = 1000–10000 for CIs. BCa and percentile are O(B·cost(θ̂)). Studentized bootstrap is O(B²) because it requires a nested bootstrap to estimate SE at each replicate. For expensive statistics (training a model), consider subsampling with smaller n*.
The Bootstrap Principle — Plug-in Estimation
The mathematical core: we want the sampling distribution of θ̂ = T(X₁,...,Xₙ) where Xᵢ ~ F. We do not know F, but we have X₁,...,Xₙ. The empirical distribution F̂ₙ places mass 1/n on each observed point. By the Glivenko-Cantelli theorem, F̂ₙ → F uniformly as n → ∞, so for smooth functionals T, T(F̂ₙ) → T(F).
The bootstrap step: draw X₁*,...,Xₙ* i.i.d. from F̂ₙ (equivalent to sampling with replacement from the original data). Compute θ̂* = T(X₁*,...,Xₙ*). Repeat B times. The empirical distribution of {θ̂₁*,...,θ̂_B*} approximates the sampling distribution of θ̂, except that it is centered at θ̂ rather than at the true θ.
The bootstrap principle: the distribution of θ̂* − θ̂ (the bootstrapped statistic minus the plug-in estimator) approximates the distribution of θ̂ − θ (the estimator minus the true parameter). This is what makes bootstrap CIs valid — we use the bootstrap distribution of deviations, not the raw bootstrap distribution, to reason about where the true θ lies.
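The principle can be checked numerically. In the sketch below we simulate the true deviation distribution θ̂ − θ (possible only because we chose F ourselves) and compare its spread to the bootstrap deviations θ̂* − θ̂ computed from a single observed sample:

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta = 200, 1.0                      # true exponential mean

# True sampling deviations theta_hat - theta (requires knowing F; simulation only)
true_dev = rng.exponential(theta, size=(5000, n)).mean(axis=1) - theta

# Bootstrap approximation from ONE observed sample: theta*_b - theta_hat
sample = rng.exponential(theta, size=n)
theta_hat = sample.mean()
idx = rng.integers(0, n, size=(5000, n))
boot_dev = sample[idx].mean(axis=1) - theta_hat

# The two deviation distributions should have similar spread (near theta/sqrt(n))
print(true_dev.std(), boot_dev.std())
```

Both standard deviations come out close to θ/√n ≈ 0.071, which is exactly the substitution the bootstrap principle licenses.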
Bickel-Freedman (1981) proved consistency for the mean and many smooth functionals. The conditions that matter: finite variance, smoothness of T as a functional of F, and i.i.d. sampling. Violate any of these and the bootstrap can silently produce wrong CIs.
Bootstrap Estimator and Standard Error
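In the notation above, the bootstrap standard error is simply the standard deviation of the B replicates:

```latex
\widehat{\mathrm{SE}}_{\mathrm{boot}}
= \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\bigl(\hat\theta^{*}_{b}-\bar\theta^{*}\bigr)^{2}},
\qquad
\bar\theta^{*}=\frac{1}{B}\sum_{b=1}^{B}\hat\theta^{*}_{b}
```

For SE estimation alone, B ≈ 200 is typically enough; the larger B values quoted later are needed only for tail quantiles in CIs.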
Non-Parametric Bootstrap — The End-to-End Flow
Non-Parametric vs Parametric Bootstrap
Non-parametric bootstrap (the default) resamples directly from F̂ₙ. It makes no assumptions about the functional form of F. This is what 95% of production use cases need — ratio metrics, quantile metrics, ranking metrics.
Parametric bootstrap fits a parametric model (say, Normal(μ̂, σ̂²)) to the data, then draws B samples from the fitted model rather than resampling the raw observations. Use it when: (1) sample size is small (n < 30) and you have strong prior belief in the distributional family; (2) you need to bootstrap a statistic that depends on tail behavior the raw sample cannot express (e.g., the 99.9th percentile from n=500); (3) you want smoother bootstrap distributions for pivotal statistics.
Trade-off: parametric bootstrap is more efficient when the model is correct but badly misleading when the model is wrong — model misspecification propagates directly into the CI. Non-parametric is safer by default. A common hybrid at Stripe and Netflix: non-parametric bootstrap for business metrics, parametric for small-cell analyses where non-parametric has too few unique resamples.
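A sketch of the parametric variant under case (2) above, assuming a lognormal family (the family choice is an assumption, and it is the assumption that drives everything here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.lognormal(3.0, 1.2, size=60)          # small sample
B = 5000
z999 = stats.norm.ppf(0.999)

# Fit the assumed lognormal family by MLE on the log scale
mu_hat, sigma_hat = np.log(data).mean(), np.log(data).std(ddof=1)
theta_hat = np.exp(mu_hat + sigma_hat * z999)    # model-implied 99.9th percentile

# Parametric bootstrap: draw fresh samples from the FITTED model,
# refit on each, and recompute the model-implied quantile
log_samples = rng.normal(mu_hat, sigma_hat, size=(B, len(data)))
theta_star = np.exp(log_samples.mean(axis=1)
                    + log_samples.std(axis=1, ddof=1) * z999)
ci = tuple(np.quantile(theta_star, [0.025, 0.975]))
```

The non-parametric bootstrap cannot produce this interval at all: no resample of 60 points contains information beyond the observed maximum, while the fitted model extrapolates the tail for you (correctly only if the lognormal assumption holds).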
Bootstrap CI Methods — Which One to Use
| Method | Formula (for 95% CI) | Assumes | Accuracy | When to Use |
|---|---|---|---|---|
| Percentile | [θ̂*_(0.025), θ̂*_(0.975)] | Bootstrap distribution is centered and symmetric around θ | First-order | Symmetric, unbiased statistic; quick-and-dirty |
| Basic (Reverse Percentile) | [2θ̂ − θ̂*_(0.975), 2θ̂ − θ̂*_(0.025)] | Bootstrap distribution of θ̂*−θ̂ approximates θ̂−θ | First-order | Biased statistics where percentile misbehaves |
| Studentized (boot-t) | [θ̂ − t*_(0.975)·ŜE, θ̂ − t*_(0.025)·ŜE] | Can estimate SE of θ̂ for each replicate | Second-order (most accurate) | High-accuracy work; cost is O(B²) due to nested bootstrap |
| BCa (Bias-Corrected + Accelerated) | Adjusted percentiles using bias z₀ and acceleration a | Smooth, monotone transformation makes distribution normal | Second-order, transformation-invariant | Production default — what R's `boot` package uses |
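To make the first two table rows concrete, here is a sketch computing percentile and basic intervals from the same replicates; on skewed statistics the two visibly disagree:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.lognormal(3.0, 1.2, size=300)
theta_hat = np.median(data)

# One set of replicates, two interval constructions
idx = rng.integers(0, len(data), size=(5000, len(data)))
theta_star = np.median(data[idx], axis=1)
q_lo, q_hi = np.quantile(theta_star, [0.025, 0.975])

percentile_ci = (q_lo, q_hi)                             # read quantiles directly
basic_ci = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)  # reflect quantiles around theta_hat
```

When the bootstrap distribution is right-skewed, the percentile CI stretches right and the basic CI stretches left; BCa splits the difference by moving the quantile levels instead of the endpoints.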
BCa Confidence Interval Formulas
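In the notation of the preceding sections, BCa reads the bootstrap quantiles at adjusted levels α₁, α₂ built from a bias-correction term z₀ and a jackknife-based acceleration a (Efron 1987):

```latex
z_0 = \Phi^{-1}\!\left(\tfrac{1}{B}\,\#\{\hat\theta^{*}_{b} < \hat\theta\}\right),
\qquad
a = \frac{\sum_{i=1}^{n}\bigl(\bar\theta_{(\cdot)} - \hat\theta_{(i)}\bigr)^{3}}
         {6\Bigl[\sum_{i=1}^{n}\bigl(\bar\theta_{(\cdot)} - \hat\theta_{(i)}\bigr)^{2}\Bigr]^{3/2}}

\alpha_1 = \Phi\!\left(z_0 + \frac{z_0 + z_{\alpha/2}}{1 - a\,(z_0 + z_{\alpha/2})}\right),
\qquad
\alpha_2 = \Phi\!\left(z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a\,(z_0 + z_{1-\alpha/2})}\right)
```

Here θ̂₍ᵢ₎ is the leave-one-out (jackknife) replicate and θ̄₍·₎ their mean; the interval is [θ̂*₍α₁₎, θ̂*₍α₂₎]. With z₀ = 0 and a = 0 this collapses to the plain percentile interval.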
Non-Parametric Bootstrap with BCa Confidence Interval
```python
import numpy as np
from scipy import stats


def bootstrap_bca(data: np.ndarray, statistic, B: int = 10000,
                  alpha: float = 0.05, rng_seed: int = 42):
    """
    BCa (Bias-Corrected and Accelerated) bootstrap CI.

    Efron (1987). Production default — transformation invariant and
    second-order accurate. Used in R's `boot::boot.ci(type='bca')`.
    """
    rng = np.random.default_rng(rng_seed)
    n = len(data)
    theta_hat = statistic(data)

    # 1. Bootstrap replicates
    indices = rng.integers(0, n, size=(B, n))
    theta_star = np.array([statistic(data[idx]) for idx in indices])

    # 2. Bias correction z0: share of bootstrap replicates below theta_hat
    prop_below = np.mean(theta_star < theta_hat)
    # Clip to avoid infinite z0 when prop_below is 0 or 1
    prop_below = np.clip(prop_below, 1 / B, 1 - 1 / B)
    z0 = stats.norm.ppf(prop_below)

    # 3. Acceleration a: via jackknife (leave-one-out)
    jack = np.array([statistic(np.delete(data, i)) for i in range(n)])
    jack_mean = jack.mean()
    num = np.sum((jack_mean - jack) ** 3)
    den = 6.0 * (np.sum((jack_mean - jack) ** 2) ** 1.5)
    a = num / den if den > 0 else 0.0

    # 4. Adjusted percentiles
    z_lo, z_hi = stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2)
    alpha1 = stats.norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
    alpha2 = stats.norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
    ci_lo = np.quantile(theta_star, alpha1)
    ci_hi = np.quantile(theta_star, alpha2)

    return {
        "estimate": theta_hat,
        "se": theta_star.std(ddof=1),
        "ci_bca": (ci_lo, ci_hi),
        "bias_correction_z0": z0,
        "acceleration_a": a,
    }


# Example: 95% CI for the median of skewed revenue data
np.random.seed(0)
revenue = np.random.lognormal(mean=3.0, sigma=1.2, size=500)
result = bootstrap_bca(revenue, statistic=np.median, B=10000)
# -> estimate ~20.1, BCa CI ~(18.0, 22.3)
```
When the Bootstrap Fails — Four Traps
Bootstrap is not universal. The four failure modes below are common interview traps.
1. Extremes and order statistics. If θ = max(X), the bootstrap estimate max(X*) can never exceed max(X) — every resample is drawn from the original sample. The bootstrap distribution has a point mass at the true max with high probability, grossly underestimating variance. Frangos-Schucany (1990) formalized this. Fix: extreme value theory (GEV distributions) or subsampling.
2. Dependent data. I.i.d. bootstrap destroys autocorrelation. Apply it to time-series and your SEs will be dramatically too small. Fix: moving block bootstrap (Kunsch 1989) resamples contiguous blocks of length ℓ, preserving local dependence. Stationary bootstrap (Politis-Romano 1994) uses random block lengths from a geometric distribution. Block length ℓ ≈ n^(1/3) is a starting heuristic; Politis-Romano provides a data-driven procedure.
3. Very small samples (n < 10). With n = 5, there are only C(9,5) = 126 distinct bootstrap samples. The bootstrap distribution is discrete and coarse. Use exact methods (permutation, Fisher's exact) or parametric bootstrap.
4. Heavy tails with infinite variance. For Pareto-like distributions with shape α < 2, the variance is infinite and the CLT does not apply. Bootstrap CIs for the mean have incorrect coverage. Fix: log-transform, trimmed means, or m-out-of-n subsampling.
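Failure mode 1 is easy to demonstrate numerically. In the toy sketch below, the bootstrap maximum can never exceed the sample maximum, and a large share of replicates sit exactly at it:

```python
import numpy as np

rng = np.random.default_rng(5)
n, B = 100, 5000
data = rng.uniform(0, 1, size=n)
sample_max = data.max()

idx = rng.integers(0, n, size=(B, n))
boot_max = data[idx].max(axis=1)

# Every resample is drawn from the observed points, so boot_max <= sample_max always.
# Each resample includes the observed maximum with prob 1 - (1 - 1/n)^n -> 1 - 1/e
share_at_max = np.mean(boot_max == sample_max)
```

Roughly 63% of replicates equal the sample max exactly, so the bootstrap "distribution" of the maximum is a spike, not an approximation of its true sampling distribution.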
Block Bootstrap for Time Series — Preserving Dependence
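A sketch of the moving block bootstrap described above (`moving_block_bootstrap` is an illustrative helper, not a library function), applied to an AR(1) series where i.i.d. resampling would understate the SE of the mean:

```python
import numpy as np

def moving_block_bootstrap(x: np.ndarray, statistic, block_len: int,
                           B: int = 2000, seed: int = 0) -> np.ndarray:
    """Kunsch (1989): resample overlapping contiguous blocks of length
    block_len with replacement, preserving local dependence."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    starts_max = n - block_len + 1            # overlapping block start positions
    reps = np.empty(B)
    for b in range(B):
        starts = rng.integers(0, starts_max, size=n_blocks)
        sample = np.concatenate([x[s:s + block_len] for s in starts])[:n]
        reps[b] = statistic(sample)
    return reps

# AR(1) series with phi = 0.7 (strong positive autocorrelation)
rng = np.random.default_rng(6)
e = rng.normal(size=1000)
x = np.empty(1000)
x[0] = e[0]
for t in range(1, 1000):
    x[t] = 0.7 * x[t - 1] + e[t]

block_reps = moving_block_bootstrap(x, np.mean, block_len=round(1000 ** (1 / 3)))
iid_reps = moving_block_bootstrap(x, np.mean, block_len=1)   # degenerates to i.i.d.
```

With block length ℓ ≈ n^(1/3) = 10, the block-bootstrap SE of the mean comes out substantially larger than the i.i.d.-bootstrap SE, which is the whole point: the i.i.d. version silently throws away the autocorrelation.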
Permutation Tests — The Right Tool for Hypothesis Testing
Bootstrap CIs and permutation tests are often confused. They answer different questions.
Bootstrap estimates the sampling distribution of a statistic given the data. It is about uncertainty of the estimate and natively produces CIs.
Permutation test directly tests a null hypothesis of exchangeability — under H₀, the group labels are irrelevant, so any permutation of labels is equally likely. Shuffle the labels, recompute the statistic, and the fraction of permutations with statistic ≥ observed gives the p-value. This is exact: no asymptotics, no distributional assumptions, valid at any sample size.
When to use permutation over bootstrap for testing: essentially always — for two-sample mean differences, correlations, regression coefficients, and any two-group comparison. Bootstrap "tests" (e.g., checking whether 0 falls in the bootstrap CI of θ̂₁ − θ̂₂) have subtly wrong Type I error under H₀ because the bootstrap distribution is centered around the observed difference, not around 0.
Limitation of permutation tests: they require exchangeability. If treatment has heterogeneous variance (e.g., unequal-variance Welch case), the vanilla permutation test loses validity — use studentized statistics or the Janssen-Pauls modification.
Two-Sample Permutation Test for Mean Difference
```python
import numpy as np


def permutation_test_two_sample(x: np.ndarray, y: np.ndarray,
                                n_perm: int = 10000,
                                rng_seed: int = 42) -> dict:
    """
    Permutation test for H0: x and y come from the same distribution.

    Tests the difference in means; swap the mean-difference line for
    other statistics (ratios, quantiles). Used at Netflix/Stripe when
    A/B test metrics have no closed-form test.
    """
    rng = np.random.default_rng(rng_seed)
    observed = x.mean() - y.mean()
    combined = np.concatenate([x, y])
    n_x = len(x)
    n_total = len(combined)

    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle labels by reindexing the pooled sample
        idx = rng.permutation(n_total)
        perm_x = combined[idx[:n_x]]
        perm_y = combined[idx[n_x:]]
        perm_stats[i] = perm_x.mean() - perm_y.mean()

    # Two-sided p-value
    p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
    # Add-one correction for the Monte Carlo null (Phipson-Smyth 2010)
    p_value_corrected = (np.sum(np.abs(perm_stats) >= np.abs(observed)) + 1) / (n_perm + 1)

    return {
        "observed_diff": observed,
        "p_value": p_value,
        "p_value_corrected": p_value_corrected,
        "null_distribution_mean": perm_stats.mean(),  # ~0 under H0
        "null_distribution_sd": perm_stats.std(ddof=1),
    }


# Example: A/B test on skewed revenue data — permutation vs t-test
rng = np.random.default_rng(0)
control = rng.lognormal(3.0, 1.2, 500)
treatment = rng.lognormal(3.05, 1.2, 500)  # 5% lift in log scale
out = permutation_test_two_sample(control, treatment, n_perm=10000)
# p_value ~ 0.04 — permutation captures heavy-tail behavior t-test misses
```
Bootstrap in Machine Learning — Bagging, OOB, Deep Ensembles
The bootstrap is not just a statistical tool — it is the foundation of major ML methods.
Bagging (Breiman 1996) — "bootstrap aggregating" — trains B models on B bootstrap samples of the training data and averages predictions. Each bootstrap sample contains ~63.2% of the training data, leaving ~36.8% as out-of-bag (OOB) samples for free, unbiased error estimation without a separate validation set. Random forests are bagging + random feature subsets at each split. For B = 500 trees, OOB error closely approximates 10-fold CV error at a fraction of the cost.
.632+ estimator (Efron-Tibshirani 1997) generalizes OOB. Naive OOB is slightly pessimistic (each model trains on only ~63.2% of the data); .632+ combines training error and OOB error with data-dependent weights to de-bias, and is the theoretically preferred small-n alternative to CV.
Uncertainty quantification for ML. Bootstrap the training set → train B models → get a distribution over predictions at each test point. Practical for linear models, trees, and small neural nets. For deep learning, the bootstrap is prohibitively expensive, but deep ensembles (Lakshminarayanan et al., 2017) — training B networks from different random initializations on the same data — emerged as the DL analog and remain the strongest baseline for uncertainty estimation, outperforming MC-Dropout and variational methods in most benchmarks.
Bootstrap CIs for ML metrics (AUC, precision@k, NDCG@10) are the standard approach — no closed-form SE exists for most ranking metrics, so bootstrap the test set and recompute.
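A sketch of that pattern for AUC, resampling test-set rows and recomputing the metric on each resample. To keep it self-contained we compute AUC via the Mann-Whitney identity in NumPy (`auc_score` is an illustrative helper, not a library function):

```python
import numpy as np

def auc_score(y_true: np.ndarray, scores: np.ndarray) -> float:
    """AUC via the Mann-Whitney identity: P(score_pos > score_neg),
    counting ties as 1/2."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(7)
n = 400
y = rng.integers(0, 2, size=n)                 # test-set labels
scores = rng.normal(loc=y * 1.0, scale=1.0)    # informative classifier scores

# Bootstrap the TEST SET rows; recompute AUC on each resample
B = 2000
idx = rng.integers(0, n, size=(B, n))
auc_star = np.array([auc_score(y[i], scores[i]) for i in idx])
ci = tuple(np.quantile(auc_star, [0.025, 0.975]))
```

The same loop works unchanged for precision@k or NDCG@10: only the metric function changes, which is exactly why bootstrap is the default for ranking-metric uncertainty.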
Bootstrap vs Permutation vs Parametric CI — Decision Matrix
| Method | Goal | Key Assumption | Accuracy | Cost | When It Fails |
|---|---|---|---|---|---|
| Parametric CI (t, z) | CI for mean/proportion | Normality or CLT applicable; known SE formula | Exact if assumption holds | O(1) | Non-normal, heavy-tailed, quantiles, ratios |
| Non-parametric Bootstrap | CI for any smooth statistic | i.i.d. data, finite variance, smooth functional | First-to-second order | O(B·cost(θ̂)) | Extremes, dependence, n<10, infinite variance |
| Block Bootstrap | CI for time-series statistic | Stationarity, mixing | First-order | O(B·cost(θ̂)) | Non-stationary, structural breaks |
| Permutation Test | P-value for H0: exchangeability | Exchangeability under H0 | Exact | O(P·cost(θ̂)) | Unequal variances (use studentized), paired data |
| Jackknife | Bias + SE for smooth statistic | Smooth functional, i.i.d. | First-order, less efficient than bootstrap | O(n·cost(θ̂)) | Non-smooth statistics (e.g., the median) |
Jackknife — The Cheaper Cousin
Jackknife (Quenouille 1949, Tukey) predates the bootstrap by 30 years. For each i ∈ {1,...,n}, compute θ̂₍ᵢ₎ = T(X₋ᵢ) — the statistic with the i-th observation removed. The jackknife SE and bias estimators use these n leave-one-out replicates.
Advantages: only n computations (cheaper than B=10,000 bootstrap). Produces the acceleration parameter a in BCa via the third-moment formula. Historically the go-to bias-correction method.
Disadvantages: less efficient than bootstrap, and it fails catastrophically for non-smooth statistics like the median — the leave-one-out replicates jump discontinuously as points are removed, so the jackknife variance estimator is inconsistent. For any modern application, prefer the bootstrap; jackknife's main role today is computing BCa's acceleration parameter and variance estimates for bagged predictors (the jackknife-after-bootstrap idea).
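A sketch of the jackknife SE (`jackknife_se` is an illustrative helper). A useful sanity check: for the sample mean it reproduces the classical s/√n exactly:

```python
import numpy as np

def jackknife_se(data: np.ndarray, statistic) -> float:
    """Tukey's jackknife SE from n leave-one-out replicates:
    SE = sqrt((n-1)/n * sum((theta_(i) - theta_bar)^2))."""
    n = len(data)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2)))

rng = np.random.default_rng(8)
data = rng.normal(10, 2, size=100)
se_jack = jackknife_se(data, np.mean)
se_classical = data.std(ddof=1) / np.sqrt(len(data))  # agrees exactly for the mean
```

Swap `np.mean` for `np.median` and the estimate becomes unreliable, which is the non-smoothness failure described above.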
Common Bootstrap Pitfalls in Production
1. Ignoring pairing. For paired data (before/after per user), you must resample pairs, not individual observations — otherwise you break the dependence that the paired design is exploiting, inflating variance.
2. Ignoring clustering. If users appear multiple times (multiple sessions, multiple purchases), resample at the user level (cluster bootstrap), not the row level. Row-level bootstrap undercounts within-user correlation and produces CIs that are too narrow — the #1 source of "significant" results that do not replicate.
3. Bootstrap for testing instead of permutation. If the question is "is A different from B under H₀," use a permutation test. Bootstrap CIs inverted into tests have the wrong reference distribution.
4. Too few bootstrap samples. B = 100 is insufficient for any CI work — tail quantiles are noisy. Use B ≥ 1,000 for 95% CIs and B ≥ 10,000 if you need 99% CIs or BCa acceleration estimates to be stable.
5. Using percentile CI when statistic is biased. If θ̂ is biased (e.g., the sample variance with the 1/n denominator), percentile CIs are wrong. BCa corrects for bias; basic CI does as well but loses the transformation-invariance of BCa.
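Pitfall 2 in code: a sketch of a user-level (cluster) bootstrap against the naive row-level SE (`cluster_bootstrap_mean` is an illustrative helper; the simulated user effect stands in for within-user correlation):

```python
import numpy as np

def cluster_bootstrap_mean(values: np.ndarray, user_ids: np.ndarray,
                           B: int = 2000, seed: int = 0) -> np.ndarray:
    """Resample whole users with replacement, keeping every row
    belonging to each sampled user. Never resample individual rows
    when rows within a user are correlated."""
    rng = np.random.default_rng(seed)
    users = np.unique(user_ids)
    rows_by_user = {u: np.flatnonzero(user_ids == u) for u in users}
    reps = np.empty(B)
    for b in range(B):
        sampled = rng.choice(users, size=len(users), replace=True)
        rows = np.concatenate([rows_by_user[u] for u in sampled])
        reps[b] = values[rows].mean()
    return reps

# Simulated sessions: 5 rows per user, strong within-user correlation
rng = np.random.default_rng(9)
n_users, sessions = 200, 5
user_effect = rng.normal(0, 1, n_users)
user_ids = np.repeat(np.arange(n_users), sessions)
values = user_effect[user_ids] + rng.normal(0, 0.3, n_users * sessions)

cluster_se = cluster_bootstrap_mean(values, user_ids).std(ddof=1)
row_se = values.std(ddof=1) / np.sqrt(len(values))   # naive row-level SE, too narrow
```

With a strong user effect the cluster-bootstrap SE comes out roughly twice the row-level SE here, which is exactly the gap that produces "significant" results that fail to replicate.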
Production Use at Netflix, Stripe, and Beyond
Netflix uses bootstrap for CIs on ranking metrics (NDCG, MRR) in offline experimentation, and for revenue-per-user CIs where the long-tail of subscription revenue violates t-test assumptions. Block bootstrap handles autocorrelated engagement metrics.
Stripe uses bootstrap for A/B test CIs on payment success rates by merchant segment (heavy-tailed transaction volume) and for ratio metrics like authorization rate per attempt, where the numerator and denominator are correlated and no closed-form SE exists.
Uber, DoorDash, LinkedIn apply cluster bootstrap at the user or city level to avoid the within-user correlation trap for metrics like bookings-per-session or jobs-viewed-per-visit.
General patterns: (1) bootstrap for all ratio metrics; (2) cluster bootstrap when the randomization unit is not the measurement unit; (3) block bootstrap for intraday seasonality or autocorrelated metrics; (4) permutation tests for hypothesis testing on any custom metric without a standard test; (5) B = 10,000 as the default for production CIs, computed asynchronously via Spark or Ray if the statistic is expensive.
What to Say in the Interview Tomorrow
"Bootstrap is Efron's 1979 plug-in method for simulating the sampling distribution of any statistic by resampling with replacement from the empirical distribution. It does not create new data — it quantifies what variability the existing n points imply about the estimand.
For production, I default to BCa CIs (Efron 1987) with B=10,000 because BCa corrects for bias and skewness and is transformation-invariant. I use percentile CIs only as a quick sanity check.
I know bootstrap fails for extremes (max, min), dependent data (use block bootstrap — Kunsch 1989), n<10, and infinite variance distributions. For hypothesis testing I reach for permutation tests, not bootstrap CIs — the reference distribution is correct under H₀.
In ML, bootstrap is the foundation of bagging (Breiman 1996) — each tree in a random forest is trained on a bootstrap sample, and the ~36.8% out-of-bag samples give free unbiased error estimates. For deep learning uncertainty, deep ensembles (Lakshminarayanan 2017) are the practical analog since bootstrapping full training runs is too expensive.
The trap I avoid: bootstrapping rows when data is clustered by user. That undercounts within-user correlation and produces CIs that are systematically too narrow — use cluster bootstrap at the randomization unit."