
Cross-Validation Strategies: K-Fold, Time Series, Nested CV, and Leakage-Proof Pipelines

The definitive guide to cross-validation for ML interviews. Covers stratified/group/time-series K-fold, nested CV for hyperparameter search, purged and embargoed CV (López de Prado), bootstrap .632+, and the leakage traps that silently inflate offline scores. Includes production-grade sklearn Pipelines and a from-scratch purged CV implementation.


Why a Single Train/Test Split Is Not Enough

A single 80/20 train/test split is a point estimate of generalization error with no uncertainty attached. On a 10K-sample dataset, swapping the random seed can shift the reported validation accuracy by 1–3 percentage points — more than the difference between most candidate models. That noise is indistinguishable from a real improvement.

Cross-validation (CV) solves two problems at once:

  1. Lower variance estimate of generalization error — averaging K held-out scores reduces the variance of the estimator by roughly a factor of K (though the folds are not independent, so the true reduction is less).
  2. Uncertainty quantification — the spread of per-fold scores gives you a confidence interval you can use to decide whether Model B is meaningfully better than Model A, not just lucky on one split.

The misconception most candidates have: CV is "just a way to use more data." It's not. CV is a bias-variance tradeoff on the estimator of the generalization error itself. K=2 has low variance but high bias (half the data is never trained on). K=n (leave-one-out) has low bias but very high variance. K=5 or 10 is the empirically-backed sweet spot (Kohavi 1995, ESL ch 7.10).

Interviewers test whether you understand CV as an estimator with its own bias and variance — not as a ritual you perform because cross_val_score exists.
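To see both points concretely, here is a minimal sketch (synthetic data and logistic regression assumed purely for illustration) that contrasts a handful of re-seeded single splits with one 5-fold run:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# A single 80/20 split, repeated with different seeds: the "point estimate" moves around
single_split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    single_split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
print(f"single splits: min={min(single_split_scores):.3f}, max={max(single_split_scores):.3f}")

# One 5-fold CV run: a mean plus a per-fold spread you can actually reason about
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")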

IMPORTANT

What Interviewers Evaluate on Cross-Validation

The 6/10 answer: 'K-fold splits the data into K folds, trains on K-1, tests on 1, averages the scores.' Correct, but shallow.

The 9/10 answer hits four layers:

(1) CV variant selection as a function of data structure — stratified for class imbalance, group for user/patient grouping, time-series split for temporal data. Explicitly says 'standard K-fold is broken for time series — future leaks into past.'

(2) Leakage traps CV cannot save you from — fitting a scaler, imputer, target encoder, or SMOTE on the full dataset before CV splitting silently inflates scores by 1–10%. Fix: every target- or cross-row-aware transformation lives inside a sklearn Pipeline passed to cross_val_score.

(3) Nested CV when tuning hyperparameters on small data — single CV is biased for model selection because you overfit to the validation folds. Outer loop estimates generalization, inner loop tunes.

(4) Production reality — CV is a proxy. Online A/B is ground truth. The gap between CV-test and A/B (CV inflation) is often more diagnostic than the absolute score. A 2% CV improvement that yields 0% online lift usually means temporal or group leakage in the CV setup.

Clarifying Questions Before Choosing a CV Strategy

01

Is there a temporal dimension in the data?

If labels or features have timestamps (financial data, user behavior, sensor streams), standard K-fold is broken — random folds let the model see future rows during training. Use expanding/rolling window time-series split. Follow-up: do labels span multiple timestamps (e.g., a 5-day return label)? Then you need purged + embargoed CV to remove overlap leakage.

02

Are there natural groupings that must stay together?

If the same user, patient, device, or scene appears in multiple rows, standard K-fold will split the same entity across train and test — massively inflating scores. Use GroupKFold (or StratifiedGroupKFold for imbalanced labels). The CheXNet pneumonia dataset controversy is the canonical cautionary tale: patient-level leakage made the reported AUC uninterpretable.

03

Is the target variable imbalanced?

If positive rate < 20%, random K-fold can produce folds with 0 positives (or wildly varying positive rates), inflating score variance. Use StratifiedKFold — preserves the positive rate in every fold. Mandatory for fraud, churn, medical diagnosis, click prediction.

04

How much data do you have?

n < 1K: repeated stratified 5-fold (5–10 repeats) or bootstrap .632+ for tight confidence intervals. n = 1K–100K: standard 5- or 10-fold. n > 100K: a fixed 60/20/20 train/val/test with a single split is fine — the variance of the estimator is already small, and CV's compute cost isn't worth it.

05

Are you also tuning hyperparameters?

If yes, single CV gives an optimistically biased performance estimate (you've selected the hyperparameters that happened to do well on the validation folds). Use nested CV: outer loop for unbiased generalization estimate, inner loop for hyperparameter selection. Skip nested CV only if you have a separate held-out test set that was never touched during tuning.

06

What does a production failure look like?

If the downstream system is real-time ranking, you care about drift and distributional shift — use rolling-window CV that mimics production retraining cadence. If it's a batch forecasting system, use anchored/expanding-window backtests that match production re-fitting schedule. The CV design should mirror how the model will be retrained in production, not just maximize offline score.

The Bias-Variance Tradeoff of the CV Estimator Itself

This is the insight most candidates never articulate: CV is a statistical estimator, and like any estimator it has its own bias and variance. The choice of K controls where you sit on that tradeoff.

Bias of the CV estimator comes from the fact that each training fold has only n·(K-1)/K samples, fewer than the final model will be trained on. For K=2, you train on half the data → the model is significantly worse than the final one → CV error overestimates true generalization error. As K grows, bias shrinks: K=n (leave-one-out, LOO) is nearly unbiased because each training fold has n-1 samples.

Variance of the CV estimator has two sources: (1) variance from the test folds (small test folds → noisy per-fold scores), and (2) correlation between folds (neighboring folds share most training data → their errors are highly correlated → averaging them reduces variance less than you'd hope). LOO is the worst case: every pair of training sets differs by only 2 samples → errors are nearly perfectly correlated → variance of the average is close to variance of a single fold.

The empirical winner: K=5 or 10, established by Kohavi's 1995 study and confirmed in ESL ch 7.10. Low enough K that folds are not too correlated; high enough that bias is small.

Interview trap: "LOO gives the best estimate, right?" — No. LOO has nearly unbiased mean but higher variance than 10-fold CV. For a given model, a single run of 10-fold CV gives a tighter confidence interval than a single run of LOO. LOO also costs n model fits, not 10.
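A rough way to see the tradeoff empirically: draw many small datasets from the same synthetic distribution, compute the 2-fold, 10-fold, and LOO estimates on each, and compare how the estimates scatter. The sketch below is illustrative only (the scatter mixes dataset-sampling noise with CV noise), not a reproduction of the Kohavi study:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

n_small = 100                      # small-sample regime where the differences are visible
estimates = {"2-fold": [], "10-fold": [], "LOO": []}

for seed in range(30):             # 30 independent datasets from the same distribution
    X, y = make_classification(n_samples=n_small, n_features=10, flip_y=0.1, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    estimates["2-fold"].append(cross_val_score(model, X, y, cv=KFold(2, shuffle=True, random_state=0)).mean())
    estimates["10-fold"].append(cross_val_score(model, X, y, cv=KFold(10, shuffle=True, random_state=0)).mean())
    estimates["LOO"].append(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())

for name, vals in estimates.items():
    vals = np.array(vals)
    # mean across datasets hints at the bias direction; std across datasets at the estimator's variance
    print(f"{name:>7}: mean={vals.mean():.3f}, std across datasets={vals.std():.3f}")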

[Diagram: bias-variance of the K-fold CV estimator as a function of K]

CV Variant Decision Matrix — Which Strategy for Which Scenario

Data Characteristic | Recommended CV | Why | Production Example
Tabular, balanced labels, n > 1K | Standard K-Fold (K=5) | Low-variance estimator, cheap, widely understood | Ames housing regression
Imbalanced classification (< 20% positive) | Stratified K-Fold (K=5) | Preserves class ratio in every fold; prevents zero-positive folds | Fraud detection (0.1% positives), churn (5%)
Multiple rows per user/patient/device | Group K-Fold or Stratified Group K-Fold | Keeps each entity intact; prevents identity leakage | Medical imaging (CheXNet), recsys user-level split
Time-series, single label per row | TimeSeriesSplit (expanding window) | No future-to-past leakage; mimics production retraining | Demand forecasting, stock price prediction
Time-series, overlapping labels (t..t+h returns) | Purged + embargoed CV (López de Prado) | Label windows overlap → purge training rows whose label window crosses the test window, embargo the window after | Quantitative finance (5-day forward returns)
Very small data (n < 200) | Repeated Stratified K-Fold or Bootstrap .632+ | Averaging over repeats reduces seed variance; .632+ corrects bootstrap optimism | Medical studies, rare-disease classification
Hyperparameter tuning on small data (n < 5K) | Nested CV (outer 5-fold × inner 3-fold) | Outer gives unbiased generalization; inner selects hyperparameters | Academic benchmarks, Kaggle with small train
n > 100K, no grouping/time structure | Single 60/20/20 split | CV estimator variance already tiny; K-fold compute not worth it | Industrial recsys, ad click prediction

Three CV Strategies Visualized — Standard K-Fold, Time-Series Expanding, Purged & Embargoed

[Diagram: fold layouts for standard K-fold, expanding-window time-series split, and purged + embargoed CV]

The K-Fold Variants — When Each Is Mandatory

Standard K-Fold (sklearn.KFold): shuffles row indices, partitions into K equal folds. Valid only when rows are independent and identically distributed (i.i.d.) and labels are well-balanced. In practice, this assumption holds less often than textbooks imply.

Stratified K-Fold (StratifiedKFold): preserves the class distribution in every fold. Mandatory for any imbalanced classification — skipping it on a 1%-positive fraud dataset routinely produces folds with 0 positives, giving undefined precision/recall and wildly variable AUC. Also useful for regression via binning the target into quantiles (StratifiedKFold on discretized y).

Repeated K-Fold (RepeatedStratifiedKFold): runs K-fold multiple times with different random seeds, averages across all runs. Reduces seed-sensitivity when n is small. A typical config is 5-fold × 10 repeats = 50 model fits. Use this when the single-run variance exceeds the effect size you're trying to measure.

Leave-One-Out (LOO): K = n. Nearly unbiased but high-variance estimator; costs n fits. Useful only when n is very small (say, < 100) and each sample is precious. For deep learning, LOO is essentially never viable.

Group K-Fold (GroupKFold): takes a groups array; ensures all rows with the same group ID land in the same fold. Mandatory when the same entity (user, patient, scene, session) appears in multiple rows. Classic failures without group-aware splitting: medical imaging models trained on patient A's images tested on patient A's other images → 98% AUC offline, 60% in deployment. The CheXNet debate surfaced exactly this issue — whether the original train/test split leaked patients across folds.

Stratified Group K-Fold (StratifiedGroupKFold, sklearn ≥ 1.0): combines both — preserves class balance across folds while keeping groups intact. Use for imbalanced medical/recsys data.
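A minimal sketch of both group-aware splitters, with synthetic patient IDs standing in for the grouping column:

import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedGroupKFold

rng = np.random.default_rng(0)
n = 1_000
patient_id = rng.integers(0, 200, size=n)        # ~5 rows per patient
y = rng.binomial(1, 0.1, size=n)                 # ~10% positive rate
X = rng.normal(size=(n, 8))

# GroupKFold: no patient ever appears in both train and test of the same fold
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=patient_id):
    assert set(patient_id[tr]).isdisjoint(patient_id[te])

# StratifiedGroupKFold (sklearn >= 1.0): groups intact AND positive rate roughly preserved per fold
for tr, te in StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0).split(X, y, groups=patient_id):
    print(f"test positive rate: {y[te].mean():.3f}")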

Time-Series CV — Why Random K-Fold Is Actively Broken

For time-ordered data, standard K-fold leaks future information into training. A fraud model "trained" on May data and "tested" on April data has seen the future — it knows which accounts will later be flagged and can use that signal via any feature with temporal correlation. Offline AUC looks amazing; production AUC collapses.

TimeSeriesSplit (expanding window): split 1 = train on [0, 0.2n], test on [0.2n, 0.3n]; split 2 = train on [0, 0.3n], test on [0.3n, 0.4n]; and so on. The training window grows each split — mirrors a production system that accumulates history and retrains periodically. This is the default for forecasting benchmarks.

Rolling/sliding window: fixed-size training window that slides forward. Train on [0, w], test on [w, w+h]; then train on [step, w+step], test on [w+step, w+h+step]. Use this when you suspect concept drift — old data actively hurts, so you want to forget it. Rolling window directly tests the "will this model still work 6 months from now?" question.
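Both windows come from the same sklearn class; max_train_size is what turns the expanding window into a rolling one. The sizes below are illustrative:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(1_000).reshape(-1, 1)   # stand-in for time-ordered features

# Expanding window: the training set grows with every split
expanding = TimeSeriesSplit(n_splits=5)
for tr, te in expanding.split(X):
    print(f"expanding  train=[{tr[0]}..{tr[-1]}]  test=[{te[0]}..{te[-1]}]")

# Rolling window: cap the training size so old data is forgotten (drift-sensitive setups)
rolling = TimeSeriesSplit(n_splits=5, max_train_size=200, test_size=100, gap=0)
for tr, te in rolling.split(X):
    print(f"rolling    train=[{tr[0]}..{tr[-1]}]  test=[{te[0]}..{te[-1]}]")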

Purged + Embargoed CV (Marcos López de Prado, Advances in Financial Machine Learning, 2018): the gold standard when labels span multiple timestamps. A 5-day forward-return label for row at time t actually depends on data through t+5. If your test block covers [t, t+3], any training row with label window crossing [t, t+3] has seen the same price path as the test — that's leakage, and naive TimeSeriesSplit misses it entirely. Fix: (1) purge — remove training rows whose label window overlaps the test period; (2) embargo — additionally remove training rows in a small window immediately after the test block (typically 1% of n), preventing serial-correlation leakage.

Combinatorial Purged CV (López de Prado, ch 12): splits the timeline into N blocks and forms all C(N,k) combinations of k test blocks. Gives multiple non-overlapping backtest paths, enabling rigorous backtest statistics (Sharpe ratio distributions under the null) rather than a single point estimate. The go-to method for serious quant strategy evaluation.

Purged + Embargoed Time-Series CV from Scratch

purged_cv.py (Python)
import numpy as np
import pandas as pd
from typing import Iterator, Tuple

def purged_embargoed_kfold(
    t1: pd.Series,          # per-row label end-time (t1[i] = when label[i] becomes known)
    n_splits: int = 5,
    embargo_pct: float = 0.01,
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Purged + embargoed CV (Lopez de Prado, Advances in Financial ML, ch 7).

    For each row i:
      - Feature is available at index t0[i] (taken as i, the row position)
      - Label is fully observed at t1[i]  (label window [t0[i], t1[i]])

    Purge: drop train rows whose [t0, t1] overlaps the test block.
    Embargo: additionally drop train rows in a window of size embargo_pct * n
             starting right after the test block, blocking serial-correlation leakage.

    Yields (train_idx, test_idx) per split, like sklearn's KFold.
    """
    n = len(t1)
    indices = np.arange(n)
    test_starts = np.array_split(indices, n_splits)
    embargo = int(n * embargo_pct)

    for test_block in test_starts:
        test_start, test_end = test_block[0], test_block[-1]
        test_idx = test_block

        # Candidate train set: everything outside the test block
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False

        # PURGE: drop train rows whose label window [t0, t1] overlaps test block
        # i.e. label observed after test_start AND feature taken before test_end
        test_time_start = test_start
        test_time_end   = test_end
        overlaps = (t1.values >= test_time_start) & (indices <= test_time_end)
        train_mask &= ~overlaps

        # EMBARGO: drop train rows within embargo window after test block
        if embargo > 0:
            embargo_end = min(n, test_end + 1 + embargo)
            train_mask[test_end + 1 : embargo_end] = False

        train_idx = indices[train_mask]
        yield train_idx, test_idx


# Usage: label[i] is a 5-day forward return => t1[i] = i + 5
df = pd.DataFrame({"price": np.random.randn(1000)})
df["t1"] = df.index + 5  # 5-step-ahead label

for fold, (tr, te) in enumerate(
    purged_embargoed_kfold(df["t1"], n_splits=5, embargo_pct=0.01)
):
    print(f"Fold {fold}: train={len(tr)}, test={len(te)}, "
          f"purged={1000 - len(tr) - len(te)}")

Nested CV — The Only Honest Answer for Small-Data Hyperparameter Tuning

The problem with single CV for hyperparameter tuning: if you run 5-fold CV for 50 hyperparameter configurations and report the best score, you have selected the configuration that happened to perform best on your specific fold assignments. That score is an optimistically biased estimate of true generalization — you've implicitly overfit to the validation folds.

Varma & Simon (2006) and Cawley & Talbot (2010) showed this bias can reach 5–15% of the reported metric on small datasets. This is exactly the regime where many academic/medical ML results fail to replicate.

Nested CV fixes this: two nested loops.

  • Outer loop (e.g., 5-fold): each outer test fold gives one unbiased estimate of generalization. Model selection/tuning happens inside the outer training fold only; the outer test fold is truly untouched.
  • Inner loop (e.g., 3-fold) runs inside each outer training set to select hyperparameters. The selected hyperparameters are then refit on the full outer training fold and evaluated on the outer test fold.

Total cost: outer_K × inner_K × n_hyperparams model fits. For 5-outer × 3-inner × 50 configs = 750 fits. Expensive, but it's the only unbiased way to simultaneously (a) tune hyperparameters and (b) report an honest generalization estimate on small data.

When nested CV is overkill: n > 100K with a held-out test set that was split off before any tuning. In that regime, a single 60/20/20 split with the test set sealed in a vault until the end gives you the unbiased estimate for free, and simple 5-fold on the 80% remainder is enough for tuning. Save the compute.
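In sklearn, nested CV is simply a search estimator passed to cross_val_score: the outer splitter never sees the inner tuning. The grid and dataset below are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)   # hyperparameter selection
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # generalization estimate

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold: the inner search runs on the outer-train portion only,
# refits the best config on it, and is scored once on the untouched outer-test fold.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested CV AUC: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")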

Nested CV — Outer Loop for Generalization, Inner Loop for Hyperparameters

[Diagram: outer folds estimate generalization; inner folds select hyperparameters]
⚠ WARNING

Leakage Traps CV Cannot Save You From

Every one of these silently inflates offline scores by 1–10% without raising an error. All of them involve fitting a data-aware transformation on the full dataset before CV splits — the model sees validation-row information through the transformer.

(1) Scaler/Normalizer: calling StandardScaler().fit(X) before cross_val_score leaks validation means/stds into training. Fix: wrap in Pipeline([('scaler', StandardScaler()), ('clf', model)]) and pass the pipeline to CV — sklearn refits the scaler inside each fold automatically.

(2) Imputer: SimpleImputer(strategy='mean').fit(X) uses validation rows to compute the mean. Same Pipeline fix.

(3) Target encoder: df['city_enc'] = df.groupby('city')['target'].mean() is the #1 leakage source in Kaggle competitions. The encoding value for each row includes its own label. Fix: out-of-fold target encoding (encode row i using target statistics computed without row i's fold) or the category_encoders.TargetEncoder class, placed inside a Pipeline so it re-fits per CV fold.

(4) SMOTE / oversampling: SMOTE().fit_resample(X, y) before CV oversamples from validation rows into training. Use imbalanced-learn's Pipeline (supports resampling steps) inside cross_val_score.

(5) Feature selection: SelectKBest(k=100).fit(X, y) uses validation-row y values to pick features. Every feature selector must live inside the Pipeline.

(6) Group leakage: same user/patient in train and test (covered in group K-fold section). CV variant choice — not Pipeline — fixes this one.

(7) Temporal leakage: future-derived features (e.g., last_30_days_clicks computed across the full dataset). CV variant choice + feature engineering audit fix this one.

Leakage-Proof Pipeline — Target Encoding Inside CV

leakage_proof_cv.py (Python)
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from category_encoders import TargetEncoder  # pip install category_encoders

# Example: fraud detection — categorical city + numerical amount
# Target rate ~2% → MUST use StratifiedKFold
df = pd.DataFrame({
    "city": np.random.choice(["NYC", "SF", "LA", "SEA"], size=10_000),
    "amount": np.random.lognormal(3, 1, size=10_000),
    "y": np.random.binomial(1, 0.02, size=10_000),  # 2% positive rate
})
X, y = df[["city", "amount"]], df["y"]

# Build a pipeline where EVERY data-aware transformation is re-fit per CV fold
preprocess = ColumnTransformer([
    # Target encoding — leakage-proof because pipeline refits inside each fold
    ("city_enc", TargetEncoder(smoothing=10.0), ["city"]),
    # Scaler also refits per fold — no test-set mean leaks in
    ("amount_scaled", Pipeline([
        ("imp", SimpleImputer(strategy="median")),
        ("scl", StandardScaler()),
    ]), ["amount"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(n_estimators=200, max_depth=3)),
])

# StratifiedKFold because positive rate = 2% — random KFold would give unstable AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc", n_jobs=-1)

print(f"AUC: {scores.mean():.4f} ± {scores.std():.4f}")
# Spread across folds (std) is as informative as the mean:
# std > 0.01 on n=10K → high variance, consider repeated CV or more data

Leakage Type × Which CV Strategy (and Pipeline) Fixes It

Leakage Type | Example | Does CV Variant Fix It? | Does Pipeline Fix It? | Correct Fix
Preprocessing leakage | Scaler/imputer fit on full dataset | No | Yes | Put transformer inside sklearn Pipeline
Target encoder leakage | City mean-target encoded on full df | No | Yes (with category_encoders in Pipeline) | Out-of-fold target encoding inside Pipeline
SMOTE leakage | Oversampling before CV split | No | Yes (with imbalanced-learn Pipeline) | SMOTE step inside imblearn Pipeline
Feature selection leakage | SelectKBest(y) on full data | No | Yes | Selector inside Pipeline
Group leakage | Same user in train + test | Yes | No | GroupKFold / StratifiedGroupKFold
Temporal leakage (i.i.d. split) | Random KFold on time-series | Yes | No | TimeSeriesSplit (expanding/rolling)
Overlapping label leakage | 5-day return labels across splits | Yes (purged + embargoed) | No | Purged + embargoed CV (López de Prado)
Hyperparameter selection bias | Reporting the best CV score over 50 configs | Yes (nested CV) | No | Nested CV or a sealed test set
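For the SMOTE row above, imbalanced-learn's Pipeline (not sklearn's) is what keeps resampling inside each training fold; a minimal sketch on synthetic imbalanced data:

import numpy as np
from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
from imblearn.pipeline import Pipeline            # imblearn's Pipeline supports resampling steps
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5_000, weights=[0.97, 0.03], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),             # re-fit and re-sampled inside each training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"AUC with in-fold SMOTE: {scores.mean():.4f} ± {scores.std():.4f}")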

Hyperparameter Search — Grid vs Random vs Bayesian vs Successive Halving

CV scores only matter if you search the hyperparameter space efficiently. Four options, in rough order of sophistication:

Grid search (GridSearchCV): evaluates every combination on a predefined grid. Simple but scales exponentially with the number of hyperparameters and spends most of its compute on configurations that are obviously bad. Use only when the grid is tiny (< 20 configurations).

Random search (RandomizedSearchCV): samples configurations from distributions over each hyperparameter. Bergstra & Bengio (2012) showed that when only a few hyperparameters actually matter (nearly always), random search with budget B finds better configurations than grid search with the same budget. Rule of thumb: 60 random samples give ≥95% probability of landing in the top 5% of the search space. Prefer random over grid by default.

Bayesian optimization (Optuna, Hyperopt, scikit-optimize): fits a probabilistic surrogate (Gaussian process or TPE) to past CV scores and picks the next configuration to maximize expected improvement. Shines when each fit is expensive (> 1 min) and the budget is moderate (50–200 trials). Overkill when fits are cheap — the GP overhead dominates.

Successive Halving (HalvingRandomSearchCV, Hyperband): starts with many configurations on a small budget (a subsample of the data, few boosting rounds, or few epochs), keeps the best-performing fraction (a third with sklearn's default factor of 3), scales the budget up by the same factor, and repeats. Asymptotically optimal when early performance predicts final performance, which is usually true for iterative learners like XGBoost or neural nets. This is the default for large-scale HPO at FAANG.

Critical detail for iterative learners: when using early stopping inside folds, don't average the "best iteration" across folds — use max of best iterations (or tune n_estimators explicitly without early stopping in the final refit). Averaging undercounts because some folds converge slower but would still benefit from more iterations; using max approximates the production retraining behavior better.
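A minimal random-search sketch (distributions and budget are illustrative, not tuned recommendations):

import numpy as np
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=5_000, random_state=0)

# Distributions, not a grid: log-uniform for learning rate, integer ranges for tree structure
param_distributions = {
    "learning_rate": loguniform(1e-3, 3e-1),
    "n_estimators": randint(100, 600),
    "max_depth": randint(2, 6),
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions=param_distributions,
    n_iter=60,                        # ~60 samples: the Bergstra-Bengio rule of thumb
    cv=StratifiedKFold(5, shuffle=True, random_state=0),
    scoring="roc_auc",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, f"best CV AUC: {search.best_score_:.4f}")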

Small-Data and Imbalanced — Repeated CV and the .632+ Bootstrap

With n < 1K, single-run 5-fold CV has uncomfortably wide confidence intervals — a single lucky fold can swing the reported mean by several percent. Two approaches tighten the estimate.

Repeated stratified K-fold (RepeatedStratifiedKFold(n_splits=5, n_repeats=10)): run 5-fold CV ten times with different seeds, average all 50 held-out scores. Cuts the standard error of the estimate by up to √10 ≈ 3× (less in practice, since every repeat reuses the same n rows). Cheap and well-understood. First choice for small-to-medium imbalanced data.
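A minimal sketch of the repeated setup on small synthetic imbalanced data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)   # 5 folds x 10 repeats = 50 fits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print(f"AUC over {len(scores)} held-out folds: {scores.mean():.3f} ± {scores.std():.3f}")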

Bootstrap .632+ (Efron & Tibshirani 1997): draws B bootstrap samples (with replacement) from n rows. For each bootstrap, trains on the sample, tests on the ~36.8% of rows not drawn (out-of-bag). The naive bootstrap estimate is biased (training on n rows with duplicates is not quite the same as training on n distinct rows); .632+ corrects for the optimism of training error using a weighted combination:

  • err(.632+) = (1 − w) · train_err + w · oob_err, where the weight w starts at 0.632 and moves toward 1 as the relative overfitting rate, measured against the no-information error of a random classifier, increases.

When to use it: (1) n < 200, where even repeated CV has too few test points per fold for tight intervals; (2) when you need a smooth CDF of error estimates for formal hypothesis testing. Downside: the math is subtle, and for classification the straight .632 (without +) is known to be biased under severe overfitting — always use .632+.

For most production ML with n > 1K, repeated stratified K-fold is the right default. Reserve .632+ for very small benchmark studies.

⚠ WARNING

Three Production Failures CV Will Not Catch

Even a perfectly-designed CV pipeline leaves three classes of production failures untested:

(1) Training-serving skew: features are computed differently at training time vs inference time. CV cannot detect this because CV uses the same feature computation for both train and validation folds. Detection requires comparing the distribution of features at serving time against the training distribution (PSI per feature > 0.25 = alarm). Fix: a feature store that shares feature computation code between train and serve (Uber Michelangelo, Airbnb Zipline, Feast).

(2) Concept drift: the relationship P(y|x) changes after deployment. CV evaluates against held-out data from the same time period as training — it cannot see the future. Detection requires rolling-window backtests and online monitoring of model calibration. A model that was well-calibrated in CV but whose ECE climbs above 0.05 in production is drifting.

(3) Population shift: CV is on desktop-US users, production serves global mobile users. Different distribution P(x). Detection: PSI per feature, and — more definitively — shadow-mode deployment where the new model scores live traffic without affecting users. Shadow metrics + A/B are the only way to catch this before a full rollout.

The takeaway: a 2% CV improvement that yields 0% online lift usually means one of these three failed to surface during offline evaluation. Always triangulate offline CV with shadow-mode scoring and a proper A/B test.
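PSI, referenced in (1) and (3) above, is only a few lines to compute; the sketch below is one common formulation (quantile bins taken from the training distribution, 0.25 alarm threshold as quoted above), not a library call:

import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """PSI between a training-time feature distribution (expected) and a serving-time one (actual)."""
    # Bin edges from the training distribution; quantile bins keep expected counts roughly equal
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor so empty bins don't produce log(0)
    exp_frac = np.clip(exp_frac, 1e-6, None)
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

train_feature = np.random.normal(0.0, 1.0, 50_000)    # training distribution
serve_feature = np.random.normal(0.3, 1.2, 50_000)    # shifted serving distribution
psi = population_stability_index(train_feature, serve_feature)
print(f"PSI = {psi:.3f}  ({'ALARM' if psi > 0.25 else 'ok'})")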

Offline CV Score vs. Online A/B — Which Gap Signals What

Offline Result | Online A/B Result | Likely Cause | Next Action
+2% CV AUC | +1.5% online lift | Healthy signal: CV reflects reality | Ship with monitoring; set up weekly CV refresh
+2% CV AUC | 0% online lift (flat) | Temporal or group leakage in CV, or metric ≠ business outcome | Audit CV splits (group? time?); check metric alignment
+2% CV AUC | Negative online impact | Severe leakage, novelty/learned-user-behavior effect, or calibration drift | Roll back; inspect learned features; check shadow metrics
+0.5% CV AUC | +2% online lift | Rare; usually means the offline metric is a noisy proxy (e.g., log loss vs revenue) | Investigate; often reveals a better offline metric
CV std > 1% | Any result | High-variance estimator; sample size too small for a confident decision | Use repeated CV or collect more data before deciding
TIP

What to Say in the Interview Tomorrow

If you have two minutes to summarize cross-validation for an interviewer, hit this spine:

1. CV is an estimator with its own bias-variance tradeoff. K=5 or 10 is the empirical sweet spot (Kohavi 1995). LOO is nearly unbiased but high-variance and expensive — not 'the best' despite sounding like it.

2. The CV strategy must match the data structure. Stratified for imbalance (mandatory). Group for user/patient-level data (CheXNet-style leakage). TimeSeriesSplit for temporal data. Purged + embargoed (López de Prado) when labels span multiple timestamps.

3. CV cannot fix preprocessing leakage. Every transformation that uses target or cross-row information — scaler, imputer, target encoder, SMOTE, feature selector — must live inside a sklearn Pipeline passed to cross_val_score. Fitting on the full dataset before CV is the #1 silent source of inflated scores.

4. Nested CV for hyperparameter tuning on small data. Single CV is optimistically biased after you select the best config — Varma & Simon 2006 document 5–15% bias on small sets. Outer 5-fold × inner 3-fold is the canonical setup. Skip it only when you have a sealed test set from a 60/20/20 split on large data.

5. Offline CV is a proxy; A/B is ground truth. Design CV to mimic production retraining cadence, and always triangulate with shadow mode + A/B. The CV-to-A/B gap is often more diagnostic than the absolute score.
