
Feature Engineering: Leakage-Safe Encoding, Interactions, Temporal, and Production Parity

The production-grade feature engineering playbook. Covers categorical encoding by cardinality (one-hot, target/mean with K-fold CV smoothing, hashing, embeddings, CatBoost ordered stats), numerical transforms, interactions, cyclical temporal features, the 4 sources of data leakage with concrete fixes (ColumnTransformer + Pipeline), missing-data strategies (MCAR/MAR/MNAR), feature selection (permutation importance vs biased gain importance), and the training-serving skew that silently destroys models in production.


Why Feature Engineering Still Decides Tabular and RecSys Models

"Feature engineering is dead" is true for computer vision and natural language — CNNs and transformers learn their own features end-to-end. For tabular data and recommendation systems, the opposite is true: a tuned LightGBM with thoughtful features routinely beats a TabNet or MLP on datasets below ~1M rows (Shwartz-Ziv & Armon, 2022). At the staff level, the interview signal is knowing which regime you are in and why.

The actual job of feature engineering has three parts, and most candidates only know the first:

  1. Representation — how you encode a column so the model can use it (one-hot vs target encoding vs embedding).
  2. Leakage control — making sure no future or out-of-fold information slips into training features. This is the silent killer: offline AUC is great, production is terrible.
  3. Training-serving parity — making sure the feature you computed at training time is bit-for-bit identical to the one served at inference. A "30-day click count" computed over full history for training but only the last 30 days for serving is a different feature with a different distribution — the model decays without an error log.

The single key insight: a feature is not a transformation of a column. A feature is a contract that must hold identically at training time, serving time, and at every historical point-in-time you join on. Everything in this topic — K-fold target encoding, ColumnTransformer inside a pipeline, feature stores — exists to enforce that contract.

IMPORTANT

What Separates a 9/10 Answer From a 6/10 Answer

A 6/10 candidate lists encoding methods: "one-hot for low cardinality, target encoding for high cardinality, log-transform skewed numerics." Correct, but generic.

A 9/10 candidate structures the answer around cardinality × model family × leakage risk:

  • "For cardinality < 15 and a linear/NN model, one-hot. For trees, I avoid one-hot at K > ~100 — sparse one-hot columns cause imbalanced splits and waste tree depth."
  • "For K > 1000 or unbounded (URLs, user IDs), hashing trick (HashingVectorizer, ~2^18 buckets) or an embedding layer — never target encoding, because rare categories have unstable statistics."
  • "Target/mean encoding in the 50–1000 range, but always with out-of-fold K-fold and Bayesian smoothing (n·ȳ_k + α·ȳ)/(n + α) with α ≈ 10–100. Without K-fold, training AUC jumps 5–10 points vs production — classic leakage."
  • "For time-series, I never use random K-fold for target encoding — I use expanding-window or CatBoost-style ordered target statistics, which are leakage-safe by construction."

That answer signals production experience. The interviewer writes "staff" next to your name.

Clarifying Questions Before Designing Features

01

What is the cardinality distribution of each categorical column?

Ask for df.nunique() or the equivalent. Cardinality determines encoding: < 15 (one-hot), 15–1000 (target encoding with smoothing), > 1000 or unbounded (hashing / embeddings). Staff candidates ask for the cardinality of the serving distribution, not just training — a column with 500 unique values in training can explode to 50K in production.

02

Is the task temporal? What is the label timestamp and the feature computation timestamp?

If labels have timestamps, random K-fold is wrong — use chronological splits and expanding windows. Ask whether each feature can be computed strictly before the label timestamp. 'User lifetime value' computed as of today leaks future data into the features joined to past labels.

03

What is the model family?

Trees (XGBoost, LightGBM, RF) are scale-invariant and handle missing values natively — no standardization, no imputation needed. Linear / NN / k-NN / SVM require standardization and explicit missing handling. One-hot at high cardinality hurts trees but is required for linear models.

04

What is the entity grain and are there group-level correlations?

If multiple rows share a user_id or patient_id, random row split leaks between train and test. Use GroupKFold. Healthcare and recsys data almost always need this.

05

What is the serving latency budget and is the feature computable in-request?

A feature that requires a 30-day aggregation cannot be computed at 10ms p99 serve time. It must be materialized in an online store (Redis) by a background pipeline. The feature contract must specify: batch vs streaming vs request-time.

Leakage-Prone vs Leakage-Safe Feature Pipeline


Categorical Encoding — Decide By Cardinality × Model Family

Categorical encoding is the most failure-prone part of feature engineering. The decision is not "one-hot vs target encoding" — it is a four-axis decision: cardinality, model family, leakage tolerance, and serving cost.

One-hot encoding (K < 15): creates K binary columns. Safe for linear models, neural nets, k-NN. Hurts trees above ~K=100 — sparse one-hot columns create splits with ~1/K of the data on one side, which is rarely the optimal split, so boosted trees waste depth finding them. LightGBM's categorical_feature parameter uses a native Fisher-optimal split that is ~1.5–2× more accurate than one-hot on K=50+ columns.
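Where the native route is taken, the wiring is minimal. A sketch assuming pandas DataFrames X_train / X_valid containing the named columns (names are illustrative, not from the article):

from lightgbm import LGBMClassifier

# Cast once, identically for the training and serving frames, then let LightGBM split natively
for c in ["merchant_id", "country_code"]:
    X_train[c] = X_train[c].astype("category")
    X_valid[c] = X_valid[c].astype("category")

model = LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train, categorical_feature=["merchant_id", "country_code"])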

Label / ordinal encoding: map categories to integers 0..K-1. Only valid when (a) the model is a tree (which treats integers as ordered, so it can split x ≤ 3.5) AND (b) the categories are actually ordinal (size: S=0, M=1, L=2, XL=3) OR you don't care about the imposed order (trees can still split on any threshold). Never use label encoding for linear/NN models — you are forcing a linear ordering on nominal categories.

Frequency encoding: replace each category with its count. Surprisingly effective on trees because frequency is often correlated with target (rare merchants are more suspicious). Zero leakage risk. Good first-pass for K > 100 on tree models.

Feature hashing (K unbounded): hash(category) % 2^18. Fixed memory regardless of cardinality, works for streaming / online learning where new categories appear forever (URLs, user agents, device fingerprints). Downside: collisions. Use with linear models (Vowpal Wabbit built its reputation on this) or as input to an embedding.
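A minimal sketch of the hashing trick with scikit-learn's FeatureHasher, assuming a DataFrame df with url and user_agent columns (both names are illustrative):

from sklearn.feature_extraction import FeatureHasher

# Hash unbounded categoricals into a fixed 2^18-dimensional sparse space.
# Prefixing "field=" keeps values from different columns from sharing the same token.
hasher = FeatureHasher(n_features=2**18, input_type="string")
X_hashed = hasher.transform(
    [[f"url={u}", f"ua={a}"] for u, a in zip(df["url"], df["user_agent"])]
)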

Embedding lookup: a K × d matrix learned end-to-end. Standard for deep recommenders (YouTube, DLRM). Dimension heuristic: d ≈ min(50, K^0.25) (fast.ai) or d ≈ 6·K^0.25 (TensorFlow).

Target (mean) encoding + K-fold + smoothing (K in 50–1000): the most powerful and most dangerous encoding. Replace each category with the mean target of training rows in that category. Adds signal density in one column where one-hot would need 500. But computing this on the full training set leaks every row's label into its own feature — training AUC inflates 5–10 points, production collapses. Always use out-of-fold K-fold + Bayesian smoothing toward the global mean.

Target Encoding with Bayesian Smoothing

Leakage-Safe K-Fold Target Encoder (sklearn-compatible)

kfold_target_encoder.py (Python)
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import KFold

class KFoldTargetEncoder(BaseEstimator, TransformerMixin):
    """
    Out-of-fold target encoder with Bayesian smoothing.
    Prevents label leakage by computing each fold's encoding
    using statistics from the OTHER folds only.
    """
    def __init__(self, cols, n_splits: int = 5, smoothing: float = 20.0,
                 random_state: int = 42):
        self.cols = cols
        self.n_splits = n_splits
        self.smoothing = smoothing
        self.random_state = random_state

    def fit(self, X: pd.DataFrame, y: pd.Series):
        # Global stats — used at transform time for unseen data
        self.global_mean_ = float(y.mean())
        self.mappings_ = {}
        for col in self.cols:
            agg = pd.DataFrame({col: X[col], "y": y.values}).groupby(col)["y"].agg(["mean", "count"])
            agg["smoothed"] = (
                agg["count"] * agg["mean"] + self.smoothing * self.global_mean_
            ) / (agg["count"] + self.smoothing)
            self.mappings_[col] = agg["smoothed"].to_dict()
        return self

    def fit_transform(self, X: pd.DataFrame, y: pd.Series):
        """In-training transform — uses K-fold to avoid leakage."""
        self.fit(X, y)
        kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=self.random_state)
        encoded = np.zeros((len(X), len(self.cols)))

        for tr, va in kf.split(X):
            y_tr = y.iloc[tr]
            global_mean_tr = float(y_tr.mean())
            for ci, col in enumerate(self.cols):
                fold_stats = pd.DataFrame({
                    col: X[col].iloc[tr], "y": y_tr.values
                }).groupby(col)["y"].agg(["mean", "count"])
                smoothed = (
                    fold_stats["count"] * fold_stats["mean"]
                    + self.smoothing * global_mean_tr
                ) / (fold_stats["count"] + self.smoothing)
                mapping = smoothed.to_dict()
                encoded[va, ci] = X[col].iloc[va].map(mapping).fillna(global_mean_tr).values
        return encoded

    def transform(self, X: pd.DataFrame):
        """At inference — use the full-data mapping learned in fit()."""
        out = np.zeros((len(X), len(self.cols)))
        for ci, col in enumerate(self.cols):
            out[:, ci] = X[col].map(self.mappings_[col]).fillna(self.global_mean_).values
        return out
⚠ WARNING

CatBoost's Ordered Target Statistics — Leakage-Safe by Construction

CatBoost (Prokhorenkova et al., 2018) solved target-encoding leakage differently from K-fold. It imposes a random permutation of the training rows and computes each row's target encoding using only the rows earlier in the permutation:

encoded_i = (Σ_{j < i, same category} y_j + α·prior) / (count_{j < i, same category} + α)

This is an online / streaming target encoding. The first few rows get almost no signal (high variance, falls back to the prior); later rows get the full benefit. Multiple permutations are averaged to reduce variance. No leakage is possible because no row's own label ever enters its encoding.

Why this matters for interviews: when asked "how would you use categorical features at 10K cardinality," mentioning CatBoost's ordered target stats signals you've read the paper. It also means: if you use CatBoost, you get leakage-safe target encoding for free — no ColumnTransformer plumbing required. Trade-off: CatBoost is slower than LightGBM, and the ordered mechanism adds some variance vs well-tuned K-fold target encoding.
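A single-permutation sketch of the ordered statistic above — CatBoost averages several permutations and does this inside training, so this is illustrative rather than the library's implementation; the function name and the α default are assumptions:

import numpy as np
import pandas as pd

def ordered_target_stats(cats: pd.Series, y: pd.Series, prior: float,
                         alpha: float = 10.0, seed: int = 0) -> pd.Series:
    """Encode each row using only rows that appear EARLIER in a random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(cats))
    encoded = np.empty(len(cats))
    sums, counts = {}, {}
    for i in order:
        c = cats.iloc[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + alpha * prior) / (n + alpha)   # the row's own label is not yet included
        sums[c], counts[c] = s + float(y.iloc[i]), n + 1
    return pd.Series(encoded, index=cats.index)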

Categorical Encoding Decision Matrix (Cardinality × Model Family)

| Cardinality K | Model = Linear / NN | Model = Tree (XGB / LGBM) | Leakage Risk | Example |
| --- | --- | --- | --- | --- |
| K < 15 | One-hot encoding | One-hot OR native categorical (LGBM `categorical_feature`) | None | country_code, day_of_week |
| 15 – 100 | One-hot (large but manageable) | Native categorical or frequency encoding | None | product_category, browser |
| 100 – 1000 | Target encoding + K-fold + smoothing (α=20) | Native categorical or target encoding | HIGH if no K-fold | zip_code, merchant_id |
| 1000 – 100K | Embedding (d ≈ 16–64) or hashing (2^16) | Target encoding + smoothing OR embedding | HIGH — prefer hashing if K grows | user_id, product_sku |
| > 100K or unbounded | Hashing trick (2^18) or embedding | Hashing trick or LightGBM with `max_bin` tuning | None for hashing; collisions instead | URL, user-agent, query string |
| Time-series, any K | CatBoost ordered target stats OR expanding-window encoding | Same | CRITICAL — random K-fold fails on time series | session_id with temporal labels |

Numerical Feature Transforms — Model Family Dictates the Answer

The first question to ask: do I need to transform numerics at all? For trees (XGBoost, LightGBM, Random Forest), scaling is a no-op — tree splits are threshold-based and scale-invariant. Standardizing age before LightGBM is wasted engineering effort. For everything else, transforms matter.

StandardScaler (subtract mean, divide by std): default for linear models, SVM, k-NN, PCA, neural networks. Assumes roughly Gaussian distributions. Breaks down with heavy outliers — one outlier inflates σ and crushes the signal for the other 99% of rows.

RobustScaler (subtract median, divide by IQR): use when outliers are real and you want to keep them. Credit-card transaction amounts, income, page-view counts — all heavy-tailed.

MinMaxScaler (scale to [0, 1]): use when the algorithm requires bounded inputs (image pixels for a CNN, variational autoencoders). Do not use as a default — a single new outlier in production crushes the whole range.

Log / Box-Cox / Yeo-Johnson for right-skewed positives (income, price, latency, counts). Log is simplest and invertible. Box-Cox requires strictly positive data; Yeo-Johnson is the generalization that handles zeros and negatives. A log-transform on a right-skewed feature often turns a mediocre linear model into a strong one — the same feature in a tree changes nothing because trees can approximate monotonic transforms on their own.

Binning: discretize a numeric into bins. Three variants: equal-width (fast, bad for skewed data), equal-frequency / quantile-based (robust to skew — each bin has the same number of rows), and supervised / tree-based (use a single decision tree to find target-optimal cut points). Useful for linear models that need to capture non-linearities without degree-2 polynomials. Not useful for trees.

Rank / quantile transformation: map each value to its rank. Collapses all monotone transforms to the same representation, making the model strictly invariant to outliers at the cost of losing magnitude information. Common in LightGBM pipelines for very heavy-tailed features.
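A sketch of the main numeric transforms, with illustrative column names and bin counts; in practice each fitted transformer sits inside the Pipeline so it is re-fit per fold:

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, KBinsDiscretizer

X_num = df[["amount", "latency_ms"]]                      # right-skewed positives, illustrative

df["amount_log"] = np.log1p(df["amount"])                 # simplest fix for right skew

yj = PowerTransformer(method="yeo-johnson")               # handles zeros/negatives; Box-Cox would not
X_yj = yj.fit_transform(X_num)

binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X_num)                    # equal-frequency bins, robust to skew

ranker = QuantileTransformer(output_distribution="uniform")
X_rank = ranker.fit_transform(X_num)                      # rank transform: magnitude gone, order kept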

Feature Interactions and Crosses — Why Trees Get Them Free

Trees capture interactions automatically. A decision tree split age > 40 followed by income > 100K inside that branch is exactly an (age, income) interaction — the model learned that high-income-and-over-40 behaves differently than either condition alone. This is why GBDTs on tabular data rarely need explicit interaction features.

Linear models and shallow NNs do not. A linear model y = w1·age + w2·income + b has no cross term. You must manufacture it. Two approaches:

Polynomial features (degree 2): compute all pairwise products x_i · x_j. With d features, you get O(d²) new features — explodes fast. Rarely use degree > 2. Useful when you have ~10–50 features and suspect non-linearity. For sparse one-hot inputs, use PolynomialFeatures(interaction_only=True) so you don't generate squares of binary indicators.

Feature crosses (categorical × categorical): concatenate two categorical values into a single new category. Example: city="SF" × day="Mon" → "SF_Mon". Facebook's 2014 ad-CTR paper (He et al., "Practical Lessons from Predicting Clicks on Ads at Facebook") showed that crossing categorical features and feeding them into a logistic regression on top of GBDT-derived leaf indicators lifted CTR significantly over either model alone — this "GBDT + LR" architecture became the default for ad-click prediction for half a decade.

When explicit crosses still matter for trees: when cardinality of either side is high, the tree may not find the combination inside its depth budget. A pre-computed cross exposes the combination at depth 1 instead of depth 6. For a depth-6 LightGBM model, only features in the top ~50 splits get used; a manual cross forces a promising combination to the front.
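A sketch of both manufactured-interaction routes, with illustrative names (X_linear is the numeric matrix fed to the linear model; df holds the raw categoricals):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Pairwise products only — no squared terms, no bias column; O(d^2) output, so keep d small
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X_linear)

# Categorical × categorical cross: the new column is then encoded like any other categorical
df["city_x_dow"] = df["city"].astype(str) + "_" + df["dow"].astype(str)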

Temporal Features — Cyclical, Lag, Rolling, and Time-Since

Temporal features are where leakage is most common and most catastrophic. A few patterns, each with a specific purpose:

Cyclical (sin / cos) encoding: hour of day, day of week, month, day of year are cyclical — hour 23 is close to hour 0, not 23× further from it. Encode as two features: sin(2π·hour/24) and cos(2π·hour/24). This lets any model (especially linear/NN) learn periodic patterns without treating hour=0 and hour=23 as far apart. Trees can learn thresholds on raw hour but benefit less from this transform.

Lag features: y_{t-1}, y_{t-7}, y_{t-30}. The single strongest predictor of tomorrow's sales is today's sales. For retail, finance, and demand forecasting, lagged features are the entire game. Check: every lag must be strictly in the past relative to the label timestamp — using y_{t-0} is a bug.

Rolling windows: mean(y, window=7d), std(y, window=30d). Captures trend and volatility. Use shift(1) before rolling to avoid the current day leaking in: df["rolling_7d"] = df["y"].shift(1).rolling(7).mean().

Exponentially-weighted moving average (EWMA): v_t = α·y_{t-1} + (1-α)·v_{t-1}. Smoother than rolling window, responds faster to recent changes. α ∈ [0.1, 0.5] typical.

Time-since-last-event: seconds since user's last purchase, login, click. Often one of the top-5 features in a churn or recsys model. Captures recency in a way that raw timestamps cannot.

Calendar features: day-of-week, is_holiday, is_weekend, is_month_end, is_payday. Especially strong for retail and financial data.
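A sketch of the EWMA and time-since-last-event patterns, assuming the same per-entity layout as the lag/rolling helper below (a group column, a datetime timestamp column, and a target column — all names illustrative):

import pandas as pd

def add_ewma_and_recency(df: pd.DataFrame, group_col: str, target_col: str,
                         ts_col: str = "timestamp", alpha: float = 0.3) -> pd.DataFrame:
    df = df.sort_values([group_col, ts_col]).copy()
    # EWMA over the shifted series — the current row's target is excluded by design
    df[f"{target_col}_ewma"] = (
        df.groupby(group_col)[target_col]
          .transform(lambda s: s.shift(1).ewm(alpha=alpha).mean())
    )
    # Seconds since the entity's previous event — recency signal for churn / recsys
    df["secs_since_last_event"] = (
        df.groupby(group_col)[ts_col].diff().dt.total_seconds()
    )
    return df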

Cyclical Time Encoding + Permutation Importance

temporal_features_and_importance.py (Python)
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.ensemble import GradientBoostingClassifier

def add_cyclical_time_features(df: pd.DataFrame, ts_col: str) -> pd.DataFrame:
    """Encode hour/day-of-week/month as sin/cos pairs + raw."""
    ts = pd.to_datetime(df[ts_col])
    df = df.copy()
    df["hour"] = ts.dt.hour
    df["dow"] = ts.dt.dayofweek
    df["month"] = ts.dt.month
    # Cyclical encoding: hour 23 and hour 0 are close in (sin, cos) space
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    df["dow_sin"] = np.sin(2 * np.pi * df["dow"] / 7)
    df["dow_cos"] = np.cos(2 * np.pi * df["dow"] / 7)
    df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
    df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
    return df

def add_lag_and_rolling(df: pd.DataFrame, group_col: str, target_col: str,
                        lags=(1, 7, 30), windows=(7, 30)) -> pd.DataFrame:
    """Shift(1) first so the current row's target never leaks."""
    df = df.sort_values([group_col, "timestamp"]).copy()
    grp = df.groupby(group_col)[target_col]
    for lag in lags:
        df[f"{target_col}_lag_{lag}"] = grp.shift(lag)
    # rolling over the shifted series, per group — current row is excluded by design
    shifted = grp.shift(1)
    for w in windows:
        df[f"{target_col}_rollmean_{w}"] = (
            shifted.groupby(df[group_col])
            .rolling(w, min_periods=1).mean()
            .reset_index(level=0, drop=True)   # drop the group level so the index realigns with df
        )
    return df

def rank_features(model, X_val, y_val, n_repeats: int = 10):
    """Permutation importance — unbiased vs gain-based importance which
    is biased toward high-cardinality features (Strobl et al., 2007)."""
    result = permutation_importance(
        model, X_val, y_val, n_repeats=n_repeats,
        random_state=42, scoring="roc_auc", n_jobs=-1,
    )
    return (
        pd.DataFrame({
            "feature": X_val.columns,
            "mean_drop_auc": result.importances_mean,
            "std": result.importances_std,
        })
        .sort_values("mean_drop_auc", ascending=False)
    )

Text Features Before You Need a Transformer

For tabular problems where a short text column (product title, search query, job description) is one of many features, you rarely need a BERT embedding. Start lightweight:

TF-IDF on unigrams and bigrams: sklearn.feature_extraction.text.TfidfVectorizer(ngram_range=(1, 2), max_features=50_000, min_df=5). For a LightGBM or linear model, this is often 80% of what a transformer gives at 1% of the cost. Output is sparse; use scipy.sparse throughout.

Character n-grams (3–5): robust to misspellings, abbreviations, and out-of-vocabulary words. Critical for user-generated text — "iphoen" and "iphone" share most character 3-grams. Standard in fraud detection on free-text fields.

Pre-trained sentence embeddings as features: SBERT / all-MiniLM-L6-v2 produces 384-d embeddings. Run offline, store as 384 numeric features, feed into LightGBM alongside the rest. Gets most of the semantic benefit without training a transformer end-to-end. Latency at serve time: batch inference on CPU ~10ms per sentence; on GPU ~1ms.

Simple length / punctuation features: character count, word count, uppercase ratio, exclamation-mark count. Strong signal for spam, sentiment, and review quality. Cost: near zero.
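A sketch combining the three lightweight approaches on a single text column (column name and hyperparameters are illustrative):

import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# Word- and character-level TF-IDF side by side, kept sparse end to end
text_union = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000, min_df=5)),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=50_000)),
])
X_text = text_union.fit_transform(df["product_title"])    # scipy.sparse matrix

# Cheap length / punctuation signals as plain numeric columns
titles = df["product_title"].fillna("")
length_feats = pd.DataFrame({
    "char_count": titles.str.len(),
    "word_count": titles.str.split().str.len(),
    "upper_ratio": titles.str.count(r"[A-Z]") / titles.str.len().clip(lower=1),
    "exclaim_count": titles.str.count("!"),
})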

The rule: reach for a fine-tuned LLM only when lighter approaches plateau, not as the default. On Kaggle tabular competitions with text fields, TF-IDF + LightGBM is still the baseline to beat.

Data Leakage — The Top Production Bug in Tabular ML

Leakage is any information reaching the training features that would not be available at prediction time. It inflates offline metrics and silently destroys production models. There are four distinct types, and every candidate should be able to name each with its fix:

Target leakage: a feature is directly or causally downstream of the label. Classic: using customer_complained to predict churn when complaints are caused by churn decisions. Less obvious: using policy_cancelled which is set after the fraud flag is raised. Fix: build a causal timeline — every feature must be computable strictly before the label event.

Train/test contamination (preprocessing leakage): fitting a scaler, imputer, or encoder on the full dataset before splitting. The test set's mean/std/category distribution shapes the transform; model learns a transform that would not exist in production. Fix: wrap every fitted step in sklearn.pipeline.Pipeline and pass the pipeline — not the preprocessed arrays — into cross_val_score. The pipeline re-fits every step inside each fold.

Group leakage: multiple rows share an entity (user_id, patient_id, session_id). Random split puts some of a user's rows in train and others in test — the model memorizes user identity. Classic in medical ML. Fix: GroupKFold(n_splits=5) with groups=user_ids.

Temporal leakage: for time-series, random K-fold puts future rows in training. The model sees events from next Tuesday while training on events through today. Fix: TimeSeriesSplit (expanding-window) or explicit chronological cutoffs. Also applies to target encoding — the encoding for category k at time t must only use rows with timestamp < t.
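A minimal sketch of the temporal fix, assuming rows are already sorted by time and pipe is a leakage-safe Pipeline like the ColumnTransformer example below:

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Expanding-window CV: each fold trains strictly on the past and validates on the next block
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=5), scoring="roc_auc")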

The senior-level move: when a colleague reports "CV AUC = 0.95," the first question is "what is your CV scheme and what does your pipeline object look like?" Not "what model did you use?" Leakage is 10× more likely than model choice to explain the number.

The 4 Leakage Types — Detection and Fix

| Leakage Type | Concrete Example | How It Looks | Fix |
| --- | --- | --- | --- |
| Target leakage | `customer_refunded` as feature for predicting `churn`; refund is a consequence of churn | Offline AUC 0.98; production AUC 0.72 | Causal audit: for every feature, verify it is computable strictly before the label event |
| Train/test contamination | `StandardScaler().fit(X_all)` before `train_test_split` | Training CV inflated by 1–3 AUC points vs held-out test | Wrap in `Pipeline`; fit all transforms per-fold via `cross_val_score` |
| Group leakage | Same patient's 20 visits split between train/test | Offline AUC 0.93 on per-row split; 0.71 on per-patient split | `GroupKFold(n_splits=5)` with `groups=patient_id` |
| Temporal leakage | Random K-fold on 2 years of transactions puts future rows in train and past rows in test | Offline looks great; production degrades immediately after deploy | `TimeSeriesSplit` / chronological cutoff; features use only data strictly before label timestamp |
| Target-encoding leakage (subset of #1 and #2) | `TargetEncoder.fit(X, y)` on full data, then transform | Training AUC jumps 5–10 points over production when using target-encoded features | K-fold out-of-fold encoding with smoothing, or CatBoost ordered target stats |

sklearn ColumnTransformer + Pipeline + Grouped CV

leakage_safe_pipeline.py (Python)
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GroupKFold, cross_val_score
from lightgbm import LGBMClassifier
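from sklearn.base import BaseEstimator, TransformerMixin

# Minimal sketch of the custom FrequencyEncoder used further down — an assumption, not a
# library class: any transformer that maps categories to their training-set frequency works.
class FrequencyEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.freq_maps_ = [
            X.iloc[:, i].value_counts(normalize=True).to_dict() for i in range(X.shape[1])
        ]
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        return np.column_stack([
            X.iloc[:, i].map(self.freq_maps_[i]).fillna(0.0).astype(float).values
            for i in range(X.shape[1])
        ])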

# Define columns by type
NUM_COLS = ["amount", "age_days", "txn_count_7d"]
CAT_LO_COLS = ["country_code", "device_type"]        # cardinality < 15
CAT_HI_COLS = ["merchant_id", "zip_code"]            # cardinality > 100

# Build per-type transformers
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),                     # no-op for LGBM but here for pedagogy
])

cat_lo = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="UNK")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=True)),
])

# High-cardinality: use the K-fold target encoder defined elsewhere.
# Here we use frequency encoding as a zero-leakage fallback.
cat_hi = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="UNK")),
    ("freq", FrequencyEncoder()),                    # custom transformer
])

preproc = ColumnTransformer(
    transformers=[
        ("num", numeric, NUM_COLS),
        ("cat_lo", cat_lo, CAT_LO_COLS),
        ("cat_hi", cat_hi, CAT_HI_COLS),
    ],
    remainder="drop",
)

pipe = Pipeline([
    ("preproc", preproc),
    ("model", LGBMClassifier(
        n_estimators=500, learning_rate=0.05, num_leaves=31, random_state=42,
    )),
])

# Group-aware CV prevents the same user appearing in both train and val
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    pipe, X, y,
    groups=df["user_id"],            # critical: group by entity, not row
    cv=cv, scoring="roc_auc", n_jobs=-1,
)
print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

# Every fold re-fits the imputer, scaler, encoders, and model ONLY on training rows.
# No statistics from val rows reach the training transforms.

Missing Data — MCAR, MAR, MNAR (Rubin 1976)

Rubin's 1976 taxonomy is still the correct framework. The right imputation depends on why the value is missing:

MCAR (Missing Completely At Random): the probability of missingness is independent of both observed and unobserved data. Sensor dropouts, random survey non-response. Safe to drop or mean/median impute. In practice, true MCAR is rare.

MAR (Missing At Random): missingness depends on observed features but not the missing value itself. Income missing more often for younger users — missingness is explained by age, which is observed. Model-based imputation works: regress the missing column on the other columns (MICE — Multivariate Imputation by Chained Equations, van Buuren 2011). Also: add a <col>_was_missing binary indicator — the fact that a value was missing is often itself predictive.

MNAR (Missing Not At Random): missingness depends on the unobserved value. Income missing because it is very high (users refuse to report). This is the hardest case — no unbiased imputation exists without domain modeling. Treat missingness as a feature: users who refuse to report income are systematically different from those who report; encode that.

Practical defaults:

  • Simple: SimpleImputer(strategy="median") for numerics, "most_frequent" or constant="UNK" for categoricals.
  • Better: KNNImputer(n_neighbors=5) — leverages similar rows. Expensive at scale.
  • Best (MAR): IterativeImputer (scikit-learn's MICE implementation). Iterates a regressor per column until convergence — see the sketch after this list.
  • Always: add a missingness indicator column for any feature with >1% missing rate.
  • Trees (XGBoost, LightGBM) handle missing values natively via an optimal-direction default for NaN — no imputation required. Linear models and NNs require imputation.
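A sketch of the MICE-style default plus missingness indicators, assuming a DataFrame df with the numeric columns named below; in practice these fitted steps live inside the Pipeline so each CV fold re-fits them:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — required to expose IterativeImputer
from sklearn.impute import IterativeImputer

num_cols = ["income", "age", "txn_count_30d"]             # illustrative column names
X_num = df[num_cols]

# Keep the fact of missingness as its own signal
indicators = X_num.isna().astype(int).add_suffix("_was_missing")

# MICE: each column is iteratively regressed on the others until convergence
imputed = pd.DataFrame(
    IterativeImputer(max_iter=10, random_state=42).fit_transform(X_num),
    columns=num_cols, index=X_num.index,
)
X_out = pd.concat([imputed, indicators], axis=1)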

Feature Selection — Filter, Wrapper, Embedded, Permutation

With 500 candidate features and 50K rows, the model overfits to spurious correlations. Four families of selection methods:

Filter methods (cheap, univariate): variance threshold, Pearson/Spearman correlation with target, mutual information (MI) for non-linear relationships, chi-squared for categorical features. Fast but miss interactions — a feature useless alone can be critical in combination.

Wrapper methods (expensive, model-aware): Recursive Feature Elimination (RFE) trains a model, removes the weakest feature by model importance, and repeats. Captures interactions but is O(d) model fits.

Embedded methods (built into model): L1 (Lasso) drives coefficients to zero → automatic selection for linear models. Tree-based feature_importances_ (split gain) comes free with GBDT. Critical caveat: default gain-based importance is biased toward high-cardinality features (Strobl et al., 2007) — a continuous variable has many possible split points and can accumulate gain spuriously. Never rank features by raw LightGBM feature_importances_ alone.
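A sketch of the embedded L1 route with SelectFromModel, using assumed names (X_train_scaled, y_train, feature_names); standardize first so the penalty treats coefficients comparably:

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# L1 drives weak coefficients to exactly zero; SelectFromModel keeps the survivors
l1_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
)
X_selected = l1_selector.fit_transform(X_train_scaled, y_train)
kept_features = np.asarray(feature_names)[l1_selector.get_support()]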

Permutation importance (unbiased, model-agnostic): randomly shuffle one feature's values in the validation set and measure the drop in score. Unbiased with respect to cardinality. Slower (one prediction per feature) but the standard for honest feature ranking. Use sklearn.inspection.permutation_importance with n_repeats=10.

Boruta (shadow features): add a shuffled "shadow" copy of every feature; a real feature is kept only if it outperforms its shadow across multiple runs. Strong but computationally expensive — good for a final feature set, not for exploration.

Staff-level caveat on SHAP: SHAP explains feature contributions to the model's predictions. It does not explain causal effects in the world. A SHAP plot showing age as the top feature means the model uses age heavily — if the model is miscalibrated on minority groups, SHAP will still rank age highly. Correlation-in-model ≠ real-world cause.

Offline / Online Feature Parity with Point-in-Time Join


Training-Serving Skew — The #1 Production Bug

Every senior ML engineer has been burned by training-serving skew at least once. The pattern is always the same: offline evaluation looks great, deploy to production, AUC silently drops 5–15 points, on-call gets paged a week later when a product metric dips. The cause is almost never the model.

The top three skew sources:

  1. Different code paths. Training uses a Pandas notebook; serving uses a Java microservice. Both claim to compute "30-day click count." One uses timezone-aware timestamps, the other UTC. One counts distinct events, the other total events. Distributions diverge. Fix: single feature definition in a registry (Feast, Tecton, Michelangelo, Zipline). Both paths call the same function.

  2. Different time horizons. Training computes "user lifetime purchase count" over full history. Serving only has 90 days due to a data retention policy. The feature the model was trained on does not exist in production. Fix: enforce retention parity at the feature registry, not at the consumer.

  3. Missing point-in-time correctness. Training joins labels from January with features as of today. The feature user_30d_fraud_rate for a January label row includes fraud that happened in February. The model learns to use future-leaking features; at serve time, those features are much weaker. Fix: entity_df point-in-time joins (native to Feast, Tecton, BigQuery AS OF).

The staff-level framing: a feature store is not a database. It is a regression test harness for feature contracts. Every production feature should have a daily job comparing the online distribution to the offline distribution (Population Stability Index, PSI > 0.25 = alert). This is ongoing, not a one-time setup. This is what separates "I use Feast" from "I operate a feature platform."
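A minimal sketch of the PSI check for one continuous feature — the bin count and the 0.25 alert threshold follow the convention cited above, and nothing here is tied to a particular feature-store API:

import numpy as np

def population_stability_index(offline: np.ndarray, online: np.ndarray, n_bins: int = 10) -> float:
    """PSI between the offline (training) and online (serving) sample of one feature."""
    edges = np.quantile(offline, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # catch out-of-range serving values
    expected = np.histogram(offline, bins=edges)[0] / len(offline)
    actual = np.histogram(online, bins=edges)[0] / len(online)
    expected = np.clip(expected, 1e-6, None)
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# daily job: alert if population_stability_index(train_sample, serving_sample) > 0.25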

TIP

What to Say in the Interview Tomorrow

Structure every feature engineering answer around four axes in this order:

1. Per-column plan by (cardinality × model family): "Cat columns K < 15 → one-hot; K ∈ [100, 1000] → K-fold target encoding with α=20 smoothing; K > 10K → hashing or embedding. Numeric: log-transform right-skewed for linear, no-op for trees."

2. Leakage controls baked into the pipeline: "Everything goes inside a Pipeline(ColumnTransformer, Model). I pass the Pipeline to cross_val_score — never preprocessed arrays. For temporal data, TimeSeriesSplit. For grouped data, GroupKFold."

3. Target encoding is the trap: "I never use naive TargetEncoder.fit(X, y). Either K-fold out-of-fold with smoothing (n·ȳ_k + α·ȳ)/(n + α), or CatBoost's ordered target statistics which are leakage-safe by construction."

4. Training-serving parity: "Features are defined once in a registry (Feast / Tecton / Michelangelo). Offline Spark and online Flink both call the same definition. Point-in-time joins for historical training data. PSI monitoring for drift in production."

If you hit all four and cite Rubin 1976, Strobl 2007, He 2014 (FB crossed features), or Prokhorenkova 2018 (CatBoost) for the specific claim you are making, you will clear the staff-level bar.
