
Imbalanced Classification: Metrics, Class Weights, SMOTE, and Threshold Tuning

The complete decision framework for imbalanced classification — fraud, rare disease, ad CTR. Covers why accuracy and ROC-AUC lie under imbalance, when SMOTE hurts rather than helps on tabular data, focal loss vs class weights, Pozzolo's calibration correction after undersampling, and why threshold tuning is the single most under-used technique in production ML.

Tags: Class Imbalance, SMOTE, Focal Loss, Threshold Tuning, PR-AUC, Cost-Sensitive Learning, Fraud Detection, Calibration, scale_pos_weight, Undersampling, Oversampling, Pozzolo Correction, Anomaly Detection

Why Imbalance Is Not Always a Problem — And When It Actually Is

Most interview prep treats class imbalance as a single problem with a single fix ('use SMOTE'). This is wrong on both counts. Imbalance is not inherently a problem — it becomes a problem only when three specific conditions are present, and the right fix depends on which one dominates.

Condition 1 — You chose a bad metric. Accuracy on a 99:1 imbalanced dataset is meaningless because the majority-class baseline already achieves 99%. The fix here isn't resampling — it's picking a metric sensitive to the minority class (PR-AUC, recall at fixed precision, F-beta).

Condition 2 — Your loss function assumes balanced priors. Standard cross-entropy treats every example equally, so with 1,000 negatives per positive, 99.9% of the gradient signal comes from negatives. The model drifts toward 'predict negative always' as a local optimum. The fix is algorithm-level: class weights, focal loss, or a Bayes-optimal threshold shift.

Condition 3 — Business costs are asymmetric. Missing a $50K fraud is not the same mistake as blocking a $12 coffee purchase. Resampling can't encode this — only a cost matrix or a business-cost-aware threshold can.

Real-world imbalance regimes:

  • Ad CTR: 1–5% positive rate. Mild imbalance. Class weights usually sufficient.
  • Rare disease screening: 0.1–1% positive. Moderate. Need focal loss or cost-sensitive learning.
  • Fraud detection: 0.01–0.1% positive (Kaggle Credit Card Fraud benchmark: 0.172%). Severe. Cost-sensitive threshold + calibration are mandatory.
  • Beyond 1:10,000: stop framing as classification. Switch to anomaly detection (Isolation Forest, one-class SVM, autoencoder reconstruction error).

The interview mistake most candidates make: jumping to SMOTE before diagnosing which condition actually applies.

IMPORTANT

What Interviewers Evaluate on This Topic

Mid-level: Do you know accuracy is wrong on imbalanced data? Can you name PR-AUC, class weights, SMOTE? Do you know the default 0.5 threshold is often wrong?

Senior-level: Do you know SMOTE often hurts on tabular + gradient boosting (Elor & Averbuch-Elor 2022 benchmarks)? Can you derive the Bayes-optimal threshold from a cost matrix? Do you know resampling breaks probability calibration and how to fix it? Can you explain why ROC-AUC is optimistic under imbalance (Davis & Goadrich 2006)?

Staff-level: Do you know Pozzolo's calibration correction after undersampling (p = β·p_s / (β·p_s − p_s + 1))? Do you know FAANG ranking teams do NOT rebalance — they use pairwise/listwise losses with cost-aware negative sampling (YouTube's importance sampling)? Can you reason about when to reframe as anomaly detection vs. classification? Do you know the CV leakage trap from applying SMOTE before the train/val split?

Clarifying Questions — Ask Before Touching Resampling

01

What is the positive-class rate, and is it stable over time?

1–5% is mild (CTR). 0.1–1% is moderate (rare disease). 0.01–0.1% is severe (fraud), and below roughly 1:10,000 an anomaly-detection framing may be warranted. If the rate drifts (fraud, seasonal), the operating threshold must drift too — a fixed threshold becomes miscalibrated within weeks.

02

What are the relative costs of FP vs FN?

If the interviewer doesn't specify, ask. Cost ratio directly determines the optimal threshold: τ* = C_FP / (C_FP + C_FN). For a 100:1 cost ratio, τ* ≈ 0.01 — not 0.5. Without this number, no amount of resampling will fix the final decision rule.
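The arithmetic is worth internalizing; a minimal sketch (the cost figures are illustrative, not from any real system):

```python
def bayes_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold that minimizes expected cost for a calibrated classifier."""
    return cost_fp / (cost_fp + cost_fn)

print(bayes_optimal_threshold(1.0, 100.0))   # 100:1 cost ratio -> ~0.0099, not 0.5
print(bayes_optimal_threshold(5.0, 500.0))   # same ratio, same threshold
print(bayes_optimal_threshold(1.0, 1.0))     # 0.5 only when costs are symmetric
```

Note that only the ratio of costs matters, which is why the 0.5 default is a hidden assumption that FP and FN cost the same.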

03

Do downstream consumers need calibrated probabilities or just ranked scores?

Ranking use cases (top-k retrieval, risk scoring) only need monotonic scores — resampling is fine. Decision use cases (auto-decline fraud, auto-triage patients) need calibrated probabilities — resampling breaks calibration and must be corrected (Pozzolo 2015).

04

How much minority data do we actually have?

If you have 100,000 positives, resampling is a solved problem. If you have 50 positives, you're in a data-scarcity regime — SMOTE interpolates between 50 points and creates unrealistic synthetic examples. Few-shot learning, active labeling, or anomaly detection may be better.

05

What model family are we using?

Gradient boosting (XGBoost, LightGBM) has built-in scale_pos_weight — no resampling needed. Neural networks benefit from focal loss. Logistic regression with class_weight='balanced' is often sufficient. The fix is algorithm-specific.

Evaluation First — Because the Wrong Metric Makes Every Other Fix Invisible

Before any algorithm change, fix the metric. A model optimized for the wrong metric will look great and perform terribly, regardless of how you handle imbalance.

Accuracy — useless under imbalance. 99% accuracy on a 99:1 dataset is achieved by predicting the majority class for everything. Report it only as a sanity check, never as the primary metric.

ROC-AUC — silently optimistic under imbalance. Davis & Goadrich (2006) showed that ROC curves are deceptive when negatives vastly outnumber positives because FPR = FP / (FP + TN) has a huge TN in the denominator. A model can achieve ROC-AUC = 0.99 while its precision at any useful threshold is 10%. Use ROC-AUC for balanced tasks only.

PR-AUC (average precision) — the correct primary metric for imbalanced classification. Precision and recall both operate on the positive class, so the metric isn't dominated by the easy-to-classify majority negatives. Under severe imbalance, PR-AUC is typically far lower than ROC-AUC — that gap is honest; don't smooth it over.
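The ROC-vs-PR gap is easy to reproduce; a sketch on synthetic 1%-positive data (exact numbers depend on the seed and are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# ~1% positive rate, purely synthetic
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_te, proba):.3f}")           # looks impressive
print(f"PR-AUC:  {average_precision_score(y_te, proba):.3f}")  # the honest number
```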

Precision@k / Recall@k — when operational capacity is fixed (e.g., fraud analysts can review 1,000 alerts/day), report precision at the top-k scored examples. This maps directly to business impact.

Recall at fixed precision (e.g., recall@precision=0.9) — the metric that decouples model quality from threshold choice and directly answers 'how much fraud can I catch without blocking more than 10% legitimate transactions?'
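One way to compute it from sklearn's precision_recall_curve (the helper name is ours; the toy arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_proba, min_precision=0.9):
    """Best recall achievable at any threshold whose precision >= min_precision."""
    precision, recall, _ = precision_recall_curve(y_true, y_proba)
    mask = precision >= min_precision
    # the curve's (precision=1, recall=0) endpoint guarantees mask is non-empty
    return float(recall[mask].max())

y_true = np.array([0, 0, 1, 0, 1, 1])
y_proba = np.array([0.2, 0.9, 0.8, 0.4, 0.7, 0.6])
print(recall_at_precision(y_true, y_proba, min_precision=0.7))  # 1.0
print(recall_at_precision(y_true, y_proba, min_precision=0.9))  # 0.0
```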

F-beta — F1 weights precision and recall equally; F-beta with β = sqrt(C_FN / C_FP) weights them by their business costs. For fraud with 100:1 cost ratio, β = 10.

Expected cost — the only metric with units of dollars: EC = FP × C_FP + FN × C_FN. Impossible to game. Report this as the headline metric when cost data is available.

Evaluation Metrics Under Imbalance — What Each Tells You

| Metric | Formula | When to Use | When It Lies |
|---|---|---|---|
| Accuracy | (TP+TN)/N | Balanced classes only | 99% accuracy on 99:1 imbalance is the majority baseline |
| ROC-AUC | Area under TPR vs FPR | Balanced or threshold-independent ranking | Dominated by TN under imbalance; Davis & Goadrich 2006 |
| PR-AUC | Area under precision-recall curve | Primary metric for imbalanced binary classification | Still needs calibrated scores for threshold selection |
| Precision@k | TP@k / k | Fixed operational capacity (fraud analysts, ad slots) | Sensitive to the chosen k; pick k = business capacity |
| Recall @ fixed precision | TPR given Prec ≥ p* | SLA-style constraints (FP budget is hard-capped) | Requires calibration to map threshold → precision |
| F-beta | (1+β²)·P·R / (β²·P+R) | Asymmetric cost, β = sqrt(C_FN/C_FP) | Still a single scalar; hides the PR curve shape |
| Expected cost | FP·C_FP + FN·C_FN | When C_FP, C_FN are known in $ | Requires reliable cost estimates from business |
| MCC | (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Single scalar robust to imbalance | Less intuitive than PR-AUC for stakeholders |

Solution Decision Tree — Which Technique for Which Regime


Algorithm-Level Solutions — Class Weights, Cost-Sensitive Learning, Focal Loss

Algorithm-level solutions modify the loss function rather than the data. They are almost always the right first move because they don't distort the data distribution and don't require post-hoc calibration correction.

Class weights (simplest, most underused): scikit-learn's class_weight='balanced' sets weight for class c to N / (K × N_c). For a 99:1 dataset this is {0: 0.505, 1: 50.0} — positive examples contribute 100× more to the loss. In XGBoost and LightGBM, use scale_pos_weight = N_neg / N_pos (≈ 99 for 99:1 imbalance). In PyTorch, nn.BCEWithLogitsLoss(pos_weight=torch.tensor([99.0])). Simple, no hyperparameter to tune (the ratio is determined by the data), works for most mild-to-moderate imbalance.
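A sketch of the same ratio computed the three ways mentioned above (the 99:1 toy data is illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 990 + [1] * 10)  # 99:1 imbalance

# scikit-learn: weight for class c is N / (K * N_c)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights.round(3))))  # {0: 0.505, 1: 50.0}

# XGBoost / LightGBM: ratio of negatives to positives
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(scale_pos_weight)  # 99.0

# PyTorch equivalent: nn.BCEWithLogitsLoss(pos_weight=torch.tensor([99.0]))
```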

Cost-sensitive learning: extends class weights to a full cost matrix when C_FP ≠ C_FN. The weight ratio becomes (C_FN / C_FP) × (N_neg / N_pos). For 100:1 cost ratio and 1% base rate: scale_pos_weight = 100 × 99 = 9,900. This directly encodes business cost into the loss function.

Focal Loss (Lin et al., 2017 — RetinaNet): down-weights easy examples, not just minority ones. The key distinction from class weights:

  • Class weights scale all positives uniformly — a positive that the model already predicts correctly at p=0.99 still contributes fully weighted gradient.
  • Focal loss FL = -(1-p_t)^γ · log(p_t) down-weights both easy positives and easy negatives. The remaining gradient comes from hard, ambiguous examples.

With γ=2, an easy negative at p_t=0.99 gets a modulating factor of (0.01)² = 10⁻⁴ — essentially zero gradient. A hard example at p_t=0.5 gets (0.5)² = 0.25, some 2,500 times more. The net effect is implicit hard example mining without an explicit mining pipeline. Standard in RetinaNet, YOLOv5+, EfficientDet.
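A two-line check of that arithmetic — the modulating factor (1 − p_t)^γ at different confidence levels:

```python
for p_t in (0.99, 0.9, 0.5):
    weight = (1 - p_t) ** 2  # gamma = 2
    print(f"p_t = {p_t}: modulating factor = {weight:.4f}")
# p_t = 0.99 -> 0.0001 (easy example, effectively silenced)
# p_t = 0.9  -> 0.0100
# p_t = 0.5  -> 0.2500 (hard example dominates the remaining gradient)
```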

When focal loss beats class weights: one-stage object detection (99,990 easy background anchors dominate CE), severe imbalance beyond 100:1, when you can't afford offline hard negative mining.

When focal loss loses to class weights: tabular gradient boosting (trees already do implicit selection), balanced data (focal over-weights the few hard examples), early training (the modulating factor only activates once easy examples become easy — typically after 1K–5K iterations).

Threshold tuning (underused, massive impact): regardless of what loss you use, the default threshold of 0.5 is almost always wrong. For asymmetric costs, the Bayes-optimal threshold is τ* = C_FP / (C_FP + C_FN). This single change often delivers 80% of the total gain available from imbalance handling.


Threshold Tuning With Business Cost — The Highest-ROI Technique

threshold_tuning.py
import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(
    y_true: np.ndarray,
    y_proba: np.ndarray,
    cost_fp: float,
    cost_fn: float,
) -> tuple[float, float]:
    """
    Find the threshold that minimizes expected business cost.

    Returns (threshold, expected_cost_per_example).
    Bayes-optimal threshold is C_FP / (C_FP + C_FN) for a
    well-calibrated model, but empirical scan handles miscalibration.
    """
    # Scan thresholds derived from predicted scores
    thresholds = np.unique(np.concatenate([[0.0], y_proba, [1.0]]))
    best_cost = np.inf
    best_threshold = 0.5

    for tau in thresholds:
        y_pred = (y_proba >= tau).astype(int)
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        cost = fp * cost_fp + fn * cost_fn
        if cost < best_cost:
            best_cost = cost
            best_threshold = tau

    return best_threshold, best_cost / len(y_true)


# Example: fraud detection with asymmetric costs
# C_FN = $500 (average fraud loss), C_FP = $5 (customer friction)
tau_star, avg_cost = find_optimal_threshold(
    y_val, model.predict_proba(X_val)[:, 1],
    cost_fp=5.0, cost_fn=500.0,
)
# Typical outcome: tau_star ≈ 0.01–0.05, NOT 0.5
# Deploy: predict fraud if p > tau_star
print(f"Optimal threshold: {tau_star:.4f}  | Cost/tx: ${avg_cost:.2f}")

Data-Level Solutions — SMOTE, Undersampling, and Why They Often Backfire

Data-level solutions modify the training set rather than the loss. They are popular because they're intuitive, but they carry risks that most tutorials skip.

Random oversampling (duplicate minority): trivially implemented, but duplicates cause overfitting — the model memorizes minority points verbatim. Rarely outperforms class weights.

SMOTE (Chawla et al., 2002): Synthetic Minority Oversampling Technique. For each minority point, pick a random neighbor among its k nearest minority neighbors and generate a synthetic point on the line segment between them. Intuition: fill in the minority manifold with interpolated examples.

The trap: SMOTE assumes the minority class lies on a convex manifold where linear interpolation produces plausible examples. This is often false on tabular data — the line between two fraud transactions may land in a region of feature space that is categorically legitimate (a fraudulent transaction of $500 on a high-limit card and a fraudulent transaction of $5 on a prepaid card, interpolated, yields a $252 transaction with half-high / half-prepaid card attributes — physically impossible). Empirically, Elor & Averbuch-Elor (2022) benchmarked SMOTE variants against class weights across 73 tabular datasets and found that SMOTE rarely beats properly tuned class weights on gradient boosting. For tabular + XGBoost/LightGBM: skip SMOTE.

Variants (Borderline-SMOTE, ADASYN) focus synthesis on the decision boundary but don't fix the manifold assumption. Tomek links and ENN remove overlapping points — these are cleaning techniques, not balancing; apply cautiously.

Random undersampling: drop majority examples. Information loss is severe for data-scarce problems but can help when computational budget is the constraint. EasyEnsemble (Liu et al., 2009) undersamples the majority multiple times and trains an ensemble on each subsample — recovers the lost information through bagging.

The single most important SMOTE trap: apply SMOTE BEFORE the train/val split and you leak test information into training. SMOTE synthesizes new points from neighbors; if a validation point ends up as a neighbor used to generate a training synthetic, information has flowed across the split. Your CV score becomes optimistic by 5–15 points in PR-AUC. Fix: apply SMOTE inside each CV fold using imblearn.pipeline.Pipeline, not sklearn.pipeline.Pipeline.

SMOTE Leakage — Wrong vs Correct Pipeline


Leakage-Safe SMOTE and Focal Loss Implementations

imbalance_implementations.py
import torch
import torch.nn.functional as F
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # NOT sklearn.pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# ---- Leakage-safe SMOTE: SMOTE is refit inside each CV fold ----
# imblearn's Pipeline applies SMOTE only during .fit(), never .predict()
pipeline = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.1, k_neighbors=5, random_state=42)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="average_precision")
# PR-AUC per fold. If these scores are 5+ points LOWER than a pipeline that
# applied SMOTE before the split, the earlier pipeline was leaking.


# ---- Focal Loss for imbalanced binary classification ----
class FocalLoss(torch.nn.Module):
    """
    Lin et al. 2017. gamma=2, alpha=0.25 are RetinaNet defaults.
    Down-weights easy examples so hard ones dominate the gradient.
    """

    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)  # model's confidence in correct class
        focal_weight = (1 - p_t) ** self.gamma
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * focal_weight * bce).mean()


# ---- XGBoost with scale_pos_weight (usually enough for tabular) ----
# import xgboost as xgb
# n_neg, n_pos = (y == 0).sum(), (y == 1).sum()
# model = xgb.XGBClassifier(
#     scale_pos_weight=n_neg / n_pos,  # handles imbalance directly
#     eval_metric="aucpr",              # PR-AUC, not ROC-AUC
#     max_depth=6, n_estimators=500, learning_rate=0.05,
# )

Calibration After Resampling — The Pozzolo Correction

Resampling changes the class prior in the training data, which means the model's predicted probabilities no longer reflect the true population prior. This is catastrophic for any downstream decision that depends on calibrated probabilities (cost-sensitive thresholds, credit pricing, risk ranking).

The intuition: if you undersample negatives at rate β (keeping β × N_neg examples), your training data has an artificially boosted minority rate. A model trained on this data outputs P(fraud | x, resampled data), not P(fraud | x, true data). If your true fraud rate is 0.1% and you undersampled to a 50/50 training set, your model's predicted 0.5 probability actually corresponds to a true probability of ~0.1%.

Pozzolo et al. (2015) derived the exact correction for undersampling. If p_s is the probability predicted by the model trained on undersampled data and β is the subsampling rate applied to the majority class, the true probability is:

p = β · p_s / (β · p_s - p_s + 1)

Or equivalently, the forward map: p_s = p / (p + β(1 − p)).

Apply this as a post-processing step on every prediction. Without it, your Bayes-optimal threshold calculation is wrong, your probability-weighted expected value is wrong, and any downstream system (pricing, ranking by expected fraud loss) is systematically biased.

Practical workflow (production-grade):

  1. Train on undersampled balanced data (fast, simple loss landscape).
  2. Apply Pozzolo correction on inference-time probabilities.
  3. Validate with a reliability diagram on held-out, un-resampled data. Group predictions into 10 equal-frequency bins; mean predicted probability in each bin should match the observed positive rate.
  4. If the reliability diagram still deviates, fit Platt scaling or isotonic regression on top of the Pozzolo-corrected probabilities.
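Step 3 can be sketched with plain NumPy (equal-frequency binning; the helper name and the synthetic sanity check are ours):

```python
import numpy as np

def reliability_table(y_true: np.ndarray, y_proba: np.ndarray, n_bins: int = 10):
    """Per equal-frequency bin: (mean predicted probability, observed positive rate).
    For a well-calibrated model the two numbers should track each other."""
    order = np.argsort(y_proba)
    return [(float(y_proba[b].mean()), float(y_true[b].mean()))
            for b in np.array_split(order, n_bins)]

# Sanity check on synthetic, perfectly calibrated predictions
rng = np.random.default_rng(0)
p = rng.uniform(size=100_000)
y = (rng.uniform(size=100_000) < p).astype(int)
for mean_pred, obs_rate in reliability_table(y, p):
    print(f"predicted {mean_pred:.3f} vs observed {obs_rate:.3f}")
```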

For SMOTE (oversampling), the correction is different because SMOTE inflates minority without removing majority information. Platt scaling or isotonic regression on held-out untouched data is the standard approach — no closed-form correction.

For class weights, calibration is usually preserved because the data distribution is untouched; only the gradient weighting changes. Still, tree-based models (XGBoost, LightGBM) produce uncalibrated scores regardless — always run a reliability diagram.

Pozzolo Calibration Correction for Undersampling
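A sketch of the correction as a post-processing step (the β value and prior below are illustrative):

```python
import numpy as np

def pozzolo_correct(p_s: np.ndarray, beta: float) -> np.ndarray:
    """Map probabilities from a model trained with negatives undersampled at
    keep-rate beta back to the true prior: p = beta*p_s / (beta*p_s - p_s + 1)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

# True fraud rate 0.1%; undersampling negatives at beta = 0.001 gives roughly
# 50/50 training data, so the model's 0.5 should map back to ~0.001
p_s = np.array([0.1, 0.5, 0.9])
print(pozzolo_correct(p_s, beta=0.001))  # shrinks sharply toward the true prior
```

The forward map p_s = p / (p + β(1 − p)) and this inverse round-trip exactly, which is an easy unit test for the deployment code path.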

When to Stop Framing as Classification — Anomaly Detection

Beyond 1:10,000 imbalance, the classification framing starts to break down. You don't have enough positives to learn a class boundary, and the 'positive class' is often heterogeneous — many distinct rare phenomena lumped together (different fraud schemes, different disease subtypes, different system failures).

Anomaly detection reframes the problem: instead of 'learn what fraud looks like', learn 'what normal looks like' and flag deviations. This requires only negative examples (abundant) and implicitly handles novel attack patterns that pure classifiers miss.

Isolation Forest (Liu et al., 2008): builds random trees where paths to leaves are shorter for outliers (anomalies are easier to isolate with random splits). Scales well, no tuning beyond contamination rate. Good baseline for tabular anomaly detection.
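A minimal sketch (synthetic data; the anomaly cluster is placed artificially far out for clarity):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_normal = rng.normal(0.0, 1.0, size=(10_000, 8))  # abundant "normal" behavior
X_anomaly = rng.normal(6.0, 1.0, size=(10, 8))     # rare, far-off events

iso = IsolationForest(random_state=42).fit(X_normal)  # trained on normal data only
# score_samples: lower score = easier to isolate = more anomalous
print(iso.score_samples(X_normal[:5]).mean())  # near the normal baseline
print(iso.score_samples(X_anomaly).mean())     # clearly lower
```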

One-Class SVM: learns a decision boundary around the normal data. Sensitive to kernel choice and the nu parameter; kernel computation is O(N²) — doesn't scale past ~100K points.

Autoencoder reconstruction error: train an autoencoder on normal data only; anomalies reconstruct poorly because the network never learned to represent them. Works well on high-dimensional data (images, network traffic). Reconstruction error itself is the anomaly score.

Hybrid approach (used at Stripe Radar, per their engineering blog): anomaly detection as a first-stage filter that surfaces candidates to a supervised classifier. The anomaly detector handles novel patterns; the supervised model refines known fraud types.

When NOT to reframe as anomaly detection: when you have abundant labeled minority data (10K+ positives), when the minority class is homogeneous (same fraud scheme), when calibrated probabilities are required (anomaly scores are not probabilities). In these cases, supervised classification with focal loss + threshold tuning outperforms anomaly detection.

Solution Method Comparison — Imbalance Techniques

| Method | Pros | Cons | When to Use | When NOT to Use |
|---|---|---|---|---|
| Class weights (sklearn class_weight='balanced', XGBoost scale_pos_weight) | Simple, no data distortion, preserves calibration | Uniform up-weighting — doesn't distinguish easy from hard positives | First-line fix for most tabular problems, mild-to-moderate imbalance | When hard examples are overwhelmed by easy ones (use focal loss instead) |
| Cost-sensitive learning (explicit cost matrix) | Directly encodes business cost into loss | Requires known C_FP, C_FN from business | When costs are known and asymmetric (fraud, medical) | When cost data is unreliable or unavailable |
| Focal loss (γ=2, α=0.25) | Down-weights easy examples automatically; implicit hard negative mining | Over-weights hard examples on balanced data; unstable early in training | One-stage detection, severe imbalance >100:1, neural networks | Balanced data; tabular GBM (trees already do implicit selection) |
| Random oversampling | Trivial, preserves all data | Duplicates overfit; models memorize minority verbatim | Rarely — use class weights instead | Deep learning (duplicates don't augment) |
| SMOTE / Borderline-SMOTE / ADASYN | Synthesizes new points; can smooth minority manifold | Breaks on non-convex manifolds; often underperforms class weights on tabular (Elor & Averbuch-Elor 2022); breaks calibration | Image/text with NN where augmentation is natural; last resort after class weights fail | Tabular + GBM; high-cardinality categorical features; non-convex minority manifolds |
| Random undersampling | Fast training; fixes class-dominated gradient | Loses information from discarded majority | Extreme imbalance with abundant negatives; compute-constrained | Small datasets; when calibration matters (unless Pozzolo correction applied) |
| EasyEnsemble / BalanceCascade | Recovers undersampling information via bagging | Trains multiple models; more complex pipeline | Severe imbalance with >100K negatives | When model inference budget is tight |
| Threshold tuning (Bayes-optimal) | ~80% of total gain; no data changes; no retraining | Requires calibrated probabilities | ALWAYS combine with other techniques | Never skip — this is the highest-ROI step |
| Anomaly detection (Isolation Forest, one-class SVM, autoencoder) | Handles novel patterns; needs only normal data | Scores aren't probabilities; harder to threshold | Extreme imbalance >1:10K; heterogeneous minority; novel-attack regimes | Homogeneous minority with 10K+ labels |
⚠ WARNING

Top Production Failure Modes

1. SMOTE before CV split — PR-AUC inflated by 5–15 points. Synthetic points built from training neighbors leak into validation. Always use imblearn.pipeline.Pipeline so SMOTE refits per fold.

2. Forgetting to recalibrate after undersampling. Probabilities are distorted by the shifted class prior. Threshold calculations, risk ranking, and pricing become systematically biased. Apply Pozzolo correction or Platt scaling on held-out untouched data.

3. Deploying at threshold = 0.5 after all the imbalance work. You trained with class weights, fit focal loss, tuned hyperparameters — then deployed at the default threshold. τ* = C_FP / (C_FP + C_FN) is almost never 0.5 under imbalance. This single mistake negates every other fix.

4. Using ROC-AUC as the primary offline metric. You'll see ROC-AUC = 0.98 and ship a model with PR-AUC = 0.30. Davis & Goadrich (2006) is 20 years old and this trap still ships in production every week.

5. Applying SMOTE to categorical features without encoding awareness. Interpolating between one-hot encoded categorical values produces fractional one-hots that are semantically meaningless. Use SMOTE-NC for mixed-type data or skip SMOTE entirely.

Why FAANG Ranking Teams Don't Rebalance

A non-obvious staff-level insight: production ranking models at FAANG almost never rebalance their training data. This is counterintuitive — YouTube's recommendation system sees 1:millions imbalance (user watches one of millions of candidate videos), and yet their training pipelines don't use SMOTE or focal loss.

The reason: ranking problems don't use pointwise classification loss. They use pairwise or listwise losses that are inherently imbalance-robust.

Pairwise loss (RankNet, LambdaRank): for each positive example, sample K negatives and minimize -log(sigmoid(score_pos - score_neg)). The loss operates on score differences, so the absolute class ratio doesn't matter — what matters is that the positive outranks the negatives. YouTube's two-tower recommender uses this pattern with importance sampling: negatives are drawn weighted by popularity to simulate in-batch negatives from the full catalog without materializing it.

Listwise loss (ListMLE, Softmax over in-batch): for each positive, compute softmax over all negatives in the batch. With batch size 8,192, each positive sees 8,191 implicit negatives. Used by two-tower models at Google, LinkedIn, Pinterest.

Hard negative mining: at FAANG scale, uniformly random negatives are trivially easy — any random video is clearly not what the user wanted. Instead, hard negatives are mined from (1) in-batch top-scored non-positives, (2) semantically similar but not-clicked items, (3) the model's own high-confidence false positives.
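The pairwise objective above can be sketched in a few lines of PyTorch (the shapes and the softplus stability trick are our choices, not a specific team's code):

```python
import torch
import torch.nn.functional as F

def pairwise_rank_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(s_pos - s_neg)) averaged over K sampled negatives per positive.
    score_pos: (batch,); score_neg: (batch, K).
    Note the class ratio never appears -- only score differences matter."""
    diff = score_pos.unsqueeze(1) - score_neg  # (batch, K)
    return F.softplus(-diff).mean()            # softplus(-x) == -log(sigmoid(x))

pos = torch.tensor([4.0, 2.0])
neg = torch.tensor([[0.0, 1.0], [0.0, 3.0]])   # second row has a hard negative
loss = pairwise_rank_loss(pos, neg)            # hard pair dominates the loss
```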

Why this matters for interviews: if you're asked about class imbalance in a recommendation or search context, the answer is not 'use SMOTE'. It's 'reframe as a ranking problem with pairwise/listwise loss and cost-aware negative sampling'. This is the depth that separates staff-level answers from senior-level answers on imbalance questions in an MLSD context.

See YouTube's 'Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations' (Yi et al., 2019) for the canonical reference on importance-sampled softmax for extreme-scale ranking.

TIP

Interview Cheat Sheet — What to Say Tomorrow

Q: 'How would you handle class imbalance?' → Don't jump to SMOTE. State the 3-step framework: (1) fix the metric — PR-AUC not ROC-AUC, (2) algorithm-level fix — class weights or focal loss, (3) Bayes-optimal threshold from business costs. Mention that threshold tuning alone often gets 80% of the gain.

Q: 'Would you use SMOTE?' → For tabular + GBM, no — Elor & Averbuch-Elor (2022) showed class weights match or beat SMOTE on most benchmarks. For image/text + NN, consider data augmentation or focal loss first. If you do use SMOTE, apply it inside CV folds with imblearn.pipeline, never before the split.

Q: 'What metric would you report?' → PR-AUC as headline. Precision@k for fixed capacity operations. Expected cost in dollars when cost data is available. Never accuracy, rarely ROC-AUC.

Q: 'Why does ROC-AUC lie under imbalance?' → FPR has TN in the denominator; TN dominates under imbalance, making FPR appear small even when precision is terrible. Davis & Goadrich 2006.

Q: 'What happens to probabilities after undersampling?' → They're distorted by the shifted class prior. Apply Pozzolo (2015) correction: p = β·p_s / (β·p_s - p_s + 1). Validate with a reliability diagram on untouched held-out data.

Q: 'At 1:100,000 imbalance, is this still classification?' → Probably not. Reframe as anomaly detection — Isolation Forest or autoencoder reconstruction error. Or, if it's a ranking problem, use pairwise/listwise loss with hard negative mining (YouTube, Stripe Radar pattern).
