ML Evaluation Metrics: The Complete Guide
Know exactly which metric to use for which problem type, and why. Covers Precision, Recall, F1, ROC-AUC, PR-AUC, NDCG, calibration, regression metrics, and when each is misleading. 10 hard interview questions with detailed answers.
The Metric Selection Problem
Choosing the wrong metric is one of the most common ML interview mistakes. Using accuracy on an imbalanced dataset (99% negative examples) makes a model that always predicts negative look 99% accurate — but it's completely useless.
Every metric choice involves a tradeoff that must be tied to business impact. Ask these four questions before choosing any metric:
- What is the class distribution? (balanced vs. imbalanced)
- What is the cost of each error type? (False Positive vs. False Negative)
- Is this a ranking or classification problem?
- Does the output probability need to be calibrated — trusted as a real probability?
Metric Decision Framework
Understand class distribution
Balanced classes → accuracy is fine. Imbalanced (>10:1) → never use accuracy alone. For 99:1 imbalanced classes, use PR-AUC, not ROC-AUC. Rule: if the minority class is what you care about (fraud, cancer, rare failure), ROC-AUC can be dangerously optimistic.
Identify cost of errors
What's worse: False Positive or False Negative? Cancer screening: FN is catastrophic (missed cancer) → maximize Recall. Spam filter: FP is bad (real email in spam) → maximize Precision. Fraud at a bank: FN (missed fraud) costs money but FP (blocked legitimate customer) damages trust — business determines the ratio.
Choose primary metric
High FP cost → optimize Precision. High FN cost → optimize Recall. Balance both → F1. Ranking/search → NDCG or MAP. Binary classification, imbalanced → PR-AUC. Binary, balanced → ROC-AUC. Regression → RMSE if large errors matter more, MAE if you want equal weight.
Set a constraint metric
Usually 'maximize Recall subject to Precision > 0.8' or vice versa. Threshold tuning gives you the curve; the operating point is where business value is maximized. Never ship a model without explicitly deciding on the threshold — default 0.5 is almost never optimal.
Check calibration
If model output is used as a probability (credit risk: P(default) = 0.15 → charge 15% interest), you need calibration. A model with AUC=0.95 but poor calibration could output P=0.03 when the true risk is 30%. Use reliability diagrams and Expected Calibration Error (ECE) to audit.
Classification Metric Reference
| Metric | Formula | Use When | Watch Out For |
|---|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only | Misleading on any imbalance > 5:1 |
| Precision | TP/(TP+FP) | FP is costly (spam, fraud alerts, medical FP=unnecessary treatment) | Denominator is 0 if model never predicts positive |
| Recall (Sensitivity) | TP/(TP+FN) | FN is costly (cancer detection, security threats, critical failures) | Easy to game: predict everything positive → Recall=1.0 |
| F1 Score | 2×P×R/(P+R) | Balance P and R, imbalanced classes, single-number comparison | Treats P and R as equally important (use Fβ if not) |
| Fβ Score | (1+β²)×P×R / (β²×P+R) | β>1 weights Recall more (β=2 common for medical). β<1 weights Precision more | β must be chosen explicitly; no default β is 'correct' |
| ROC-AUC | Area under TPR vs FPR | Balanced classes, threshold-independent comparison, ranking quality | Measures ranking only (negatives scored 0.01 and positives 0.02 still give AUC = 1.0); optimistic on imbalanced data because FPR's denominator is all negatives, so many false positives barely move the curve |
| PR-AUC | Area under Precision-Recall curve | Imbalanced classes (fraud, rare disease, anomaly detection) | Absolute values are harder to interpret than ROC-AUC (a random classifier scores roughly the positive-class rate); best used as a relative comparison |
| Matthews Correlation Coefficient (MCC) | (TP·TN - FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Best single metric for binary classification with any class balance | Less intuitive than F1 but statistically superior |
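Every formula in the table maps onto a one-line scikit-learn call. A minimal sketch, assuming a hypothetical imbalanced binary problem with synthetic y_true labels and y_proba scores (all names and numbers here are illustrative):
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score, fbeta_score,
                             matthews_corrcoef, roc_auc_score, average_precision_score)

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)                        # ~1% positives
y_proba = np.clip(0.3 * y_true + rng.normal(0.2, 0.15, 10_000), 0, 1)   # noisy scores
y_pred = (y_proba >= 0.5).astype(int)                                   # default threshold

print("Precision:", precision_score(y_true, y_pred, zero_division=0))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("F2:       ", fbeta_score(y_true, y_pred, beta=2))       # recall-weighted
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_proba))            # computed from scores, not labels
print("PR-AUC:   ", average_precision_score(y_true, y_proba))  # average precision, the usual PR-AUC stand-in
Note that ROC-AUC and PR-AUC take the raw scores while the other metrics take hard predictions, which is exactly why a threshold still has to be chosen before deployment.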
[Diagram: Evaluation Metrics — Confusion Matrix to Business Impact]
[Diagram: ROC-AUC vs PR-AUC — Why They Diverge on Imbalanced Data]
Threshold Selection — The Forgotten Step
Most practitioners train a classifier, look at AUC, and call it done. But AUC is threshold-independent — you still need to pick an operating threshold to deploy. The right threshold depends on the business problem:
Expected Value Maximization: for fraud detection, the optimal threshold maximizes: E[value] = TP_rate × V(catch_fraud) - FP_rate × C(block_legit). If catching fraud is worth $500 but blocking a legitimate customer costs $50, set threshold to equalize marginal gains.
F-score at threshold: plot F1 (or Fβ) against the threshold and pick the maximum. This selects the point on the precision-recall curve that best balances the two for your chosen β.
Youden's J statistic: J = Sensitivity + Specificity − 1 = TPR − FPR. Maximizing J picks the ROC point with the greatest vertical distance above the chance diagonal. Works best when classes are roughly balanced and FP/FN costs are similar.
Precision at k: in search/recommendation, evaluate Precision@10 — are the top 10 results relevant? Threshold is implicit in the cutoff k.
Threshold Optimization from Scratch
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, fbeta_score

def find_optimal_threshold(y_true, y_proba, method="f1", beta=1.0,
                           tp_value=500.0, fp_cost=50.0):
    """
    Find an operating threshold using different criteria.
    y_true, y_proba: 1-D arrays of labels and predicted probabilities.
    method options:
        "f1"             — maximize F-beta score
        "youden"         — maximize TPR - FPR (good for balanced classes)
        "ev"             — maximize expected value (uses tp_value, fp_cost)
        "prec_at_recall" — max precision subject to recall >= target
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    thresholds = np.linspace(0, 1, 300)
    if method == "f1":
        scores = []
        for t in thresholds:
            y_pred = (y_proba >= t).astype(int)
            scores.append(fbeta_score(y_true, y_pred, beta=beta, zero_division=0))
        best_t = thresholds[np.argmax(scores)]
    elif method == "youden":
        fpr, tpr, roc_thresh = roc_curve(y_true, y_proba)
        j = tpr - fpr
        best_t = roc_thresh[np.argmax(j)]
    elif method == "ev":
        ev_scores = []
        for t in thresholds:
            y_pred = (y_proba >= t).astype(int)
            tp = ((y_pred == 1) & (y_true == 1)).sum()
            fp = ((y_pred == 1) & (y_true == 0)).sum()
            ev_scores.append(tp * tp_value - fp * fp_cost)
        best_t = thresholds[np.argmax(ev_scores)]
    elif method == "prec_at_recall":
        # highest precision among operating points with recall >= target
        target_recall = 0.8
        prec, rec, pr_thresh = precision_recall_curve(y_true, y_proba)
        # precision_recall_curve returns one more (P, R) point than thresholds;
        # drop the final (P=1, R=0) point so the arrays align with pr_thresh
        prec, rec = prec[:-1], rec[:-1]
        valid = rec >= target_recall
        if valid.any():
            best_t = pr_thresh[valid][np.argmax(prec[valid])]
        else:
            best_t = 0.5  # no threshold meets the recall target; fall back
    return best_t
Regression Metrics Reference
| Metric | Formula | Penalizes Large Errors? | Use When |
|---|---|---|---|
| MAE | mean(|y − ŷ|) | No (linear) | Robust to outliers, interpretable in units of target. 'Average prediction is off by $X'. |
| RMSE | √mean((y − ŷ)²) | Yes (quadratic) | When large errors are especially costly (financial risk, safety systems). Same units as target. |
| MAPE | mean(|y − ŷ| / |y|) × 100% | No | When relative error matters more than absolute. Warning: undefined when y=0; over-predictions can exceed 100% error while under-predictions cannot, so optimizing it biases models toward under-forecasting. |
| SMAPE | mean(2|y-ŷ|/(|y|+|ŷ|)) × 100% | No | Symmetric variant of MAPE, handles y=0 better. Still asymmetric: under-predictions are penalized more, biasing models toward over-forecasting. |
| R² (R-squared) | 1 − SS_res/SS_tot | Yes (squared residuals) | Fraction of variance explained. Can be negative (model worse than predicting the mean). Misleading on non-linear models. |
| Huber Loss | quadratic if |r|≤δ, else linear | Partial | Robust regression — acts like MSE for small errors, MAE for outliers. δ is a hyperparameter. |
| Pinball (Quantile) | max(τ·r, (τ-1)·r) | No | Quantile regression / prediction intervals. τ=0.9 → predict 90th percentile of distribution. |
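Most of these are one-liners, but Huber and pinball loss are worth writing out once to see the table's formulas in action. A minimal sketch with made-up numbers:
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond: robust to outliers."""
    r = y_true - y_pred
    small = np.abs(r) <= delta
    return np.mean(np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta)))

def pinball_loss(y_true, y_pred, tau=0.9):
    """Quantile (pinball) loss: minimized when y_pred is the tau-quantile of y."""
    r = y_true - y_pred
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

y = np.array([100.0, 102.0, 98.0, 250.0])      # one outlier
yhat = np.array([101.0, 100.0, 99.0, 110.0])
print("MAE: ", np.mean(np.abs(y - yhat)))            # 36.0
print("RMSE:", np.sqrt(np.mean((y - yhat) ** 2)))    # ≈ 70; dominated by the outlier
print("Huber(δ=5):", huber_loss(y, yhat, delta=5.0))
print("Pinball(τ=0.9):", pinball_loss(y, yhat, tau=0.9))
RMSE blows up on the single outlier while MAE and Huber stay interpretable, which is exactly the tradeoff the table describes.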
Ranking Metrics — NDCG, MAP, MRR
MRR — Mean Reciprocal Rank
MRR = (1/|Q|) Σ 1/rank_first_relevant. Measures where the first relevant result appears. If the relevant result is rank 1: 1.0. Rank 2: 0.5. Rank 5: 0.2. Best for: single-answer questions, site search, QA systems. Limitation: only cares about the FIRST relevant result — ignores all others.
MAP — Mean Average Precision
AP = Σₖ P(k) × rel(k) / |relevant_docs|, where P(k) = precision at cutoff k, rel(k) = 1 if item at rank k is relevant. MAP = mean(AP) over queries. Good for: multi-relevant document retrieval. Weights early relevant results higher than late ones. Limitation: assumes binary relevance (relevant/not relevant).
NDCG — Normalized Discounted Cumulative Gain
DCG = Σₖ (2^rel_k - 1) / log₂(k+1). The log₂(k+1) discounts lower positions. NDCG = DCG / IDCG (ideal DCG). Allows graded relevance (0=not relevant, 1=somewhat, 2=highly relevant) — key advantage over MAP. Best for: search ranking, recommendation systems where some results are 'better' than others. NDCG@10 measures quality of top 10 results.
Precision@K and Recall@K
P@K = fraction of top-K results that are relevant. R@K = fraction of ALL relevant items that appear in top-K. P@10 common in web search. R@K used in recommendation (did we surface the relevant items?). These are the most interpretable ranking metrics for stakeholders.
NDCG Implementation from Scratch
import numpy as np
def dcg_at_k(relevances: list, k: int) -> float:
"""
Discounted Cumulative Gain at K.
relevances: list of relevance scores (0, 1, 2, ...) in ranked order
"""
relevances = np.array(relevances[:k], dtype=float)
if len(relevances) == 0:
return 0.0
# Position 1-indexed: positions 1, 2, ..., k
positions = np.arange(1, len(relevances) + 1)
discounts = np.log2(positions + 1) # log2(2)=1, log2(3)≈1.58, ...
gains = (2 ** relevances - 1) / discounts
return gains.sum()
def ndcg_at_k(relevances: list, k: int) -> float:
"""
Normalized DCG at K: DCG / IDCG.
IDCG = DCG of the ideal (perfectly sorted) ranking.
"""
ideal = sorted(relevances, reverse=True) # best possible ordering
idcg = dcg_at_k(ideal, k)
if idcg == 0:
return 0.0
return dcg_at_k(relevances, k) / idcg
# Example: search results with graded relevance
# 2=highly relevant, 1=somewhat, 0=not relevant
actual_order = [2, 0, 1, 0, 2]  # what the model returned
ideal_order = [2, 2, 1, 0, 0]   # best possible order (ndcg_at_k computes this internally)
print(f"NDCG@5: {ndcg_at_k(actual_order, k=5):.4f}") # ≈ 0.8642
print(f"NDCG@3: {ndcg_at_k(actual_order, k=3):.4f}") # ≈ 0.6490
def mean_reciprocal_rank(queries_results: list[list[bool]]) -> float:
"""MRR over a list of query results. Each inner list = [relevant?, ...] in rank order."""
reciprocal_ranks = []
for results in queries_results:
for rank, is_relevant in enumerate(results, 1):
if is_relevant:
reciprocal_ranks.append(1.0 / rank)
break
else:
reciprocal_ranks.append(0.0) # no relevant result found
return np.mean(reciprocal_ranks)
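MAP and Precision@K, described above, follow the same pattern. A minimal sketch assuming binary relevance labels in rank order (the example list is illustrative):
def precision_at_k(relevant: list, k: int) -> float:
    """Fraction of the top-K results that are relevant (binary relevance)."""
    return sum(relevant[:k]) / k if k > 0 else 0.0

def average_precision(relevant: list) -> float:
    """AP: average of precision@k over the ranks k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / k            # precision at this rank
    # divides by relevant items retrieved; if some relevant docs are never
    # retrieved, divide by the total relevant count instead
    return total / hits if hits else 0.0

ranked = [1, 0, 1, 0, 1]                  # relevant results at ranks 1, 3, 5
print("P@3:", precision_at_k(ranked, 3))  # 2/3
print("AP: ", average_precision(ranked))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
# MAP = mean of AP across all queries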
ROC-AUC vs PR-AUC on Imbalanced Data — The Classic Trap
With ~1% positive rate (fraud), a ROC-AUC of 0.99 sounds great but can hide a useless model. ROC-AUC is built from TPR and FPR, and the denominator of FPR is the total number of negatives, roughly 99% of the data. A model can rack up thousands of false positives while FPR barely moves, so the ROC curve and its AUC stay impressive. Precision, by contrast, compares true positives to everything the model flags, and it collapses. Concretely: at 1% prevalence, an operating point with TPR = 0.9 and FPR = 0.1 looks strong on the ROC curve, yet precision is only 0.9×1% / (0.9×1% + 0.1×99%) ≈ 8%, about 11 false alarms for every true fraud caught. Always prefer PR-AUC when the positive class is under ~10% of the data. This result is formalized in Saito & Rehmsmeier (2015), which shows the precision-recall curve is more informative than the ROC curve for class-imbalanced problems. This is a common interview trap.
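The divergence is easy to reproduce with synthetic scores. A minimal sketch (all numbers illustrative): a 1%-positive dataset where positives score higher on average but overlap heavily with negatives.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n = 100_000
y = (rng.random(n) < 0.01).astype(int)           # ~1% positives
# positives score higher on average, but the two distributions overlap
scores = np.where(y == 1, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

print("ROC-AUC:", roc_auc_score(y, scores))            # ≈ 0.92, looks excellent
print("PR-AUC: ", average_precision_score(y, scores))  # far lower: precision collapses at this prevalence
The same ranking quality produces two very different headline numbers; only the PR-AUC reflects how many false alarms each caught fraud actually costs.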
Model Calibration — When Probabilities Must Be Trusted
A model is well-calibrated if its predicted probabilities match empirical frequencies: of all samples where the model predicts P(positive) = 0.7, about 70% should actually be positive. Calibration matters in: insurance/credit (probability directly drives pricing), medical diagnosis (probability informs treatment decisions), ad bidding (probability × ad value determines bid), confidence-weighted ensembles.
Calibration Diagnostics:
- Reliability Diagram: bucket predictions into 10 bins (0–0.1, 0.1–0.2, …). Plot mean predicted probability vs. fraction of positives in each bucket. Perfect calibration = diagonal line.
- Expected Calibration Error (ECE): ECE = Σᵦ (|Bᵦ|/n) × |acc(Bᵦ) - conf(Bᵦ)|. Weighted average of accuracy-confidence gap per bin.
- Brier Score: mean squared error of probabilities. Lower = better. Brier = (1/n) Σ (p - y)². Decomposition: Brier = Uncertainty - Resolution + Calibration.
Calibration Methods:
- Platt Scaling: fit a logistic regression on top of model's raw scores on a held-out set. Best for SVMs and other non-probabilistic models.
- Isotonic Regression: non-parametric monotonic calibration. More flexible than Platt, but needs more data.
- Temperature Scaling (for neural nets): divide logits by scalar T before softmax. Single parameter, maintains accuracy while improving calibration. Modern standard for deep networks.
Calibration Check and Platt Scaling
import numpy as np
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

def expected_calibration_error(y_true, y_proba, n_bins=10):
    """
    ECE: weighted average gap between predicted confidence and actual accuracy.
    Lower is better. ECE < 0.05 is generally considered well-calibrated.
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    n = len(y_true)
    for i in range(n_bins):
        # right edge is inclusive for the last bin so predictions of exactly 1.0 are counted
        if i == n_bins - 1:
            mask = (y_proba >= bins[i]) & (y_proba <= bins[i + 1])
        else:
            mask = (y_proba >= bins[i]) & (y_proba < bins[i + 1])
        if mask.sum() == 0:
            continue
        bin_conf = y_proba[mask].mean()   # mean predicted probability in this bin
        bin_acc = y_true[mask].mean()     # actual fraction of positives in this bin
        ece += (mask.sum() / n) * abs(bin_conf - bin_acc)
    return ece
def temperature_scaling(logits, y_val, n_iter=100, lr=0.01):
    """
    Temperature scaling: calibrate a deep network by finding the scalar T
    that minimizes NLL on a held-out validation set.
    logits: (n_samples, n_classes) raw network outputs
    y_val:  (n_samples, n_classes) one-hot true labels
    Intuition: T > 1 → soften probabilities (overconfident model)
               T < 1 → sharpen probabilities (underconfident model)
    """
    T = 1.0
    for _ in range(n_iter):
        scaled = logits / T
        scaled = scaled - scaled.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(scaled) / np.exp(scaled).sum(axis=1, keepdims=True)
        # d(NLL)/dT = (1/T^2) * mean over samples of sum_k (y_k - p_k) * logit_k
        grad = np.mean(np.sum((y_val - p) * logits, axis=1)) / (T ** 2)
        T -= lr * grad        # plain gradient descent on the NLL
        T = max(0.1, T)       # keep T strictly positive
    return T
# Using sklearn's built-in calibration
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
base_model = GradientBoostingClassifier()
# method='sigmoid' = Platt scaling, method='isotonic' = isotonic regression
calibrated_model = CalibratedClassifierCV(base_model, method='sigmoid', cv=5)
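Putting it together: a sketch of the audit loop, assuming hypothetical splits X_train, y_train, X_val, y_val (not defined above) and the calibrated_model built in the previous snippet:
calibrated_model.fit(X_train, y_train)               # X_train, y_train are placeholders
p_val = calibrated_model.predict_proba(X_val)[:, 1]

# reliability diagram points: observed positive rate vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_val, p_val, n_bins=10)

print("ECE:", expected_calibration_error(y_val, p_val, n_bins=10))
Plotting frac_pos against mean_pred gives the reliability diagram; a well-calibrated model hugs the diagonal and keeps ECE low.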
Multiclass Metrics — Macro vs Micro vs Weighted
| Averaging | How Computed | Use When |
|---|---|---|
| Micro-average | Pool all TP, FP, FN across classes. Then compute P/R/F1. | When overall performance across all samples matters. Dominated by majority class. |
| Macro-average | Compute P/R/F1 per class. Then take unweighted mean. | When each class is equally important regardless of size. Better for imbalanced multiclass. |
| Weighted-average | Compute P/R/F1 per class. Mean weighted by class support (sample count). | What sklearn's classification_report shows as 'weighted avg'. Balances between micro and macro. Good general-purpose. |
| Per-class (OvR) | One class vs. all others at a time. | When debugging which specific class is performing poorly. |
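In scikit-learn the table above corresponds to the average parameter. A minimal sketch with a made-up 3-class example where class 0 dominates:
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]    # class 0 dominates
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

print("micro:    ", f1_score(y_true, y_pred, average="micro"))    # pooled TP/FP/FN
print("macro:    ", f1_score(y_true, y_pred, average="macro"))    # unweighted mean over classes
print("weighted: ", f1_score(y_true, y_pred, average="weighted")) # weighted by class support
print("per-class:", f1_score(y_true, y_pred, average=None))       # one F1 per class (OvR view)
Because the majority class is predicted well, micro looks rosier than macro; the gap between the two is itself a quick imbalance diagnostic.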
Interview Scenario: Metric Selection for Medical Diagnosis
Setup: Binary classifier for detecting rare cardiac anomaly. ~0.5% of patients have the condition (illustrative). 50,000 patients/year scanned.
Wrong choice — Accuracy: ~99.5% by predicting everyone negative. Useless.
Wrong choice — ROC-AUC: could be 0.97 and still deliver terrible precision at every usable threshold, because the huge negative class keeps FPR small no matter how many false alarms are raised.
Right choice — PR-AUC: Directly measures quality on the rare class. Plot the full P-R curve and decide: 'We want at least 90% Recall (miss at most 10% of true cases), and we can tolerate Precision of 30% (70% of flagged patients are false alarms, acceptable given the alternative).'
Right operating threshold: find threshold where Recall >= 0.90, then pick the highest Precision point on the P-R curve satisfying that constraint.
Calibration: The cardiologist needs to know if P=0.8 means 80% likely positive or just 'high risk'. Use Platt scaling on a held-out set. Report ECE alongside AUC in model card.
Business metric: downstream, track (missed anomalies per 1,000 scans) vs. (unnecessary follow-ups per 1,000 scans). These two numbers drive the actual cost-benefit analysis.
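A sketch of that operating-point search, assuming hypothetical validation labels y_val and model probabilities p_val, and that at least one threshold meets the recall constraint:
import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thresh = precision_recall_curve(y_val, p_val)
prec, rec = prec[:-1], rec[:-1]        # align with thresholds
meets_recall = rec >= 0.90             # clinical constraint: miss at most 10% of true cases
best_idx = int(np.argmax(np.where(meets_recall, prec, -np.inf)))
operating_threshold = thresh[best_idx]
print(f"threshold={operating_threshold:.3f}, "
      f"precision={prec[best_idx]:.2f}, recall={rec[best_idx]:.2f}")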