ML Evaluation Metrics: The Complete Guide
Know exactly which metric to use for which problem type, and why. Covers Precision, Recall, F1, ROC-AUC, PR-AUC, NDCG, calibration, regression metrics, and when each is misleading. 10 hard interview questions with detailed answers.
The Metric Selection Problem
Choosing the wrong metric is one of the most common ML interview mistakes. Using accuracy on an imbalanced dataset (99% negative examples) makes a model that always predicts negative look 99% accurate — but it's completely useless.
Every metric choice involves a tradeoff that must be tied to business impact. Ask these four questions before choosing any metric:
- What is the class distribution? (balanced vs. imbalanced)
- What is the cost of each error type? (False Positive vs. False Negative)
- Is this a ranking or classification problem?
- Does the output probability need to be calibrated — trusted as a real probability?
Metric Decision Framework
Understand class distribution
Balanced classes → accuracy is fine. Imbalanced (>10:1) → never use accuracy alone. For 99:1 imbalanced classes, use PR-AUC, not ROC-AUC. Rule: if the minority class is what you care about (fraud, cancer, rare failure), ROC-AUC can be dangerously optimistic.
Identify cost of errors
What's worse: False Positive or False Negative? Cancer screening: FN is catastrophic (missed cancer) → maximize Recall. Spam filter: FP is bad (real email in spam) → maximize Precision. Fraud at a bank: FN (missed fraud) costs money but FP (blocked legitimate customer) damages trust — business determines the ratio.
Choose primary metric
High FP cost → optimize Precision. High FN cost → optimize Recall. Balance both → F1. Ranking/search → NDCG or MAP. Binary classification, imbalanced → PR-AUC. Binary, balanced → ROC-AUC. Regression → RMSE if large errors matter more, MAE if you want equal weight.
Set a constraint metric
Usually 'maximize Recall subject to Precision > 0.8' or vice versa. Threshold tuning gives you the curve; the operating point is where business value is maximized. Never ship a model without explicitly deciding on the threshold — default 0.5 is almost never optimal.
Check calibration
If model output is used as a probability (credit risk: P(default) = 0.15 → charge 15% interest), you need calibration. A model with AUC=0.95 but poor calibration could output P=0.03 when the true risk is 30%. Use reliability diagrams and Expected Calibration Error (ECE) to audit.
Classification Metric Reference
| Metric | Formula | Use When | Watch Out For |
|---|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only | Misleading on any imbalance > 5:1 |
| Precision | TP/(TP+FP) | FP is costly (spam, fraud alerts, medical FP=unnecessary treatment) | Denominator is 0 if model never predicts positive |
| Recall (Sensitivity) | TP/(TP+FN) | FN is costly (cancer detection, security threats, critical failures) | Easy to game: predict everything positive → Recall=1.0 |
| F1 Score | 2×P×R/(P+R) | Balance P and R, imbalanced classes, single-number comparison | Treats P and R as equally important (use Fβ if not) |
| Fβ Score | (1+β²)×P×R / (β²×P+R) | β>1 weights Recall more (β=2 common for medical). β<1 weights Precision more | β must be chosen explicitly; no default β is 'correct' |
| ROC-AUC | Area under TPR vs FPR | Balanced classes, threshold-independent comparison, ranking quality | Measures ranking only (negatives scored 0.01 and positives 0.02 still give AUC = 1.0); optimistic on imbalanced data because FPR's denominator is all negatives, so many false positives barely move the curve |
| PR-AUC | Area under Precision-Recall curve | Imbalanced classes (fraud, rare disease, anomaly detection) | Absolute values are harder to interpret than ROC-AUC (a random classifier scores roughly the positive-class rate); best used as a relative comparison |
| Matthews Correlation Coefficient (MCC) | (TP·TN - FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Best single metric for binary classification with any class balance | Less intuitive than F1 but statistically superior |
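Every formula in the table maps onto a one-line scikit-learn call. A minimal sketch, assuming a hypothetical imbalanced binary problem with synthetic y_true labels and y_proba scores (all names and numbers here are illustrative):
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score, fbeta_score,
                             matthews_corrcoef, roc_auc_score, average_precision_score)

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)                        # ~1% positives
y_proba = np.clip(0.3 * y_true + rng.normal(0.2, 0.15, 10_000), 0, 1)   # noisy scores
y_pred = (y_proba >= 0.5).astype(int)                                   # default threshold

print("Precision:", precision_score(y_true, y_pred, zero_division=0))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("F2:       ", fbeta_score(y_true, y_pred, beta=2))       # recall-weighted
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_proba))            # computed from scores, not labels
print("PR-AUC:   ", average_precision_score(y_true, y_proba))  # average precision, the usual PR-AUC stand-in
Note that ROC-AUC and PR-AUC take the raw scores while the other metrics take hard predictions, which is exactly why a threshold still has to be chosen before deployment.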
[Diagram: Evaluation Metrics — Confusion Matrix to Business Impact]
[Diagram: ROC-AUC vs PR-AUC — Why They Diverge on Imbalanced Data]
Threshold Selection — The Forgotten Step
Most practitioners train a classifier, look at AUC, and call it done. But AUC is threshold-independent — you still need to pick an operating threshold to deploy. The right threshold depends on the business problem:
Expected Value Maximization: for fraud detection, the optimal threshold maximizes: E[value] = TP_rate × V(catch_fraud) - FP_rate × C(block_legit). If catching fraud is worth $500 but blocking a legitimate customer costs $50, set threshold to equalize marginal gains.
F-score at threshold: plot F1 (or Fβ) against the threshold and pick the maximum. This selects the point on the precision-recall curve that best balances the two for your chosen β.
Youden's J statistic: J = Sensitivity + Specificity − 1 = TPR − FPR. Maximizing J picks the ROC point with the greatest vertical distance above the chance diagonal. Works best when classes are roughly balanced and FP/FN costs are similar.
Precision at k: in search/recommendation, evaluate Precision@10 — are the top 10 results relevant? Threshold is implicit in the cutoff k.
Threshold Optimization from Scratch
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, fbeta_score

def find_optimal_threshold(y_true, y_proba, method="f1", beta=1.0,
                           tp_value=500.0, fp_cost=50.0):
    """
    Find an operating threshold using different criteria.
    y_true, y_proba: 1-D arrays of labels and predicted probabilities.
    method options:
        "f1"             — maximize F-beta score
        "youden"         — maximize TPR - FPR (good for balanced classes)
        "ev"             — maximize expected value (uses tp_value, fp_cost)
        "prec_at_recall" — max precision subject to recall >= target
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    thresholds = np.linspace(0, 1, 300)
    if method == "f1":
        scores = []
        for t in thresholds:
            y_pred = (y_proba >= t).astype(int)
            scores.append(fbeta_score(y_true, y_pred, beta=beta, zero_division=0))
        best_t = thresholds[np.argmax(scores)]
    elif method == "youden":
        fpr, tpr, roc_thresh = roc_curve(y_true, y_proba)
        j = tpr - fpr
        best_t = roc_thresh[np.argmax(j)]
    elif method == "ev":
        ev_scores = []
        for t in thresholds:
            y_pred = (y_proba >= t).astype(int)
            tp = ((y_pred == 1) & (y_true == 1)).sum()
            fp = ((y_pred == 1) & (y_true == 0)).sum()
            ev_scores.append(tp * tp_value - fp * fp_cost)
        best_t = thresholds[np.argmax(ev_scores)]
    elif method == "prec_at_recall":
        # highest precision among operating points with recall >= target
        target_recall = 0.8
        prec, rec, pr_thresh = precision_recall_curve(y_true, y_proba)
        # precision_recall_curve returns one more (P, R) point than thresholds;
        # drop the final (P=1, R=0) point so the arrays align with pr_thresh
        prec, rec = prec[:-1], rec[:-1]
        valid = rec >= target_recall
        if valid.any():
            best_t = pr_thresh[valid][np.argmax(prec[valid])]
        else:
            best_t = 0.5  # no threshold meets the recall target; fall back
    return best_t
Regression Metrics Reference
| Metric | Formula | Penalizes Large Errors? | Use When |
|---|---|---|---|
| MAE | mean(|y − ŷ|) | No (linear) | Robust to outliers, interpretable in units of target. 'Average prediction is off by $X'. |
| RMSE | √mean((y − ŷ)²) | Yes (quadratic) | When large errors are especially costly (financial risk, safety systems). Same units as target. |
| MAPE | mean(|y − ŷ| / |y|) × 100% | No | When relative error matters more than absolute. Warning: undefined when y=0; over-predictions can exceed 100% error while under-predictions cannot, so optimizing it biases models toward under-forecasting. |
| SMAPE | mean(2|y-ŷ|/(|y|+|ŷ|)) × 100% | No | Symmetric variant of MAPE, handles y=0 better. Still asymmetric: under-predictions are penalized more, biasing models toward over-forecasting. |
| R² (R-squared) | 1 − SS_res/SS_tot | Yes (squared residuals) | Fraction of variance explained. Can be negative (model worse than predicting the mean). Misleading on non-linear models. |
| Huber Loss | quadratic if |r|≤δ, else linear | Partial | Robust regression — acts like MSE for small errors, MAE for outliers. δ is a hyperparameter. |
| Pinball (Quantile) | max(τ·r, (τ-1)·r) | No | Quantile regression / prediction intervals. τ=0.9 → predict 90th percentile of distribution. |
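Most of these are one-liners, but Huber and pinball loss are worth writing out once to see the table's formulas in action. A minimal sketch with made-up numbers:
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond: robust to outliers."""
    r = y_true - y_pred
    small = np.abs(r) <= delta
    return np.mean(np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta)))

def pinball_loss(y_true, y_pred, tau=0.9):
    """Quantile (pinball) loss: minimized when y_pred is the tau-quantile of y."""
    r = y_true - y_pred
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

y = np.array([100.0, 102.0, 98.0, 250.0])      # one outlier
yhat = np.array([101.0, 100.0, 99.0, 110.0])
print("MAE: ", np.mean(np.abs(y - yhat)))            # 36.0
print("RMSE:", np.sqrt(np.mean((y - yhat) ** 2)))    # ≈ 70; dominated by the outlier
print("Huber(δ=5):", huber_loss(y, yhat, delta=5.0))
print("Pinball(τ=0.9):", pinball_loss(y, yhat, tau=0.9))
RMSE blows up on the single outlier while MAE and Huber stay interpretable, which is exactly the tradeoff the table describes.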
Ranking Metrics — NDCG, MAP, MRR
MRR — Mean Reciprocal Rank
MRR = (1/|Q|) Σ 1/rank_first_relevant. Measures where the first relevant result appears. If the relevant result is rank 1: 1.0. Rank 2: 0.5. Rank 5: 0.2. Best for: single-answer questions, site search, QA systems. Limitation: only cares about the FIRST relevant result — ignores all others.
MAP — Mean Average Precision
AP = Σₖ P(k) × rel(k) / |relevant_docs|, where P(k) = precision at cutoff k, rel(k) = 1 if item at rank k is relevant. MAP = mean(AP) over queries. Good for: multi-relevant document retrieval. Weights early relevant results higher than late ones. Limitation: assumes binary relevance (relevant/not relevant).
NDCG — Normalized Discounted Cumulative Gain
DCG = Σₖ (2^rel_k - 1) / log₂(k+1). The log₂(k+1) discounts lower positions. NDCG = DCG / IDCG (ideal DCG). Allows graded relevance (0=not relevant, 1=somewhat, 2=highly relevant) — key advantage over MAP. Best for: search ranking, recommendation systems where some results are 'better' than others. NDCG@10 measures quality of top 10 results.
Precision@K and Recall@K
P@K = fraction of top-K results that are relevant. R@K = fraction of ALL relevant items that appear in top-K. P@10 common in web search. R@K used in recommendation (did we surface the relevant items?). These are the most interpretable ranking metrics for stakeholders.
NDCG Implementation from Scratch
import numpy as np
def dcg_at_k(relevances: list, k: int) -> float:
"""
Discounted Cumulative Gain at K.
relevances: list of relevance scores (0, 1, 2, ...) in ranked order
"""
relevances = np.array(relevances[:k], dtype=float)
if len(relevances) == 0:
return 0.0
# Position 1-indexed: positions 1, 2, ..., k
positions = np.arange(1, len(relevances) + 1)
discounts = np.log2(positions + 1) # log2(2)=1, log2(3)≈1.58, ...
gains = (2 ** relevances - 1) / discounts
return gains.sum()
def ndcg_at_k(relevances: list, k: int) -> float:
"""
Normalized DCG at K: DCG / IDCG.
IDCG = DCG of the ideal (perfectly sorted) ranking.
"""
ideal = sorted(relevances, reverse=True) # best possible ordering
idcg = dcg_at_k(ideal, k)
if idcg == 0:
return 0.0
return dcg_at_k(relevances, k) / idcg
# Example: search results with graded relevance
# 2=highly relevant, 1=somewhat, 0=not relevant
actual_order = [2, 0, 1, 0, 2]  # what the model returned
ideal_order = [2, 2, 1, 0, 0]   # best possible order (ndcg_at_k computes this internally)
print(f"NDCG@5: {ndcg_at_k(actual_order, k=5):.4f}") # ≈ 0.8642
print(f"NDCG@3: {ndcg_at_k(actual_order, k=3):.4f}") # ≈ 0.6490
def mean_reciprocal_rank(queries_results: list[list[bool]]) -> float:
"""MRR over a list of query results. Each inner list = [relevant?, ...] in rank order."""
reciprocal_ranks = []
for results in queries_results:
for rank, is_relevant in enumerate(results, 1):
if is_relevant:
reciprocal_ranks.append(1.0 / rank)
break
else:
reciprocal_ranks.append(0.0) # no relevant result found
return np.mean(reciprocal_ranks)
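MAP and Precision@K, described above, follow the same pattern. A minimal sketch assuming binary relevance labels in rank order (the example list is illustrative):
def precision_at_k(relevant: list, k: int) -> float:
    """Fraction of the top-K results that are relevant (binary relevance)."""
    return sum(relevant[:k]) / k if k > 0 else 0.0

def average_precision(relevant: list) -> float:
    """AP: average of precision@k over the ranks k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / k            # precision at this rank
    # divides by relevant items retrieved; if some relevant docs are never
    # retrieved, divide by the total relevant count instead
    return total / hits if hits else 0.0

ranked = [1, 0, 1, 0, 1]                  # relevant results at ranks 1, 3, 5
print("P@3:", precision_at_k(ranked, 3))  # 2/3
print("AP: ", average_precision(ranked))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
# MAP = mean of AP across all queries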
ROC-AUC vs PR-AUC on Imbalanced Data — The Classic Trap
With ~1% positive rate (fraud), a ROC-AUC of 0.99 sounds great but can hide a useless model. ROC-AUC is built from TPR and FPR, and the denominator of FPR is the total number of negatives, roughly 99% of the data. A model can rack up thousands of false positives while FPR barely moves, so the ROC curve and its AUC stay impressive. Precision, by contrast, compares true positives to everything the model flags, and it collapses. Concretely: at 1% prevalence, an operating point with TPR = 0.9 and FPR = 0.1 looks strong on the ROC curve, yet precision is only 0.9×1% / (0.9×1% + 0.1×99%) ≈ 8%, about 11 false alarms for every true fraud caught. Always prefer PR-AUC when the positive class is under ~10% of the data. This result is formalized in Saito & Rehmsmeier (2015), which shows the precision-recall curve is more informative than the ROC curve for class-imbalanced problems. This is a common interview trap.
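The divergence is easy to reproduce with synthetic scores. A minimal sketch (all numbers illustrative): a 1%-positive dataset where positives score higher on average but overlap heavily with negatives.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
n = 100_000
y = (rng.random(n) < 0.01).astype(int)           # ~1% positives
# positives score higher on average, but the two distributions overlap
scores = np.where(y == 1, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

print("ROC-AUC:", roc_auc_score(y, scores))            # ≈ 0.92, looks excellent
print("PR-AUC: ", average_precision_score(y, scores))  # far lower: precision collapses at this prevalence
The same ranking quality produces two very different headline numbers; only the PR-AUC reflects how many false alarms each caught fraud actually costs.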
Model Calibration — When Probabilities Must Be Trusted
A model is well-calibrated if its predicted probabilities match empirical frequencies: of all samples where the model predicts P(positive) = 0.7, about 70% should actually be positive. Calibration matters in: insurance/credit (probability directly drives pricing), medical diagnosis (probability informs treatment decisions), ad bidding (probability × ad value determines bid), confidence-weighted ensembles.
Calibration Diagnostics:
- Reliability Diagram: bucket predictions into 10 bins (0–0.1, 0.1–0.2, …). Plot mean predicted probability vs. fraction of positives in each bucket. Perfect calibration = diagonal line.
- Expected Calibration Error (ECE): ECE = Σᵦ (|Bᵦ|/n) × |acc(Bᵦ) - conf(Bᵦ)|. Weighted average of accuracy-confidence gap per bin.
- Brier Score: mean squared error of probabilities. Lower = better. Brier = (1/n) Σ (p - y)². Decomposition: Brier = Uncertainty - Resolution + Calibration.
Calibration Methods:
- Platt Scaling: fit a logistic regression on top of model's raw scores on a held-out set. Best for SVMs and other non-probabilistic models.
- Isotonic Regression: non-parametric monotonic calibration. More flexible than Platt, but needs more data.
- Temperature Scaling (for neural nets): divide logits by scalar T before softmax. Single parameter, maintains accuracy while improving calibration. Modern standard for deep networks.
Calibration Check and Platt Scaling
import numpy as np
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

def expected_calibration_error(y_true, y_proba, n_bins=10):
    """
    ECE: weighted average gap between predicted confidence and actual accuracy.
    Lower is better. ECE < 0.05 is generally considered well-calibrated.
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    n = len(y_true)
    for i in range(n_bins):
        # right edge is inclusive for the last bin so predictions of exactly 1.0 are counted
        if i == n_bins - 1:
            mask = (y_proba >= bins[i]) & (y_proba <= bins[i + 1])
        else:
            mask = (y_proba >= bins[i]) & (y_proba < bins[i + 1])
        if mask.sum() == 0:
            continue
        bin_conf = y_proba[mask].mean()   # mean predicted probability in this bin
        bin_acc = y_true[mask].mean()     # actual fraction of positives in this bin
        ece += (mask.sum() / n) * abs(bin_conf - bin_acc)
    return ece
def temperature_scaling(logits, y_val, n_iter=100, lr=0.01):
    """
    Temperature scaling: calibrate a deep network by finding the scalar T
    that minimizes NLL on a held-out validation set.
    logits: (n_samples, n_classes) raw network outputs
    y_val:  (n_samples, n_classes) one-hot true labels
    Intuition: T > 1 → soften probabilities (overconfident model)
               T < 1 → sharpen probabilities (underconfident model)
    """
    T = 1.0
    for _ in range(n_iter):
        scaled = logits / T
        scaled = scaled - scaled.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(scaled) / np.exp(scaled).sum(axis=1, keepdims=True)
        # d(NLL)/dT = (1/T^2) * mean over samples of sum_k (y_k - p_k) * logit_k
        grad = np.mean(np.sum((y_val - p) * logits, axis=1)) / (T ** 2)
        T -= lr * grad        # plain gradient descent on the NLL
        T = max(0.1, T)       # keep T strictly positive
    return T
# Using sklearn's built-in calibration
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
base_model = GradientBoostingClassifier()
# method='sigmoid' = Platt scaling, method='isotonic' = isotonic regression
calibrated_model = CalibratedClassifierCV(base_model, method='sigmoid', cv=5)
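Putting it together: a sketch of the audit loop, assuming hypothetical splits X_train, y_train, X_val, y_val (not defined above) and the calibrated_model built in the previous snippet:
calibrated_model.fit(X_train, y_train)               # X_train, y_train are placeholders
p_val = calibrated_model.predict_proba(X_val)[:, 1]

# reliability diagram points: observed positive rate vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_val, p_val, n_bins=10)

print("ECE:", expected_calibration_error(y_val, p_val, n_bins=10))
Plotting frac_pos against mean_pred gives the reliability diagram; a well-calibrated model hugs the diagonal and keeps ECE low.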
Multiclass Metrics — Macro vs Micro vs Weighted
| Averaging | How Computed | Use When |
|---|---|---|
| Micro-average | Pool all TP, FP, FN across classes. Then compute P/R/F1. | When overall performance across all samples matters. Dominated by majority class. |
| Macro-average | Compute P/R/F1 per class. Then take unweighted mean. | When each class is equally important regardless of size. Better for imbalanced multiclass. |
| Weighted-average | Compute P/R/F1 per class. Mean weighted by class support (sample count). | What sklearn's classification_report shows as 'weighted avg'. Balances between micro and macro. Good general-purpose. |
| Per-class (OvR) | One class vs. all others at a time. | When debugging which specific class is performing poorly. |
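In scikit-learn the table above corresponds to the average parameter. A minimal sketch with a made-up 3-class example where class 0 dominates:
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]    # class 0 dominates
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

print("micro:    ", f1_score(y_true, y_pred, average="micro"))    # pooled TP/FP/FN
print("macro:    ", f1_score(y_true, y_pred, average="macro"))    # unweighted mean over classes
print("weighted: ", f1_score(y_true, y_pred, average="weighted")) # weighted by class support
print("per-class:", f1_score(y_true, y_pred, average=None))       # one F1 per class (OvR view)
Because the majority class is predicted well, micro looks rosier than macro; the gap between the two is itself a quick imbalance diagnostic.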
Interview Scenario: Metric Selection for Medical Diagnosis
Setup: Binary classifier for detecting rare cardiac anomaly. ~0.5% of patients have the condition (illustrative). 50,000 patients/year scanned.
Wrong choice — Accuracy: ~99.5% by predicting everyone negative. Useless.
Wrong choice — ROC-AUC: could be 0.97 and still deliver terrible precision at every usable threshold, because the huge negative class keeps FPR small no matter how many false alarms are raised.
Right choice — PR-AUC: Directly measures quality on the rare class. Plot the full P-R curve and decide: 'We want at least 90% Recall (miss at most 10% of true cases), and we can tolerate Precision of 30% (70% of flagged patients are false alarms, acceptable given the alternative).'
Right operating threshold: find threshold where Recall >= 0.90, then pick the highest Precision point on the P-R curve satisfying that constraint.
Calibration: The cardiologist needs to know if P=0.8 means 80% likely positive or just 'high risk'. Use Platt scaling on a held-out set. Report ECE alongside AUC in model card.
Business metric: downstream, track (missed anomalies per 1,000 scans) vs. (unnecessary follow-ups per 1,000 scans). These two numbers drive the actual cost-benefit analysis.
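A sketch of that operating-point search, assuming hypothetical validation labels y_val and model probabilities p_val, and that at least one threshold meets the recall constraint:
import numpy as np
from sklearn.metrics import precision_recall_curve

prec, rec, thresh = precision_recall_curve(y_val, p_val)
prec, rec = prec[:-1], rec[:-1]        # align with thresholds
meets_recall = rec >= 0.90             # clinical constraint: miss at most 10% of true cases
best_idx = int(np.argmax(np.where(meets_recall, prec, -np.inf)))
operating_threshold = thresh[best_idx]
print(f"threshold={operating_threshold:.3f}, "
      f"precision={prec[best_idx]:.2f}, recall={rec[best_idx]:.2f}")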