Machine Learning·Intermediate

Probability Calibration: When Your Model's Probabilities Actually Mean Something

A senior-level ML differentiator most prep resources skip. Covers why calibration matters for expected-value decisions (ad bidding, fraud risk, medical scoring), how to measure miscalibration (ECE, Brier, reliability diagrams), calibration methods (Platt, isotonic, temperature scaling), and why modern deep networks are systematically overconfident.

50 min read · 3 sections · 1 interview question
Probability Calibration · Temperature Scaling · Platt Scaling · Isotonic Regression · Expected Calibration Error · Brier Score · Reliability Diagram · Label Smoothing · Deep Ensembles · CalibratedClassifierCV · Overconfidence · Proper Scoring Rules

Why Calibration Is the Differentiator Senior Candidates Miss

Most candidates conflate discrimination (can the model rank positives above negatives?) with calibration (do the predicted probabilities match empirical frequencies?). A model can achieve AUC = 0.99 while being catastrophically miscalibrated: if it outputs P = 0.95 for cases that are actually positive 60% of the time, any downstream system that multiplies the probability by a value — ad bids, expected loss, insurance premiums, expected-value thresholds — will be systematically wrong.

The non-obvious insight: calibration is not required for argmax classification. If you only need to decide "is this spam or not", ranking is enough and miscalibrated softmax works fine. Calibration becomes load-bearing the moment probability is used as a number, not a label.
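
A quick way to make the ranking-vs-magnitude distinction concrete is the sketch below (synthetic scores and the squaring transform are arbitrary illustrative choices): applying a monotonic transform to well-calibrated probabilities leaves ROC AUC untouched because the ordering is identical, while the Brier score degrades because the magnitudes no longer match empirical frequencies.

```python
# Sketch: a monotonic transform preserves ranking (AUC) but breaks calibration (Brier score).
# Synthetic scores and the squaring transform are illustrative, not from any particular model.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
p_calibrated = rng.uniform(0.01, 0.99, size=100_000)   # treat these as perfectly calibrated scores
y = rng.binomial(1, p_calibrated)                      # labels drawn from those very probabilities

p_distorted = p_calibrated ** 2                        # strictly increasing transform: same ranking

print("AUC   calibrated:", roc_auc_score(y, p_calibrated))
print("AUC   distorted: ", roc_auc_score(y, p_distorted))     # identical: ranking is unchanged
print("Brier calibrated:", brier_score_loss(y, p_calibrated))
print("Brier distorted: ", brier_score_loss(y, p_distorted))  # clearly worse: magnitudes are off
```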

Production cases where calibration is mandatory:

  • Ad CTR prediction: the bid per impression is set from expected value, P(click) × value-per-click. A 2× overconfident CTR model therefore bids 2× too high on every impression (see the short worked example after this list). Facebook's 2014 paper "Practical Lessons from Predicting Clicks on Ads at Facebook" explicitly logs calibration as a first-class deployment concern.
  • Fraud risk scores (Stripe Radar, bank AML): calibrated probability feeds rule thresholds, dollar-weighted expected loss, and regulatory reporting. An uncalibrated score makes the threshold meaningless across model versions.
  • Medical risk scoring: FDA guidance for AI/ML-based SaMD (Software as a Medical Device) treats calibration as a regulatory concern — a "90% risk of cardiac event" must mean 90% empirical frequency in the reference population.
  • Model stacking and Bayesian inference: downstream models assume their inputs are probabilities. Feeding them uncalibrated logits silently breaks the pipeline.
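
To put numbers on the CTR bullet, a tiny worked example with invented figures: at $5 of value per click, a model that predicts a 4% CTR when the true rate is 2% prices every impression at double its worth.

```python
# Illustrative numbers only: how a 2x-overconfident CTR estimate inflates the per-impression bid.
value_per_click = 5.00        # hypothetical advertiser value of one click, in dollars
true_ctr = 0.02               # empirical click-through rate
predicted_ctr = 0.04          # 2x-overconfident model output

fair_bid = true_ctr * value_per_click         # $0.10 -> what the impression is actually worth
model_bid = predicted_ctr * value_per_click   # $0.20 -> what the miscalibrated model bids

print(f"fair bid:  ${fair_bid:.2f}")
print(f"model bid: ${model_bid:.2f}  ({model_bid / fair_bid:.0f}x too high)")
```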

If the interviewer asks you to design any of these systems, and you say "train XGBoost and use predict_proba", you have already failed — predict_proba on a tree ensemble returns a confidence score, not a calibrated probability.
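
A minimal sketch of the fix that sentence points at: wrap the tree ensemble in scikit-learn's CalibratedClassifierCV so its raw scores are remapped by a calibrator fit on held-out folds. The gradient-boosting model, method="isotonic", cv=5, and the synthetic dataset are illustrative choices; recent scikit-learn versions name the wrapped model estimator (older releases used base_estimator), so it is passed positionally here.

```python
# Sketch: calibrating a tree ensemble's scores with scikit-learn's CalibratedClassifierCV.
# Model choice, method="isotonic", cv=5, and the synthetic dataset are illustrative assumptions.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

raw = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# cv=5: the ensemble is refit on each fold and the isotonic map is fit on the held-out part,
# so the calibrator never sees the data the model it corrects was trained on.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

print("Brier raw:       ", brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1]))
print("Brier calibrated:", brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]))
```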

IMPORTANT

What Interviewers Evaluate on Calibration

A 6/10 answer says "we can use Platt scaling to calibrate" and stops.

A 9/10 answer demonstrates four things in order:

  1. Distinguishes discrimination from calibration — recognizes that AUC, accuracy, and F1 are all insensitive to probability magnitude. Notes that any monotonic transform of scores preserves ranking but can destroy calibration.
  2. Knows which models miscalibrate and in which direction — single decision trees are overconfident at the extremes while bagged and boosted ensembles push scores away from 0 and 1 (Niculescu-Mizil & Caruana 2005), modern deep nets (> 50M params) are systematically overconfident (Guo 2017), and logistic regression trained with cross-entropy is calibrated on the training distribution but breaks under class-prior shift or resampling.
  3. Measures calibration with the right metric — ECE for binary with adaptive binning (Nixon 2019), Brier score as a strictly proper scoring rule with reliability-resolution-uncertainty decomposition (Murphy 1973), classwise ECE for multi-class. Knows that top-label ECE hides per-class miscalibration.
  4. Picks the right calibrator for the situation — temperature scaling for neural nets (preserves argmax, one parameter), Platt for small data or sigmoid-shaped miscalibration, isotonic for > 1K calibration points and arbitrary shapes, and critically: fit on a held-out calibration set, never on training data (see the temperature-scaling sketch after this list).
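
A minimal sketch of the temperature-scaling option from item 4, in NumPy/SciPy. The synthetic logits and the bounded search range for T are illustrative assumptions; real use would cache held-out validation logits from the network. A single scalar T is fit by minimizing NLL, and dividing logits by a positive scalar never changes the per-row argmax.

```python
# Sketch of temperature scaling: fit one scalar T on held-out (logits, labels) by minimizing NLL.
# Synthetic logits stand in for a cached validation set; the search bounds for T are arbitrary.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)                 # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T, logits, labels):
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

# Toy setup: labels drawn from calibrated logits, then logits scaled 3x to fake overconfidence
rng = np.random.default_rng(0)
true_logits = rng.normal(0.0, 2.0, size=(5_000, 10))
labels = np.array([rng.choice(10, p=p) for p in softmax(true_logits)])
logits = true_logits * 3.0                               # same ranking, inflated confidence

res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels), method="bounded")
T = res.x
print(f"fitted temperature T = {T:.2f}")                 # should land near 3, the inflation factor

calibrated_probs = softmax(logits / T)                   # argmax unchanged, confidences shrunk
```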

Staff-level signal: mentioning that calibration does not survive distribution shift — you must monitor ECE in production and trigger recalibration on drift, not recalibrate once at training time.
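
The monitoring point is only actionable if ECE is cheap to recompute on a fresh labeled window. Below is a minimal sketch of binary ECE with equal-frequency (adaptive) bins, in the spirit of item 3; the bin count and the synthetic overconfident scores are illustrative choices, not prescriptions.

```python
# Sketch: binary expected calibration error (ECE) with equal-frequency ("adaptive") bins.
# n_bins=15 and the synthetic miscalibrated scores are illustrative assumptions.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=15):
    """Size-weighted mean |mean confidence - empirical positive rate| over equal-frequency bins."""
    order = np.argsort(y_prob)
    y_true, y_prob = np.asarray(y_true)[order], np.asarray(y_prob)[order]
    ece = 0.0
    for idx in np.array_split(np.arange(len(y_prob)), n_bins):   # adaptive: equal points per bin
        if len(idx) == 0:
            continue
        gap = abs(y_prob[idx].mean() - y_true[idx].mean())        # confidence vs observed frequency
        ece += (len(idx) / len(y_prob)) * gap
    return ece

# Toy check: calibrated scores vs the same scores pushed toward 1 (overconfident)
rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, size=50_000)
y = rng.binomial(1, p_true)
overconfident = np.clip(p_true * 1.5, 0, 1)

print("ECE calibrated:    ", expected_calibration_error(y, p_true))
print("ECE overconfident: ", expected_calibration_error(y, overconfident))
```

In production this would run on a recent window of scored-and-labeled traffic, with an alert when the value drifts past an agreed threshold, which is exactly the recalibrate-on-drift loop described above.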
