Probability Calibration: When Model Scores Must Be Real Probabilities
Most MLSD resources treat model output as a ranking score and stop there. This guide treats calibration as a first-class engineering concern, covering ECE, Platt scaling, isotonic regression, reliability diagrams, and the ways uncalibrated scores silently destroy revenue in ads, fraud, and pricing systems.
Why Calibration Is a Production Engineering Problem, Not a Research Detail
A classifier's output is a real number in [0, 1]. Most training objectives (cross-entropy / log-loss) produce scores that are monotonically related to the true probability, so a higher score means a higher probability, but the absolute values are often miscalibrated: a score of 0.8 does not mean there is an 80% chance of the positive class.
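A minimal sketch of this gap on synthetic data: the model below ranks exactly like the true probabilities but is overconfident (its scores are sharpened in logit space), so examples it scores near 0.8 are positive far less than 80% of the time. The probability range and the temperature of 0.5 are illustrative assumptions, not values from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: true probabilities, and labels drawn from them.
p_true = rng.uniform(0.05, 0.95, size=100_000)
y = rng.binomial(1, p_true)

# An overconfident model: identical ranking to p_true, but scores
# sharpened in logit space (temperature 0.5 is an arbitrary choice).
logits = np.log(p_true / (1 - p_true))
p_model = 1 / (1 + np.exp(-logits / 0.5))

# Examples the model scores near 0.8 are positive only ~67% of the time.
bucket = (p_model > 0.75) & (p_model < 0.85)
print(f"mean model score in bucket: {p_model[bucket].mean():.3f}")  # ~0.80
print(f"observed positive rate:     {y[bucket].mean():.3f}")        # ~0.67
```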
For systems that use scores as inputs to downstream decisions, this matters enormously:
- Ad bidding: bid price = P(click) × P(conversion|click) × advertiser_value. If P(click) is 2× the true probability, the system overbids on every auction and bleeds money (a worked sketch follows this list). Google's ad team found that a 1% miscalibration in CTR prediction translates to ~$100M/year in suboptimal bids at scale.
- Fraud scoring: if the model's output at threshold 0.9 corresponds to a true precision of only 0.7, then 30% of the users the system blocks are actually legitimate customers it believes are fraudulent: a severe trust and revenue problem.
- Expected revenue optimization: ranker score = P(click) × P(buy|click) × price. Even if each model ranks well in isolation, multiplying two miscalibrated probabilities can reorder items relative to true expected revenue, degrading revenue ordering independently of either model's ranking quality.
- Risk-adjusted lending: credit score must map to a probability of default for regulatory compliance. Raw model scores are not probabilities; calibration is legally required.
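To make the ad-bidding bullet concrete, here is a back-of-the-envelope sketch of the overbidding effect. All numbers (click rate, conversion rate, advertiser value) are invented for illustration.

```python
# Hypothetical auction economics; every constant here is an assumption.
p_click_true = 0.02         # the real click-through probability
p_click_model = 0.04        # model score reads 2x too high
p_conv_given_click = 0.10   # conversion rate given a click
advertiser_value = 50.0     # dollars per conversion

def bid(p_click: float) -> float:
    """Expected-value bid: P(click) * P(conversion|click) * value."""
    return p_click * p_conv_given_click * advertiser_value

print(f"correct bid: ${bid(p_click_true):.3f}")   # $0.100
print(f"actual bid:  ${bid(p_click_model):.3f}")  # $0.200
# The system pays up to 2x the true expected value on every auction this
# model touches, even though the *ranking* of ads may be unchanged.
```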
The counterintuitive insight: you can have a model with perfect ranking ability (AUC = 1.0) that is completely uncalibrated. AUC only measures whether the model scores positive examples above negative ones across all positive/negative pairs; it says nothing about whether a score of 0.7 actually means 70% of those examples are positive. This is the gap that calibration fills.
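A tiny demonstration of that gap, using scikit-learn's roc_auc_score on contrived labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Every positive outscores every negative, so ranking is perfect...
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67])
print(roc_auc_score(y_true, scores))  # 1.0

# ...yet the scores are wildly miscalibrated: the model says "~60%" for
# examples that are 0% positive and "~65%" for ones that are 100%.
# AUC is invariant to any monotone transform of the scores; calibration
# is not, which is why the two can diverge completely.
```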