Probability Calibration: When Model Scores Must Be Real Probabilities
Most MLSD resources treat model output as a ranking score and stop there. This guide treats calibration as a first-class engineering concern, covering ECE, Platt scaling, isotonic regression, reliability diagrams, and the ways uncalibrated scores silently destroy revenue in ads, fraud, and pricing systems.
Why Calibration Is a Production Engineering Problem, Not a Research Detail
A classifier's output is a real number in [0, 1]. Most training objectives (cross-entropy / log-loss) produce scores that are monotonically related to the true probability, so a higher score means a higher probability, but the absolute values are often miscalibrated: a score of 0.8 does not mean there is an 80% chance of the positive class.
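A minimal sketch of this gap on synthetic data: the model below ranks exactly like the true probabilities but is overconfident (its scores are sharpened in logit space), so examples it scores near 0.8 are positive far less than 80% of the time. The probability range and the temperature of 0.5 are illustrative assumptions, not values from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: true probabilities, and labels drawn from them.
p_true = rng.uniform(0.05, 0.95, size=100_000)
y = rng.binomial(1, p_true)

# An overconfident model: identical ranking to p_true, but scores
# sharpened in logit space (temperature 0.5 is an arbitrary choice).
logits = np.log(p_true / (1 - p_true))
p_model = 1 / (1 + np.exp(-logits / 0.5))

# Examples the model scores near 0.8 are positive only ~67% of the time.
bucket = (p_model > 0.75) & (p_model < 0.85)
print(f"mean model score in bucket: {p_model[bucket].mean():.3f}")  # ~0.80
print(f"observed positive rate:     {y[bucket].mean():.3f}")        # ~0.67
```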
For systems that use scores as inputs to downstream decisions, this matters enormously:
- Ad bidding: bid price = P(click) × P(conversion|click) × advertiser_value. If P(click) is 2× the true probability, the system overbids on every auction and bleeds money (a worked sketch follows this list). Google's ad team found that a 1% miscalibration in CTR prediction translates to ~$100M/year in suboptimal bids at scale.
- Fraud scoring: if the model's output at threshold 0.9 corresponds to a true precision of only 0.7, then 30% of the users the system blocks are actually legitimate customers it believes are fraudulent: a severe trust and revenue problem.
- Expected revenue optimization: ranker score = P(click) × P(buy|click) × price. Even if each model ranks well in isolation, multiplying two miscalibrated probabilities can reorder items relative to true expected revenue, degrading revenue ordering independently of either model's ranking quality.
- Risk-adjusted lending: credit score must map to a probability of default for regulatory compliance. Raw model scores are not probabilities; calibration is legally required.
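To make the ad-bidding bullet concrete, here is a back-of-the-envelope sketch of the overbidding effect. All numbers (click rate, conversion rate, advertiser value) are invented for illustration.

```python
# Hypothetical auction economics; every constant here is an assumption.
p_click_true = 0.02         # the real click-through probability
p_click_model = 0.04        # model score reads 2x too high
p_conv_given_click = 0.10   # conversion rate given a click
advertiser_value = 50.0     # dollars per conversion

def bid(p_click: float) -> float:
    """Expected-value bid: P(click) * P(conversion|click) * value."""
    return p_click * p_conv_given_click * advertiser_value

print(f"correct bid: ${bid(p_click_true):.3f}")   # $0.100
print(f"actual bid:  ${bid(p_click_model):.3f}")  # $0.200
# The system pays up to 2x the true expected value on every auction this
# model touches, even though the *ranking* of ads may be unchanged.
```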
The counterintuitive insight: you can have a model with perfect ranking ability (AUC = 1.0) that is completely uncalibrated. AUC only measures whether the model scores positive examples above negative ones across all positive/negative pairs; it says nothing about whether a score of 0.7 actually means 70% of those examples are positive. This is the gap that calibration fills.
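A tiny demonstration of that gap, using scikit-learn's roc_auc_score on contrived labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Every positive outscores every negative, so ranking is perfect...
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67])
print(roc_auc_score(y_true, scores))  # 1.0

# ...yet the scores are wildly miscalibrated: the model says "~60%" for
# examples that are 0% positive and "~65%" for ones that are 100%.
# AUC is invariant to any monotone transform of the scores; calibration
# is not, which is why the two can diverge completely.
```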