ML Monitoring & Drift Detection: Keeping Models Healthy in Production
Production ML models fail silently. This guide covers the three-layer monitoring stack (data drift, concept drift, output drift), PSI thresholds, KL divergence, distinguishing data drift from concept drift (they require different fixes), and how to build retraining triggers that aren't noisy.
Why Models Degrade — The Three Root Causes
A model deployed today will be wrong in 6 months. Not because the model was badly built, but because the world changes while the model stays fixed. Understanding the three root causes determines what you monitor and how you fix it.
Data drift (input distribution shift): The distribution of input features at serving time diverges from the training distribution. The model was trained on data from users aged 25–34 who primarily used desktop. Now 60% of traffic is from mobile users aged 18–24. Feature distributions like session_duration, device_type, and screen_size have shifted. The model is extrapolating to out-of-distribution inputs. The underlying relationship between features and labels hasn't changed — the model just hasn't seen this type of input.
Fix: Retrain on data that includes the new distribution. The model's architecture and feature set are fine.
Concept drift (target distribution shift): The statistical relationship between inputs and the correct output has changed. A fraud model trained on 2022 fraud patterns becomes less accurate as fraudsters adopt new techniques in 2023. The input features (transaction amounts, merchant types) may look similar, but what constitutes fraud has changed. The feature distributions may look stable, but model accuracy drops.
Fix: Relabeling required. Retraining on new data with updated labels. Sometimes the feature set itself needs rethinking.
Model staleness (domain shift over time): Gradual changes in the world — seasonality, product launches, user behavior evolution — make a model that was calibrated for last year increasingly misaligned. The model isn't 'wrong' in the sense of concept drift, but its weight assignments no longer reflect current reality.
Fix: Regular retraining cadence (weekly or monthly) to keep model weights current.
Setting Up ML Monitoring — What to Instrument and When to Alert
Layer 1: Infrastructure health (alert immediately)
Monitor: prediction service latency (P50, P95, P99), error rate (HTTP 5xx, model inference failures), feature pipeline lag (how stale are features?), model serving throughput. These are table stakes — if the model isn't serving, nothing else matters. Alert threshold: latency P99 > 2× baseline OR error rate > 0.1% for 5 minutes.
Layer 2: Input feature drift (alert within 1 hour)
Compute PSI (Population Stability Index) for every feature daily: compare today's feature distribution vs the training distribution. PSI < 0.1 = stable. PSI 0.1–0.25 = monitor. PSI > 0.25 = alert and investigate before the next model update. Critical features (high SHAP importance) should have tighter thresholds. Do not alert on all features simultaneously — rank by feature importance and tier alert urgency accordingly.
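A minimal sketch of that tiering, assuming per-feature PSI values and a feature-importance ranking (for example, by mean |SHAP|) are computed upstream; the cutoffs and the top-10 split are illustrative, not a standard:

```python
def psi_alert_tiers(psi_by_feature: dict, importance_rank: dict) -> dict:
    """Assign an alert tier per feature, with tighter PSI thresholds for important features.

    psi_by_feature: feature name -> today's PSI vs. the training baseline
    importance_rank: feature name -> rank by importance (1 = most important, e.g. mean |SHAP|)
    """
    tiers = {}
    for feature, psi in psi_by_feature.items():
        is_top = importance_rank.get(feature, 10**6) <= 10   # illustrative "top 10" cutoff
        if psi >= (0.20 if is_top else 0.25):
            tiers[feature] = "critical" if is_top else "warning"
        elif psi >= 0.10:
            tiers[feature] = "warning" if is_top else "info"
        else:
            tiers[feature] = "info"
    return tiers
```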
Layer 3: Prediction distribution drift (alert within 4 hours)
Monitor the model's output score distribution: histogram of predicted probabilities or regression values, mean and variance of scores. If the average fraud score suddenly drops from 0.12 to 0.04 while transaction volume is stable, something changed upstream. This is faster to detect than waiting for ground truth labels, which may take days to weeks.
Layer 4: Business metric monitoring (daily review)
Track the metrics the model was built to improve: fraud rate, recommendation CTR, ETA accuracy, conversion rate. Compare rolling 7-day averages week-over-week. These require human review — a business metric change has many possible causes (seasonality, product changes, model drift). Set statistical process control (SPC) control limits and flag when metrics exceed 2σ from the rolling baseline.
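A minimal sketch of those control limits with pandas, assuming a date-indexed Series of the daily business metric; the 7-day smoothing and 28-day baseline window are illustrative choices:

```python
import pandas as pd

def spc_breaches(daily_metric: pd.Series, window: int = 28, sigma: float = 2.0) -> pd.Series:
    """Flag days where the rolling 7-day average leaves the +/- sigma control band."""
    weekly = daily_metric.rolling(7).mean()            # smooth out day-of-week effects
    center = weekly.shift(1).rolling(window).mean()    # baseline from prior days only, so today can't mask itself
    spread = weekly.shift(1).rolling(window).std()
    return (weekly - center).abs() > sigma * spread    # True = metric outside the control band
```

Breached days should feed the daily human review rather than page anyone, since business metrics move for many non-model reasons.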
Retraining trigger policy: scheduled vs triggered
Default: scheduled weekly retraining regardless of drift signals. Triggered (faster): if PSI for any high-importance feature exceeds 0.25 OR business metric drops >5% week-over-week → trigger emergency retraining within 24 hours. Online fine-tuning (warm-start from current weights, 20-50 gradient steps): for models with rapid-feedback signals (demand forecasting, CTR) that can adapt within hours of detected drift.
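A hypothetical decision function combining the two policies; the signal names are illustrative, but the thresholds mirror the numbers above:

```python
from dataclasses import dataclass

@dataclass
class DriftSignals:
    days_since_last_train: int
    max_psi_top_features: float   # worst PSI among high-importance features
    metric_wow_change: float      # business metric week-over-week change, e.g. -0.07 = -7%

def retraining_decision(s: DriftSignals) -> str:
    """Return which retraining path to take: emergency, scheduled, or none."""
    if s.max_psi_top_features > 0.25 or s.metric_wow_change < -0.05:
        return "emergency_retrain"   # triggered path: retrain within 24 hours
    if s.days_since_last_train >= 7:
        return "scheduled_retrain"   # default weekly cadence
    return "no_action"
```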
[Diagram: the three-layer ML monitoring stack, covering input (data) drift, output drift, and concept drift]
PSI — The Standard Feature Drift Metric
Population Stability Index (PSI) is the most widely used metric for detecting feature distribution drift. It's symmetric (unlike KL divergence), simple to compute, and produces interpretable thresholds.
Formula: PSI = Σ (P_current - P_reference) × ln(P_current / P_reference)
Where P_current and P_reference are the proportions in each bin of the feature distribution for the current production window vs the reference (training data).
Interpretation thresholds (industry standard, from finance/credit industry):
- PSI < 0.10: No significant shift. Model is stable. No action required.
- 0.10 ≤ PSI < 0.25: Moderate shift. Monitor more closely. Investigate if combined with performance degradation.
- PSI ≥ 0.25: Significant shift. Model is likely operating outside its training distribution. Trigger retraining review.
Implementation:
```python
import numpy as np

def compute_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a current (production) sample."""
    # Equal-frequency bin boundaries taken from the reference distribution
    breakpoints = np.percentile(reference, np.linspace(0, 100, bins + 1))
    # Clip current values into the reference range so out-of-range values land in the edge bins
    current = np.clip(current, breakpoints[0], breakpoints[-1])
    ref_props = np.histogram(reference, bins=breakpoints)[0] / len(reference)
    cur_props = np.histogram(current, bins=breakpoints)[0] / len(current)
    # Floor the proportions with a small epsilon to avoid log(0) on empty bins
    ref_props = np.clip(ref_props, 1e-6, None)
    cur_props = np.clip(cur_props, 1e-6, None)
    return float(np.sum((cur_props - ref_props) * np.log(cur_props / ref_props)))
```
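The bin edges come from the reference sample (equal-frequency bins), so drift shows up as production mass piling into a few bins; the clipping step is a pragmatic guard so that production values outside the training range still count toward the edge bins instead of being dropped.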
Limitation of PSI: PSI measures each feature independently. It cannot detect multivariate drift — where individual features appear stable but their joint distribution has changed. For high-dimensional models, complement PSI with a 'drift classification model': train a binary classifier to distinguish training data from current production data. High AUC of this classifier → multivariate drift exists.
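A sketch of that drift-classification check with scikit-learn; the choice of gradient boosting and the 0.7 rule of thumb in the comment are assumptions, not a fixed standard:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def multivariate_drift_auc(train_X: np.ndarray, prod_X: np.ndarray) -> float:
    """Cross-validated AUC of a classifier asked to separate training rows from production rows."""
    X = np.vstack([train_X, prod_X])
    y = np.concatenate([np.zeros(len(train_X)), np.ones(len(prod_X))])
    # AUC near 0.5: joint distributions are indistinguishable.
    # AUC well above 0.5 (say > 0.7): multivariate drift, even if each per-feature PSI looks stable.
    return float(cross_val_score(GradientBoostingClassifier(), X, y, cv=5, scoring="roc_auc").mean())
```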
Alternative metrics for specific feature types (a combined SciPy sketch follows this list):
- Continuous features: Kolmogorov-Smirnov (K-S) test (returns p-value), Wasserstein distance
- Categorical features: Chi-squared test, Jensen-Shannon divergence
- High-cardinality categorical: % of new categories in production that didn't appear in training
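A combined sketch of those checks in SciPy, run on synthetic reference and production samples so it stands alone; the specific distributions are only there to make the snippet runnable:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
ref_vals, cur_vals = rng.normal(0, 1, 5000), rng.normal(0.3, 1.2, 5000)    # continuous feature
ref_cats = rng.choice(["a", "b", "c"], 5000, p=[0.6, 0.3, 0.1])            # categorical feature
cur_cats = rng.choice(["a", "b", "c", "d"], 5000, p=[0.5, 0.3, 0.1, 0.1])  # note the new category "d"

# Continuous: two-sample K-S test (returns a p-value) and Wasserstein distance
ks_stat, ks_pvalue = stats.ks_2samp(ref_vals, cur_vals)
w_dist = stats.wasserstein_distance(ref_vals, cur_vals)

# Categorical: Jensen-Shannon distance between the two category-proportion vectors
categories = sorted(set(ref_cats) | set(cur_cats))
p = np.array([np.mean(ref_cats == c) for c in categories])
q = np.array([np.mean(cur_cats == c) for c in categories])
js_distance = jensenshannon(p, q)   # 0 = identical distributions

# High-cardinality categorical: share of production values never seen in training
new_category_rate = np.mean(~np.isin(cur_cats, np.unique(ref_cats)))
```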
Concept Drift — The Hard Problem
Concept drift is harder to detect than data drift because it requires labels, which often arrive with delay. A fraud model might be suffering concept drift for 45 days before you know it — because fraud chargebacks take that long to arrive.
Strategy 1 — Delayed label monitoring: For systems with delayed labels (fraud, conversion), batch the evaluation: every Monday, pull labels for events from 45 days ago. Compute Precision@K and AUC-PR on that week's data. Compare to the model's performance at deployment time. If AUC-PR drops > 3% from baseline: trigger retraining review.
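A minimal sketch of that weekly check, using scikit-learn's average_precision_score for AUC-PR; treating the 3% as an absolute drop is an assumption here, and a relative drop works just as well:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def delayed_label_check(y_true: np.ndarray, y_score: np.ndarray,
                        baseline_auc_pr: float, max_drop: float = 0.03) -> bool:
    """Score the cohort whose labels just matured (e.g. events from ~45 days ago).

    Returns True when AUC-PR has fallen more than `max_drop` below the
    deployment-time baseline, i.e. a retraining review should be opened.
    """
    auc_pr = average_precision_score(y_true, y_score)
    return (baseline_auc_pr - auc_pr) > max_drop
```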
Strategy 2 — Proxy label monitoring: For systems with slow labels, identify fast-moving proxy labels that correlate. For e-commerce fraud: chargebacks take 30 days, but customer disputes take 7 days, and merchant blocks often happen within 24 hours. Monitor merchant block rate as a leading indicator. If merchant blocks spike, fraud is rising — retrain before chargebacks confirm it.
Strategy 3 — Behavioral testing / ChallengeSet evaluation: Maintain a handcrafted test set of representative inputs with known correct outputs, updated by domain experts. Evaluate the model on this test set weekly. When a new fraud pattern emerges, add test cases. If the model fails on new test cases before labels arrive in production, trigger retraining.
Strategy 4 — Output-based drift detection: Even without labels, the model's output distribution can signal concept drift. If a fraud model's average predicted fraud probability suddenly drops (model is less certain → fewer flags) while actual fraud volumes (from merchant reports) are stable, the model is becoming underconfident on current data patterns. This triggers an investigation even before labels arrive.
The difference matters for the fix: Data drift → retrain on new distribution, keep the same feature set and architecture. Concept drift → investigate whether the feature set captures the new patterns; may require new features or a fundamentally different model architecture.
Retraining Strategy — When and How to Retrain
| Trigger Type | When to Use | Retrain Latency | Risk |
|---|---|---|---|
| Scheduled (weekly) | Fast-changing domains: news, trending content, real-time ads | Predictable, planned | Low — model is refreshed regardless of drift |
| Scheduled (monthly) | Slower-changing domains: product recommendations, user preferences | Planned, lower compute cost | Moderate — may miss fast drift between cycles |
| PSI threshold (PSI > 0.25 on > 20% of features) | Feature distribution shift detected | Reactive — hours to days | Low if monitoring is reliable; alert fatigue risk if threshold too sensitive |
| Performance threshold (AUC drops > 3%) | Direct model degradation detected via delayed labels | Reactive — triggered when degradation confirmed | Higher — may already be degraded for days before labels arrive |
| Event-based (product launch, seasonal shift) | Major product change or known distribution shift | Proactive — before drift manifests | Low — prepared in advance |
| Continuous online learning | Real-time or near-real-time model updates with streaming data | Immediate | High — risk of catastrophic forgetting, requires careful guardrails |
Alert Fatigue — Why Most Monitoring Systems Fail
A monitoring system that fires 50 alerts per day will be ignored within two weeks. Teams develop alert blindness and stop responding — until a real crisis is missed.
Design principles for effective ML monitoring alerts:
- Severity tiers: Info (log and review weekly), Warning (investigate within 48 hours), Critical (page someone now). Not every PSI > 0.25 is a Critical — context matters. A PSI of 0.28 on a secondary feature in a stable, low-stakes model is a Warning. A PSI of 0.35 on the primary feature of a fraud model is Critical.
- Sliding window baselines: Don't compare against the original training distribution forever. Compare against a rolling 30-day production baseline. This naturally adapts to seasonal drift and reduces false positives from gradual, acceptable shifts.
- Anomaly detection, not fixed thresholds: Instead of 'alert if PSI > 0.25,' use 'alert if current PSI is > 3 standard deviations above the 30-day rolling mean PSI.' This self-calibrates and fires on genuine anomalies rather than persistent but stable shifts.
- Combine signals before alerting: Don't alert on any single metric crossing a threshold. Alert when PSI > 0.25 AND the performance proxy is degrading AND the output distribution has shifted. The combination of signals is more reliable than any single one (see the sketch after this list).
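A sketch of the last two principles; the 30-day window and 3-sigma multiplier follow the text above, while the function and signal names are illustrative:

```python
import numpy as np

def psi_is_anomalous(psi_history: np.ndarray, psi_today: float,
                     window: int = 30, k: float = 3.0) -> bool:
    """Self-calibrating threshold: fire only if today's PSI sits k std devs above its own rolling mean."""
    recent = psi_history[-window:]
    return psi_today > recent.mean() + k * recent.std()

def should_page(psi_anomalous: bool, proxy_degrading: bool, output_shifted: bool) -> bool:
    """Page only when drift coincides with a degrading performance proxy and a shifted output distribution."""
    return psi_anomalous and proxy_degrading and output_shifted
```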
Interview Framework: Monitoring Design
When designing monitoring for any ML system in an interview, cover three layers explicitly:
- Data drift (input): 'I'd compute PSI daily for all features against the training baseline. Alert at PSI > 0.25. Also monitor null rates per feature — a sudden increase in nulls is a schema change or data pipeline failure, not drift.'
- Output drift: 'Monitor the model's score distribution daily — mean, P50, P90 of predicted scores. Alert if the distribution shifts > 2 standard deviations from the 30-day rolling average. This catches model degradation before labels arrive.'
- Concept drift (performance with delayed labels): 'For fraud, we get ground truth in 30–90 days. Run a weekly evaluation job on events from 45 days ago and track AUC-PR vs the deployment baseline. Alert if it drops > 3%. In parallel, track proxy metrics like dispute rates and merchant blocks as leading indicators.'
Then specify your retraining triggers: scheduled (weekly for fast-changing domains), threshold-based (PSI > 0.25 + performance degradation), and event-based (known product launches, seasonal shifts).