ML System Design·Intermediate

ML Monitoring & Drift Detection: Keeping Models Healthy in Production

Production ML models fail silently. This guide covers the three-layer monitoring stack (data drift, concept drift, output drift), PSI thresholds, KL divergence, distinguishing data drift from concept drift (they require different fixes), and how to build retraining triggers that aren't noisy.

30 min read 8 sections 6 interview questions
ML Monitoring · Data Drift · Concept Drift · PSI · KL Divergence · Model Degradation · Retraining · Evidently AI · Feature Monitoring · Output Distribution · Feedback Loop · Label Delay · MLOps

Why Models Degrade — The Three Root Causes

A model deployed today will likely be measurably worse in six months. Not because it was badly built, but because the world changes while the model stays fixed. Understanding the three root causes of degradation determines what you monitor and how you fix it.

Data drift (input distribution shift): The distribution of input features at serving time diverges from the training distribution. The model was trained on data from users aged 25–34 who primarily used desktop. Now 60% of traffic is from mobile users aged 18–24. Feature distributions like session_duration, device_type, and screen_size have shifted. The model is extrapolating to out-of-distribution inputs. The underlying relationship between features and labels hasn't changed — the model just hasn't seen this type of input.

Fix: Retrain on data that includes the new distribution. The model's architecture and feature set are fine.

Concept drift (target distribution shift): The statistical relationship between inputs and the correct output has changed. A fraud model trained on 2022 fraud patterns becomes less accurate as fraudsters adopt new techniques in 2023. The input features (transaction amounts, merchant types) look similar and their distributions may appear stable, yet accuracy drops, because what constitutes fraud has changed.

Fix: Relabeling required. Retraining on new data with updated labels. Sometimes the feature set itself needs rethinking.

Model staleness (domain shift over time): Gradual changes in the world — seasonality, product launches, user behavior evolution — make a model that was calibrated for last year increasingly misaligned. The model isn't 'wrong' in the sense of concept drift, but its weight assignments no longer reflect current reality.

Fix: Regular retraining cadence (weekly or monthly) to keep model weights current.

Setting Up ML Monitoring — What to Instrument and When to Alert

01

Layer 1: Infrastructure health (alert immediately)

Monitor: prediction service latency (P50, P95, P99), error rate (HTTP 5xx, model inference failures), feature pipeline lag (how stale are features?), model serving throughput. These are table stakes — if the model isn't serving, nothing else matters. Alert threshold: latency P99 > 2× baseline OR error rate > 0.1% for 5 minutes.
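As a sketch, the thresholds above can be encoded as a simple alert rule. The snapshot fields and function names here are illustrative, not from any particular monitoring library:

```python
from dataclasses import dataclass

@dataclass
class InfraSnapshot:
    p99_latency_ms: float   # P99 latency over the evaluation window
    error_rate: float       # fraction of failed requests (5xx + inference errors)
    window_minutes: int     # how long the condition has persisted

def infra_alert(snapshot: InfraSnapshot, baseline_p99_ms: float) -> bool:
    """Critical alert: P99 > 2x baseline, or error rate > 0.1% sustained for 5 min."""
    latency_breach = snapshot.p99_latency_ms > 2 * baseline_p99_ms
    error_breach = snapshot.error_rate > 0.001 and snapshot.window_minutes >= 5
    return latency_breach or error_breach
```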

02

Layer 2: Input feature drift (alert within 1 hour)

Compute PSI (Population Stability Index) for every feature daily: compare today's feature distribution vs the training distribution. PSI < 0.1 = stable. PSI 0.1–0.25 = monitor. PSI > 0.25 = alert and investigate before the next model update. Critical features (high SHAP importance) should have tighter thresholds. Do not alert on all features simultaneously — rank by feature importance and tier alert urgency accordingly.

03

Layer 3: Prediction distribution drift (alert within 4 hours)

Monitor the model's output score distribution: histogram of predicted probabilities or regression values, mean and variance of scores. If the average fraud score suddenly drops from 0.12 to 0.04 while transaction volume is stable, something changed upstream. This is faster to detect than waiting for ground truth labels, which may take days to weeks.
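One way to operationalize this is a z-test on the current window's mean score against a rolling baseline. This is a minimal sketch; the function name and the 3-standard-error threshold are assumptions, not a standard API:

```python
import numpy as np

def output_drift_alert(scores: np.ndarray, baseline_mean: float,
                       baseline_std: float, z_threshold: float = 3.0) -> bool:
    """Alert when the current window's mean score sits more than z_threshold
    standard errors away from the rolling baseline mean."""
    standard_error = baseline_std / np.sqrt(len(scores))
    z = abs(scores.mean() - baseline_mean) / standard_error
    return bool(z > z_threshold)
```

On a thousand predictions, a drop in average fraud score from 0.12 to 0.04 fires immediately, while day-to-day noise around the baseline does not.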

04

Layer 4: Business metric monitoring (daily review)

Track the metrics the model was built to improve: fraud rate, recommendation CTR, ETA accuracy, conversion rate. Compare rolling 7-day averages week-over-week. These require human review — a business metric change has many possible causes (seasonality, product changes, model drift). Set statistical process control (SPC) control limits and flag when metrics exceed 2σ from the rolling baseline.
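A minimal sketch of the 2σ SPC check, assuming you keep a short history of the rolling 7-day averages (names are illustrative):

```python
import numpy as np

def spc_flag(weekly_baseline: list[float], current_week: float, sigma: float = 2.0) -> bool:
    """Flag the current 7-day average when it falls outside +/- sigma
    control limits built from prior weeks' averages."""
    baseline = np.asarray(weekly_baseline, dtype=float)
    center = baseline.mean()
    spread = baseline.std(ddof=1)  # sample standard deviation
    return bool(abs(current_week - center) > sigma * spread)
```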

05

Retraining trigger policy: scheduled vs triggered

Default: scheduled weekly retraining regardless of drift signals. Triggered (faster): if PSI for any high-importance feature exceeds 0.25 OR business metric drops >5% week-over-week → trigger emergency retraining within 24 hours. Online fine-tuning (warm-start from current weights, 20-50 gradient steps): for models with rapid-feedback signals (demand forecasting, CTR) that can adapt within hours of detected drift.
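The policy above reduces to a small decision function. This is a sketch under the stated thresholds; the high-importance feature set and the metric convention (week-over-week change expressed as a fraction) are assumptions:

```python
def retrain_decision(psi_by_feature: dict[str, float],
                     high_importance: set[str],
                     metric_wow_change: float) -> str:
    """Return 'emergency' (retrain within 24h) or 'scheduled' (wait for the
    weekly run), per the trigger policy: PSI > 0.25 on any high-importance
    feature, or a business-metric drop of more than 5% week-over-week."""
    psi_breach = any(psi_by_feature.get(f, 0.0) > 0.25 for f in high_importance)
    metric_breach = metric_wow_change < -0.05
    return "emergency" if psi_breach or metric_breach else "scheduled"
```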

Three-Layer ML Monitoring Stack


PSI — The Standard Feature Drift Metric

Population Stability Index (PSI) is the most widely used metric for detecting feature distribution drift. It's symmetric (unlike KL divergence), simple to compute, and produces interpretable thresholds.

Formula: PSI = Σ (P_current - P_reference) × ln(P_current / P_reference)

Where P_current and P_reference are the proportions in each bin of the feature distribution for the current production window vs the reference (training data).

Interpretation thresholds (industry standard, from finance/credit industry):

  • PSI < 0.10: No significant shift. Model is stable. No action required.
  • 0.10 ≤ PSI < 0.25: Moderate shift. Monitor more closely. Investigate if combined with performance degradation.
  • PSI ≥ 0.25: Significant shift. Model is likely operating outside its training distribution. Trigger retraining review.

Implementation:

import numpy as np

def compute_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a current production sample."""
    # Equal-frequency bin edges from the reference distribution; drop duplicate
    # edges that ties in the reference data would otherwise produce.
    breakpoints = np.unique(np.percentile(reference, np.linspace(0, 100, bins + 1)))
    # Open the outer bins so production values outside the reference range
    # are counted rather than silently dropped.
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
    ref_props = np.histogram(reference, breakpoints)[0] / len(reference)
    cur_props = np.histogram(current, breakpoints)[0] / len(current)
    # Clip empty bins to a small epsilon to avoid log(0) and division by zero
    ref_props = np.clip(ref_props, 1e-6, None)
    cur_props = np.clip(cur_props, 1e-6, None)
    return float(np.sum((cur_props - ref_props) * np.log(cur_props / ref_props)))

Limitation of PSI: PSI measures each feature independently. It cannot detect multivariate drift — where individual features appear stable but their joint distribution has changed. For high-dimensional models, complement PSI with a 'drift classification model': train a binary classifier to distinguish training data from current production data. High AUC of this classifier → multivariate drift exists.
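The drift classifier idea can be sketched without any ML framework: train a small logistic regression to separate reference rows from production rows and read off its AUC. This toy implementation (plain NumPy gradient descent, rank-based AUC) is illustrative; in practice you would use a library classifier:

```python
import numpy as np

def drift_classifier_auc(reference: np.ndarray, current: np.ndarray,
                         lr: float = 0.1, steps: int = 500) -> float:
    """Train a 'domain classifier' to tell reference rows (label 0) from
    current production rows (label 1). AUC near 0.5 means the joint
    distributions look alike; AUC near 1.0 means multivariate drift."""
    X = np.vstack([reference, current]).astype(float)
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)   # standardize features
    X = np.hstack([X, np.ones((len(X), 1))])            # bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):                              # batch gradient descent
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    scores = X @ w
    ranks = scores.argsort().argsort() + 1              # 1-based ranks
    n1, n0 = int(y.sum()), int(len(y) - y.sum())
    u = ranks[y == 1].sum() - n1 * (n1 + 1) / 2         # Mann-Whitney U
    return float(u / (n1 * n0))
```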

Alternative metrics for specific feature types:

  • Continuous features: Kolmogorov-Smirnov (K-S) test (returns p-value), Wasserstein distance
  • Categorical features: Chi-squared test, Jensen-Shannon divergence
  • High-cardinality categorical: % of new categories in production that didn't appear in training
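Two of these alternatives fit in a few lines of NumPy. The K-S statistic here is just the maximum gap between the two empirical CDFs (a full test would also compute a p-value, e.g. via scipy.stats.ks_2samp); the second function is the high-cardinality new-category heuristic. Function names are illustrative:

```python
import numpy as np

def ks_statistic(reference: np.ndarray, current: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the two empirical CDFs."""
    grid = np.sort(np.concatenate([reference, current]))
    cdf_ref = np.searchsorted(np.sort(reference), grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(np.sort(current), grid, side="right") / len(current)
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

def new_category_share(reference: list, current: list) -> float:
    """Fraction of production values whose category never appeared in training."""
    seen = set(reference)
    return sum(1 for v in current if v not in seen) / len(current)
```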

Concept Drift — The Hard Problem

Concept drift is harder to detect than data drift because it requires labels, which often arrive with delay. A fraud model might be suffering concept drift for 45 days before you know it — because fraud chargebacks take that long to arrive.

Strategy 1 — Delayed label monitoring: For systems with delayed labels (fraud, conversion), batch the evaluation: every Monday, pull labels for events from 45 days ago. Compute Precision@K and AUC-PR on that week's data and compare against the model's performance at deployment time. If AUC-PR drops more than 3% from baseline, trigger a retraining review.

Strategy 2 — Proxy label monitoring: For systems with slow labels, identify fast-moving proxy labels that correlate. For e-commerce fraud: chargebacks take 30 days, but customer disputes take 7 days, and merchant blocks often happen within 24 hours. Monitor merchant block rate as a leading indicator. If merchant blocks spike, fraud is rising — retrain before chargebacks confirm it.

Strategy 3 — Behavioral testing / ChallengeSet evaluation: Maintain a handcrafted test set of representative inputs with known correct outputs, updated by domain experts. Evaluate the model on this test set weekly. When a new fraud pattern emerges, add test cases. If the model fails on new test cases before labels arrive in production, trigger retraining.

Strategy 4 — Output-based drift detection: Even without labels, the model's output distribution can signal concept drift. If a fraud model's average predicted fraud probability suddenly drops (model is less certain → fewer flags) while actual fraud volumes (from merchant reports) are stable, the model is becoming underconfident on current data patterns. This triggers an investigation even before labels arrive.

The difference matters for the fix: Data drift → retrain on new distribution, keep the same feature set and architecture. Concept drift → investigate whether the feature set captures the new patterns; may require new features or a fundamentally different model architecture.

Retraining Strategy — When and How to Retrain

| Trigger type | When to use | Retrain latency | Risk |
| --- | --- | --- | --- |
| Scheduled (weekly) | Fast-changing domains: news, trending content, real-time ads | Predictable, planned | Low — model is refreshed regardless of drift |
| Scheduled (monthly) | Slower-changing domains: product recommendations, user preferences | Planned, lower compute cost | Moderate — may miss fast drift between cycles |
| PSI threshold (PSI > 0.25 on > 20% of features) | Feature distribution shift detected | Reactive — hours to days | Low if monitoring is reliable; alert fatigue risk if threshold too sensitive |
| Performance threshold (AUC drops > 3%) | Direct model degradation detected via delayed labels | Reactive — triggered when degradation confirmed | Higher — may already be degraded for days before labels arrive |
| Event-based (product launch, seasonal shift) | Major product change or known distribution shift | Proactive — before drift manifests | Low — prepared in advance |
| Continuous online learning | Real-time or near-real-time model updates with streaming data | Immediate | High — risk of catastrophic forgetting, requires careful guardrails |
⚠ WARNING

Alert Fatigue — Why Most Monitoring Systems Fail

A monitoring system that fires 50 alerts per day will be ignored within two weeks. Teams develop alert blindness and stop responding — until a real crisis is missed.

Design principles for effective ML monitoring alerts:

  1. Severity tiers: Info (log and review weekly), Warning (investigate within 48 hours), Critical (page someone now). Not every PSI > 0.25 is a Critical — context matters. A PSI of 0.28 on a secondary feature in a stable, low-stakes model is a Warning. A PSI of 0.35 on the primary feature of a fraud model is Critical.

  2. Sliding window baselines: Don't compare against the original training distribution forever. Compare against a rolling 30-day production baseline. This naturally adapts to gradual seasonal drift and reduces false positives from gradual, acceptable shifts.

  3. Anomaly detection, not fixed thresholds: Instead of 'alert if PSI > 0.25,' use 'alert if current PSI is > 3 standard deviations above the 30-day rolling mean PSI.' This self-calibrates and fires on genuine anomalies rather than persistent but stable shifts.

  4. Combine signals before alerting: Don't alert on any single metric crossing a threshold. Alert when: PSI > 0.25 AND performance proxy is degrading AND output distribution has shifted. The combination of signals is more reliable than any single one.
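The self-calibrating threshold from principle 3 is a one-liner in code. A sketch, where the window length and the 3σ multiplier are the assumptions stated above:

```python
import numpy as np

def psi_anomaly(psi_history: list[float], today_psi: float, k: float = 3.0) -> bool:
    """Self-calibrating alert: fire only when today's PSI exceeds the rolling
    mean of the trailing window (e.g. 30 daily values) by more than k
    standard deviations."""
    history = np.asarray(psi_history, dtype=float)
    return bool(today_psi > history.mean() + k * history.std(ddof=1))
```

A feature that sits persistently at an elevated but stable PSI never fires; a sudden jump off that plateau does.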

TIP

Interview Framework: Monitoring Design

When designing monitoring for any ML system in an interview, cover three layers explicitly:

  1. Data drift (input): 'I'd compute PSI daily for all features against the training baseline. Alert at PSI > 0.25. Also monitor null rates per feature — a sudden increase in nulls is a schema change or data pipeline failure, not drift.'

  2. Output drift: 'Monitor the model's score distribution daily — mean, P50, P90 of predicted scores. Alert if the distribution shifts > 2 standard deviations from the 30-day rolling average. This catches model degradation before labels arrive.'

  3. Concept drift (performance with delayed labels): 'For fraud, we get ground truth in 30–90 days. Run a weekly evaluation job on events from 45 days ago and track AUC-PR vs deployment baseline. Alert if it drops > 3%. In parallel, track proxy metrics like dispute rates and merchant blocks as leading indicators.'

Then specify your retraining triggers: scheduled (weekly for fast-changing domains), threshold-based (PSI > 0.25 + performance degradation), and event-based (known product launches, seasonal shifts).
