ML Model Evaluation & Production Monitoring: Shadow Mode, A/B Testing & Rollback
Production ML evaluation is fundamentally different from offline evaluation. Covers shadow deployment, champion-challenger A/B testing, canary rollouts, SLO design for ML systems, rollback triggers, and the metrics that reveal model degradation before users notice. The end-to-end playbook for safely deploying and monitoring ML models.
Why Production ML Evaluation Is Different
Deploying a new ML model is not like deploying a new API endpoint. A model that improves AUC by 3% offline can decrease revenue by 2% online. A model that passes all CI checks can silently degrade 6 weeks later as the world changes. The evaluation problem doesn't end at deployment — it begins there.
Three facts make ML deployment uniquely dangerous:
- Offline metrics don't predict online impact. AUC, NDCG, and F1 are computed on a static held-out set. Online, users respond dynamically — exposure to a new recommendation changes what they click next, which changes the training data, which changes the next model. The feedback loop is absent in offline evaluation.
- Model failures are often silent. A failing API returns a 500 error. A degraded model returns a confident but wrong prediction — no error, no alarm, just quietly worse decisions. The system keeps running while business metrics erode.
- The right model for yesterday may be wrong today. Distribution shift means a model's accuracy degrades over time even without any code changes. What you shipped 3 months ago needs to be actively maintained, not just monitored.
The solution: A layered deployment and monitoring strategy — shadow mode before exposure, canary before full traffic, A/B test before commit, and continuous monitoring after.
What Earns Each Level on This Topic
6/10: Knows A/B testing exists. Mentions "compare the new model to the old one." Describes offline evaluation metrics.
8/10: Describes shadow deployment before A/B, canary rollout with traffic gating, and rollback triggers. Mentions metric hierarchy (business → proxy → technical).
10/10: Designs the full evaluation lifecycle — shadow mode → canary → A/B with proper power analysis → full rollout. Explains SLO design for ML (what's the SLO on model accuracy? how do you measure it in real-time without ground truth?). Discusses champion-challenger continuous evaluation in production. Names specific tools (Evidently AI, Seldon, MLflow, SageMaker Model Monitor). Addresses label delay problem: how do you evaluate a fraud model when true fraud labels arrive 30 days later?
The ML Deployment Evaluation Lifecycle
Phase 1 — Shadow Deployment (0% user impact)
Run the new model in parallel with the champion model on 100% of real traffic. The shadow model's predictions are logged but never served. Compare: prediction distribution (is the new model scoring differently?), latency (does it fit the SLA?), feature pipeline dependencies (are all features available?). Duration: 24–72 hours. Gate: shadow latency P99 < 200ms AND prediction distribution stable. Cost: 2× serving compute while shadow runs.
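A minimal sketch of the shadow-phase gate, assuming both models' scores and the shadow model's latencies have already been logged for the same requests; the two-sample KS test stands in for whatever distribution comparison your logging stack supports:

```python
import numpy as np
from scipy import stats

def shadow_gate(champion_scores, shadow_scores, shadow_latencies_ms,
                latency_p99_slo_ms=200.0, max_ks_statistic=0.1):
    """Decide whether the shadow model may proceed to canary."""
    # Gate 1: latency. The shadow model's P99 must fit the serving SLA.
    p99 = float(np.percentile(shadow_latencies_ms, 99))
    latency_ok = p99 < latency_p99_slo_ms

    # Gate 2: prediction distribution. With millions of logged requests the
    # KS p-value is almost always tiny, so gate on the KS statistic
    # (an effect size) rather than on significance.
    ks_stat, _ = stats.ks_2samp(champion_scores, shadow_scores)
    distribution_ok = ks_stat < max_ks_statistic

    return {
        "shadow_latency_p99_ms": p99,
        "latency_ok": latency_ok,
        "ks_statistic": float(ks_stat),
        "distribution_ok": distribution_ok,
        "promote_to_canary": latency_ok and distribution_ok,
    }
```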
Phase 2 — Canary Rollout (1–5% traffic)
Direct a small slice of real traffic to the new model and serve its predictions. Monitor: error rate (< 0.1%), latency P99 (< SLA), prediction distribution (consistent with shadow phase). Gate: no anomalies after 4–24 hours at 1% traffic. Rollback trigger: automatic if error rate spikes > 1% for 5 consecutive minutes or latency P99 > 2× baseline for 10 minutes.
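The canary gates translate directly into a per-minute check; a sketch, assuming error rate and latency P99 are already aggregated per minute of canary traffic (the window lengths mirror the triggers above):

```python
def should_rollback_canary(per_minute_metrics, baseline_latency_p99_ms,
                           error_rate_threshold=0.01,   # 1% error rate
                           error_window_minutes=5,
                           latency_multiplier=2.0,      # 2x baseline P99
                           latency_window_minutes=10):
    """per_minute_metrics: list of dicts (oldest first), each with
    'error_rate' and 'latency_p99_ms' for one minute of canary traffic."""
    errors = per_minute_metrics[-error_window_minutes:]
    error_breach = (len(errors) == error_window_minutes and
                    all(m["error_rate"] > error_rate_threshold for m in errors))

    latencies = per_minute_metrics[-latency_window_minutes:]
    latency_breach = (len(latencies) == latency_window_minutes and
                      all(m["latency_p99_ms"] > latency_multiplier * baseline_latency_p99_ms
                          for m in latencies))

    return error_breach or latency_breach
```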
Phase 3 — A/B Test (10–50% traffic)
Proper experiment: control (champion model) vs treatment (challenger model). Traffic split: 50/50 for fastest statistical convergence; use smaller treatment fractions only if side effects are asymmetric. Minimum run duration: calculated from power analysis (typically 1–2 weeks to detect a 2% change in primary metric with 80% power, α=0.05). Guard rail metrics: user harm metrics (complaints, error rates) must not degrade. Primary metric: business objective (CTR, revenue, session length). Two-sided test for surprise effects.
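The sample-size calculation is a few lines with statsmodels; the baseline CTR, lift, and daily traffic below are illustrative assumptions, not figures from the text:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05                      # assumed control click-through rate
treatment_ctr = baseline_ctr * 1.02      # 2% relative lift we want to detect

effect_size = proportion_effectsize(treatment_ctr, baseline_ctr)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)

daily_users_per_arm = 50_000             # assumed traffic per arm per day
print(f"Required users per arm: {n_per_arm:,.0f}")
print(f"Minimum duration: {n_per_arm / daily_users_per_arm:.1f} days")
```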
Phase 4 — Full Rollout and Champion-Challenger
After A/B success, ramp to 100% traffic. The old model doesn't get turned off immediately — set up champion-challenger monitoring where the new champion serves traffic but 5% is shadowed to the old model for 2 weeks. This gives a rollback baseline and catches delayed degradation. After 2 weeks of stability, decommission the old model.
ML Deployment Pipeline: Shadow → Canary → A/B → Full Rollout
SLO Design for ML Models
Standard SLOs (latency, error rate, availability) are necessary but not sufficient for ML systems. You need ML-specific SLOs that track model health, not just system health.
The three-layer ML SLO hierarchy:
Layer 1 — Infrastructure SLOs (latency, availability):
- Serving latency: P50 < 50ms, P99 < 200ms
- Error rate: < 0.1% of predictions fail
- Feature pipeline freshness: features updated within 5 minutes of event time
These are standard SRE SLOs. Breach these and the model is broken, full stop.
Layer 2 — Model health SLOs (prediction distribution):
- Score distribution stability: mean predicted score within 15% of 30-day rolling average
- Prediction volume: predictions/second within 20% of baseline (detects upstream issues silently killing traffic)
- Feature null rate: < 5% null values per critical feature (feature pipeline degradation)
These don't require ground truth labels — you can monitor them in real-time; a sketch of these checks follows the hierarchy below.
Layer 3 — Business SLOs (with label delay):
- Primary metric: CTR, revenue, conversion — compare 7-day rolling average week-over-week
- Proxy metric: click-through rate or dwell time as a leading indicator before conversion data
- Delayed ground truth: for fraud detection (30-day label delay) or credit risk (6-month label delay), use proxy signals — disputed transactions, manual review flags — as early warning.
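A sketch of the Layer 2 checks, which run in real time with no labels; inputs are assumed to come from whatever prediction and feature logging you already have, and the thresholds mirror the SLOs above:

```python
import numpy as np

def model_health_slos(recent_scores, rolling_30d_mean_score,
                      recent_preds_per_sec, baseline_preds_per_sec,
                      feature_null_rates):
    """Label-free model health checks over a recent monitoring window.
    feature_null_rates: {feature_name: fraction of nulls} for critical features."""
    score_drift = abs(np.mean(recent_scores) - rolling_30d_mean_score) / rolling_30d_mean_score
    volume_drift = abs(recent_preds_per_sec - baseline_preds_per_sec) / baseline_preds_per_sec
    return {
        "score_stability_ok": score_drift <= 0.15,     # within 15% of 30-day mean
        "prediction_volume_ok": volume_drift <= 0.20,  # within 20% of baseline
        "feature_nulls_ok": all(rate < 0.05 for rate in feature_null_rates.values()),
    }
```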
The label delay problem: You cannot measure fraud model accuracy today because you don't know which transactions are truly fraudulent until next month. Fix: build proxy-label systems. Use fast-feedback signals (chargebacks filed within 72 hours, cardholder reports) as your real-time accuracy proxy. At Meta, Instagram feed ranking uses "long-term user satisfaction" labels (surveyed weekly) as the ground truth, but 7-second video dwell time as the real-time proxy metric.
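One way to operationalize proxy-label monitoring, sketched under assumptions: transactions carry hypothetical fields for the model's decision and for any chargeback filed so far, and only transactions whose 72-hour feedback window has closed are scored:

```python
from datetime import datetime, timedelta, timezone

def proxy_precision_recall(transactions, feedback_window_hours=72, now=None):
    """Estimate fraud-model precision/recall from fast-feedback proxy labels.
    transactions: dicts with hypothetical fields 'scored_at', 'flagged'
    (model called it fraud) and 'chargeback_at' (None if none filed yet)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=feedback_window_hours)

    tp = fp = fn = 0
    for t in transactions:
        if t["scored_at"] > cutoff:
            continue  # feedback window still open; proxy label not trustworthy yet
        proxy_fraud = t["chargeback_at"] is not None
        if t["flagged"] and proxy_fraud:
            tp += 1
        elif t["flagged"]:
            fp += 1
        elif proxy_fraud:
            fn += 1

    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"proxy_precision": precision, "proxy_recall": recall}
```

The proxy numbers will be biased (not all fraud triggers a fast chargeback), so track them for trend against their own baseline rather than treating them as true accuracy.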
A/B Testing for ML: Key Decisions and Common Mistakes
| Decision | Wrong approach | Right approach |
|---|---|---|
| Traffic split | 90/10 to limit exposure | 50/50 whenever possible — halves the experiment duration for the same power |
| Experiment duration | Run until p < 0.05 | Pre-calculate sample size from power analysis. Stopping early (peeking) inflates false positive rate by 2-3× |
| Primary metric | Use offline metric (AUC) | Use the business objective the model was built to improve (CTR, revenue, session length) |
| Guard rail metrics | Only track primary metric | Always include user harm metrics: complaint rate, unsubscribe rate, error rate. These must not degrade even if primary metric improves |
| Novelty effect | Declare winner after 3 days | Run for at least 1 full week to let novelty effect wash out. Users click on anything new for 24-48 hours |
| Variance reduction | Simple mean comparison | Use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by 20-50%, shortening experiment duration |
| Network effects | Treat users as independent | Use cluster-based randomization if users interact (social networks, marketplaces) — standard A/B assigns the same user's friends to different variants, contaminating both groups |
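A minimal CUPED sketch, assuming each user's metric is available both for the experiment window (y) and for a pre-experiment window (x); theta is estimated once and the same adjustment is applied to both arms:

```python
import numpy as np

def cuped_adjust(y, x):
    """Remove the part of y explained by the pre-experiment covariate x.
    The adjusted metric keeps the same mean but has lower variance."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# Illustrative variance reduction on synthetic, strongly correlated data.
rng = np.random.default_rng(0)
x = rng.normal(10, 3, 100_000)              # pre-experiment metric per user
y = 0.8 * x + rng.normal(0, 1.5, 100_000)   # in-experiment metric per user
y_adj = cuped_adjust(y, x)
print(f"Variance reduction: {1 - np.var(y_adj) / np.var(y):.0%}")
```

The adjusted metric is then compared between control and treatment exactly as before; the reduction you actually get depends on how strongly pre- and in-experiment behavior correlate.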
Rollback: When and How
Every model deployment needs a rollback plan designed before deployment, not after an incident.
Automatic rollback triggers (no human required):
- Latency P99 exceeds 2× baseline for 10 consecutive minutes
- Error rate exceeds 1% for 5 consecutive minutes
- Prediction volume drops more than 30% below expected volume (upstream breakage)
Human-triggered rollback signals:
- Prediction distribution shifts > 2σ from 7-day rolling average
- Primary business metric drops > 5% week-over-week with no external explanation
- Guard rail metric (complaint rate, refund rate) spikes > 2σ
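The human-triggered signals can still be computed automatically and routed to an on-call alert; a sketch of the 2σ and week-over-week checks above, with daily aggregates assumed as inputs:

```python
import numpy as np

def review_flags(mean_scores_7d, todays_mean_score,
                 guardrail_history, todays_guardrail,
                 metric_this_week, metric_last_week):
    """Surface signals for a human rollback decision; nothing here acts on its own.
    mean_scores_7d: mean predicted score for each of the last 7 days.
    guardrail_history: daily values of a guard rail metric (e.g. complaint rate)."""
    def sigmas_from_mean(history, today):
        return abs(today - np.mean(history)) / np.std(history, ddof=1)

    return {
        "prediction_shift_2sigma": sigmas_from_mean(mean_scores_7d, todays_mean_score) > 2,
        "guardrail_spike_2sigma": sigmas_from_mean(guardrail_history, todays_guardrail) > 2,
        "primary_metric_drop_5pct":
            (metric_last_week - metric_this_week) / metric_last_week > 0.05,
    }
```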
Rollback mechanism: Traffic routing, not model replacement. Keep both model versions deployed. A rollback is a weight change in the traffic router from new: 100%, old: 0% to new: 0%, old: 100%. This takes < 30 seconds, not a re-deploy. Seldon Core, BentoML, and AWS SageMaker all support traffic weight routing natively.
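The routing change itself is tiny; a generic sketch, with `set_traffic_weights` standing in for whatever traffic-split API your serving platform exposes (each of the tools above has its own equivalent):

```python
def set_traffic_weights(endpoint, weights):
    """Hypothetical stand-in for the serving platform's traffic-split call."""
    print(f"[{endpoint}] routing weights -> {weights}")

def rollback(endpoint, champion_version, challenger_version):
    # Rollback is a routing change, not a re-deploy: both versions stay warm,
    # so restoring the champion takes seconds.
    set_traffic_weights(endpoint, {champion_version: 100, challenger_version: 0})

rollback("ranker-prod", champion_version="v41", challenger_version="v42")
```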
The 5-minute rollback SLO: Any ML team should be able to restore the champion model within 5 minutes of a rollback decision. If rollback requires a rebuild or re-deploy, it will take 30+ minutes — during which users are getting worse predictions. Blue-green deployment (keep old model warm) is the prerequisite.
Post-rollback diagnosis: Run a diff between the shadow logs from the new model and the champion model for the period when degradation was detected. Look for: which user segments had the largest prediction divergence? Which features had the highest PSI drift right before degradation? This tells you why the new model failed in production when the offline metrics were fine.
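A sketch of the per-feature PSI used in that diagnosis, assuming a continuous feature; bins are quantiles of the reference (pre-degradation) sample, and PSI above roughly 0.2 is conventionally read as a meaningful shift:

```python
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of one feature between a reference sample
    and a current sample. Assumes a continuous feature (heavy ties can
    produce duplicate quantile edges)."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    # Widen the outer edges so current values outside the reference range
    # still land in the first or last bin.
    edges[0] = min(edges[0], np.min(current)) - 1e-9
    edges[-1] = max(edges[-1], np.max(current)) + 1e-9
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_frac = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Rank features by drift in the window right before degradation was detected:
# worst = sorted(feature_names, key=lambda f: psi(ref[f], bad_window[f]), reverse=True)
```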
Most Common Production ML Evaluation Mistakes
1. Skipping shadow deployment. Going straight from offline evaluation to canary. Shadow deployment catches latency regressions and feature pipeline dependencies that offline evaluation cannot.
2. Stopping the A/B test when p < 0.05. Peeking at results daily and stopping when significance is reached inflates the false positive rate. With daily peeking over a 2-week test, the realized false positive rate at a nominal p=0.05 lands in the 20–30% range, not 5% (see the simulation sketch after this list).
3. No guard rail metrics. Running an A/B where the primary metric (clicks) improves but complaint rate also increases. This is a harmful model that scores well on the wrong metric.
4. Using offline AUC as the A/B primary metric. AUC measures ranking quality on a static holdout set. Online, users react to the ranking, changing future interactions. Always A/B on business metrics, not offline metrics.
5. Decommissioning the old model immediately after rollout. Keep the champion warm for 2 weeks. If something goes wrong, you need a 30-second rollback, not a 30-minute re-deploy.
6. No label-delay-aware monitoring. For fraud, healthcare, or credit risk models, naive accuracy monitoring gives a false "everything is fine" signal for weeks while the model degrades. Always design proxy-label monitoring for high-delay domains.
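Mistake 2 is easy to check empirically; a small simulation under the null (the two arms are identical), peeking once per day for 14 days. The exact inflation depends on sample sizes and how often you peek, but it lands several times above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_days, users_per_day = 1_000, 14, 1_000

false_positives = 0
for _ in range(n_experiments):
    control = rng.normal(0, 1, (n_days, users_per_day))
    treatment = rng.normal(0, 1, (n_days, users_per_day))  # identical distribution
    for day in range(1, n_days + 1):
        # "Peek": test all data accumulated so far and stop at significance.
        _, p = stats.ttest_ind(control[:day].ravel(), treatment[:day].ravel())
        if p < 0.05:
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_experiments:.1%}")
```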
Interview Delivery Summary
When asked "how would you deploy and monitor a new ML model?", use the 4-phase lifecycle: Shadow (zero user impact, validate latency + distribution) → Canary (1-5%, real users, automatic rollback) → A/B test (pre-powered, 50/50, business metric primary, guard rails) → Full rollout + champion-challenger.
Then add the three-layer SLO hierarchy (infra → model health → business) and the label delay problem if relevant to the domain. Finish with the 5-minute rollback SLO and traffic-routing vs re-deploy distinction.
The staff signal: bring up guard rail metrics and the label delay problem before the interviewer asks. These are the gaps that distinguish mid-level from staff candidates.