ML Model Evaluation & Production Monitoring: Shadow Mode, A/B Testing & Rollback
Production ML evaluation is fundamentally different from offline evaluation. Covers shadow deployment, champion-challenger A/B testing, canary rollouts, SLO design for ML systems, rollback triggers, and the metrics that reveal model degradation before users notice. The end-to-end playbook for safely deploying and monitoring ML models.
Why Production ML Evaluation Is Different
Deploying a new ML model is not like deploying a new API endpoint. A model that improves AUC by 3% offline can decrease revenue by 2% online. A model that passes all CI checks can silently degrade 6 weeks later as the world changes. The evaluation problem doesn't end at deployment — it begins there.
Three facts make ML deployment uniquely dangerous:
- Offline metrics don't predict online impact. AUC, NDCG, and F1 are computed on a static held-out set. Online, users respond dynamically — exposure to a new recommendation changes what they click next, which changes the training data, which changes the next model. The feedback loop is absent in offline evaluation.
- Model failures are often silent. A failing API returns a 500 error. A degraded model returns a confident but wrong prediction — no error, no alarm, just quietly worse decisions. The system keeps running while business metrics erode.
- The right model for yesterday may be wrong today. Distribution shift means a model's accuracy degrades over time even without any code changes. What you shipped 3 months ago needs to be actively maintained, not just monitored.
The solution: A layered deployment and monitoring strategy — shadow mode before exposure, canary before full traffic, A/B test before commit, and continuous monitoring after.
What Earns Each Level on This Topic
6/10: Knows A/B testing exists. Mentions "compare the new model to the old one." Describes offline evaluation metrics.
8/10: Describes shadow deployment before A/B, canary rollout with traffic gating, and rollback triggers. Mentions metric hierarchy (business → proxy → technical).
10/10: Designs the full evaluation lifecycle — shadow mode → canary → A/B with proper power analysis → full rollout. Explains SLO design for ML (what's the SLO on model accuracy? how do you measure it in real-time without ground truth?). Discusses champion-challenger continuous evaluation in production. Names specific tools (Evidently AI, Seldon, MLflow, SageMaker Model Monitor). Addresses label delay problem: how do you evaluate a fraud model when true fraud labels arrive 30 days later?
The ML Deployment Evaluation Lifecycle
Phase 1 — Shadow Deployment (0% user impact)
Run the new model in parallel with the champion model on 100% of real traffic. The shadow model's predictions are logged but never served. Compare: prediction distribution (is the new model scoring differently?), latency (does it fit the SLA?), feature pipeline dependencies (are all features available?). Duration: 24–72 hours. Gate: shadow latency P99 < 200ms AND prediction distribution stable. Cost: 2× serving compute while shadow runs.
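A minimal sketch of the shadow-phase gate, assuming both models' scores and the shadow model's latencies have already been logged for the same requests; the two-sample KS test stands in for whatever distribution comparison your logging stack supports:

```python
import numpy as np
from scipy import stats

def shadow_gate(champion_scores, shadow_scores, shadow_latencies_ms,
                latency_p99_slo_ms=200.0, max_ks_statistic=0.1):
    """Decide whether the shadow model may proceed to canary."""
    # Gate 1: latency. The shadow model's P99 must fit the serving SLA.
    p99 = float(np.percentile(shadow_latencies_ms, 99))
    latency_ok = p99 < latency_p99_slo_ms

    # Gate 2: prediction distribution. With millions of logged requests the
    # KS p-value is almost always tiny, so gate on the KS statistic
    # (an effect size) rather than on significance.
    ks_stat, _ = stats.ks_2samp(champion_scores, shadow_scores)
    distribution_ok = ks_stat < max_ks_statistic

    return {
        "shadow_latency_p99_ms": p99,
        "latency_ok": latency_ok,
        "ks_statistic": float(ks_stat),
        "distribution_ok": distribution_ok,
        "promote_to_canary": latency_ok and distribution_ok,
    }
```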
Phase 2 — Canary Rollout (1–5% traffic)
Direct a small slice of real traffic to the new model and serve its predictions. Monitor: error rate (< 0.1%), latency P99 (< SLA), prediction distribution (consistent with shadow phase). Gate: no anomalies after 4–24 hours at 1% traffic. Rollback trigger: automatic if error rate spikes > 1% for 5 consecutive minutes or latency P99 > 2× baseline for 10 minutes.
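The canary gates translate directly into a per-minute check; a sketch, assuming error rate and latency P99 are already aggregated per minute of canary traffic (the window lengths mirror the triggers above):

```python
def should_rollback_canary(per_minute_metrics, baseline_latency_p99_ms,
                           error_rate_threshold=0.01,   # 1% error rate
                           error_window_minutes=5,
                           latency_multiplier=2.0,      # 2x baseline P99
                           latency_window_minutes=10):
    """per_minute_metrics: list of dicts (oldest first), each with
    'error_rate' and 'latency_p99_ms' for one minute of canary traffic."""
    errors = per_minute_metrics[-error_window_minutes:]
    error_breach = (len(errors) == error_window_minutes and
                    all(m["error_rate"] > error_rate_threshold for m in errors))

    latencies = per_minute_metrics[-latency_window_minutes:]
    latency_breach = (len(latencies) == latency_window_minutes and
                      all(m["latency_p99_ms"] > latency_multiplier * baseline_latency_p99_ms
                          for m in latencies))

    return error_breach or latency_breach
```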
Phase 3 — A/B Test (10–50% traffic)
Proper experiment: control (champion model) vs treatment (challenger model). Traffic split: 50/50 for fastest statistical convergence; use smaller treatment fractions only if side effects are asymmetric. Minimum run duration: calculated from power analysis (typically 1–2 weeks to detect a 2% change in primary metric with 80% power, α=0.05). Guard rail metrics: user harm metrics (complaints, error rates) must not degrade. Primary metric: business objective (CTR, revenue, session length). Two-sided test for surprise effects.
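The sample-size calculation is a few lines with statsmodels; the baseline CTR, lift, and daily traffic below are illustrative assumptions, not figures from the text:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05                      # assumed control click-through rate
treatment_ctr = baseline_ctr * 1.02      # 2% relative lift we want to detect

effect_size = proportion_effectsize(treatment_ctr, baseline_ctr)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)

daily_users_per_arm = 50_000             # assumed traffic per arm per day
print(f"Required users per arm: {n_per_arm:,.0f}")
print(f"Minimum duration: {n_per_arm / daily_users_per_arm:.1f} days")
```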
Phase 4 — Full Rollout and Champion-Challenger
After A/B success, ramp to 100% traffic. The old model doesn't get turned off immediately — set up champion-challenger monitoring where the new champion serves traffic but 5% is shadowed to the old model for 2 weeks. This gives a rollback baseline and catches delayed degradation. After 2 weeks of stability, decommission the old model.
ML Deployment Pipeline: Shadow → Canary → A/B → Full Rollout
SLO Design for ML Models
Standard SLOs (latency, error rate, availability) are necessary but not sufficient for ML systems. You need ML-specific SLOs that track model health, not just system health.
The three-layer ML SLO hierarchy:
Layer 1 — Infrastructure SLOs (latency, availability):
- Serving latency: P50 < 50ms, P99 < 200ms
- Error rate: < 0.1% of predictions fail
- Feature pipeline freshness: features updated within 5 minutes of event time
These are standard SRE SLOs. Breach these and the model is broken, full stop.
Layer 2 — Model health SLOs (prediction distribution):
- Score distribution stability: mean predicted score within 15% of 30-day rolling average
- Prediction volume: predictions/second within 20% of baseline (detects upstream issues silently killing traffic)
- Feature null rate: < 5% null values per critical feature (feature pipeline degradation)
These don't require ground truth labels — you can monitor them in real-time; a sketch of these checks follows the hierarchy below.
Layer 3 — Business SLOs (with label delay):
- Primary metric: CTR, revenue, conversion — compare 7-day rolling average week-over-week
- Proxy metric: click-through rate or dwell time as a leading indicator before conversion data
- Delayed ground truth: for fraud detection (30-day label delay) or credit risk (6-month label delay), use proxy signals — disputed transactions, manual review flags — as early warning.
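A sketch of the Layer 2 checks, which run in real time with no labels; inputs are assumed to come from whatever prediction and feature logging you already have, and the thresholds mirror the SLOs above:

```python
import numpy as np

def model_health_slos(recent_scores, rolling_30d_mean_score,
                      recent_preds_per_sec, baseline_preds_per_sec,
                      feature_null_rates):
    """Label-free model health checks over a recent monitoring window.
    feature_null_rates: {feature_name: fraction of nulls} for critical features."""
    score_drift = abs(np.mean(recent_scores) - rolling_30d_mean_score) / rolling_30d_mean_score
    volume_drift = abs(recent_preds_per_sec - baseline_preds_per_sec) / baseline_preds_per_sec
    return {
        "score_stability_ok": score_drift <= 0.15,     # within 15% of 30-day mean
        "prediction_volume_ok": volume_drift <= 0.20,  # within 20% of baseline
        "feature_nulls_ok": all(rate < 0.05 for rate in feature_null_rates.values()),
    }
```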
The label delay problem: You cannot measure fraud model accuracy today because you don't know which transactions are truly fraudulent until next month. Fix: build proxy-label systems. Use fast-feedback signals (chargebacks filed within 72 hours, cardholder reports) as your real-time accuracy proxy. At Meta, Instagram feed ranking uses "long-term user satisfaction" labels (surveyed weekly) as the ground truth, but 7-second video dwell time as the real-time proxy metric.
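One way to operationalize proxy-label monitoring, sketched under assumptions: transactions carry hypothetical fields for the model's decision and for any chargeback filed so far, and only transactions whose 72-hour feedback window has closed are scored:

```python
from datetime import datetime, timedelta, timezone

def proxy_precision_recall(transactions, feedback_window_hours=72, now=None):
    """Estimate fraud-model precision/recall from fast-feedback proxy labels.
    transactions: dicts with hypothetical fields 'scored_at', 'flagged'
    (model called it fraud) and 'chargeback_at' (None if none filed yet)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=feedback_window_hours)

    tp = fp = fn = 0
    for t in transactions:
        if t["scored_at"] > cutoff:
            continue  # feedback window still open; proxy label not trustworthy yet
        proxy_fraud = t["chargeback_at"] is not None
        if t["flagged"] and proxy_fraud:
            tp += 1
        elif t["flagged"]:
            fp += 1
        elif proxy_fraud:
            fn += 1

    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"proxy_precision": precision, "proxy_recall": recall}
```

The proxy numbers will be biased (not all fraud triggers a fast chargeback), so track them for trend against their own baseline rather than treating them as true accuracy.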
A/B Testing for ML: Key Decisions and Common Mistakes
| Decision | Wrong approach | Right approach |
|---|---|---|
| Traffic split | 90/10 to limit exposure | 50/50 whenever possible — halves the experiment duration for the same power |
| Experiment duration | Run until p < 0.05 | Pre-calculate sample size from power analysis. Stopping early (peeking) inflates false positive rate by 2-3× |
| Primary metric | Use offline metric (AUC) | Use the business objective the model was built to improve (CTR, revenue, session length) |
| Guard rail metrics | Only track primary metric | Always include user harm metrics: complaint rate, unsubscribe rate, error rate. These must not degrade even if primary metric improves |
| Novelty effect | Declare winner after 3 days | Run for at least 1 full week to let novelty effect wash out. Users click on anything new for 24-48 hours |
| Variance reduction | Simple mean comparison | Use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by 20-50%, shortening experiment duration |
| Network effects | Treat users as independent | Use cluster-based randomization if users interact (social networks, marketplaces) — standard A/B assigns the same user's friends to different variants, contaminating both groups |
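A minimal CUPED sketch, assuming each user's metric is available both for the experiment window (y) and for a pre-experiment window (x); theta is estimated once and the same adjustment is applied to both arms:

```python
import numpy as np

def cuped_adjust(y, x):
    """Remove the part of y explained by the pre-experiment covariate x.
    The adjusted metric keeps the same mean but has lower variance."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# Illustrative variance reduction on synthetic, strongly correlated data.
rng = np.random.default_rng(0)
x = rng.normal(10, 3, 100_000)              # pre-experiment metric per user
y = 0.8 * x + rng.normal(0, 1.5, 100_000)   # in-experiment metric per user
y_adj = cuped_adjust(y, x)
print(f"Variance reduction: {1 - np.var(y_adj) / np.var(y):.0%}")
```

The adjusted metric is then compared between control and treatment exactly as before; the reduction you actually get depends on how strongly pre- and in-experiment behavior correlate.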
Rollback: When and How
Every model deployment needs a rollback plan designed before deployment, not after an incident.
Automatic rollback triggers (no human required):
- Latency P99 exceeds 2× baseline for 10 consecutive minutes
- Error rate exceeds 1% for 5 consecutive minutes
- Prediction volume drops more than 30% below expected volume (upstream breakage)
Human-triggered rollback signals:
- Prediction distribution shifts > 2σ from 7-day rolling average
- Primary business metric drops > 5% week-over-week with no external explanation
- Guard rail metric (complaint rate, refund rate) spikes > 2σ
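The human-triggered signals can still be computed automatically and routed to an on-call alert; a sketch of the 2σ and week-over-week checks above, with daily aggregates assumed as inputs:

```python
import numpy as np

def review_flags(mean_scores_7d, todays_mean_score,
                 guardrail_history, todays_guardrail,
                 metric_this_week, metric_last_week):
    """Surface signals for a human rollback decision; nothing here acts on its own.
    mean_scores_7d: mean predicted score for each of the last 7 days.
    guardrail_history: daily values of a guard rail metric (e.g. complaint rate)."""
    def sigmas_from_mean(history, today):
        return abs(today - np.mean(history)) / np.std(history, ddof=1)

    return {
        "prediction_shift_2sigma": sigmas_from_mean(mean_scores_7d, todays_mean_score) > 2,
        "guardrail_spike_2sigma": sigmas_from_mean(guardrail_history, todays_guardrail) > 2,
        "primary_metric_drop_5pct":
            (metric_last_week - metric_this_week) / metric_last_week > 0.05,
    }
```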
Rollback mechanism: Traffic routing, not model replacement. Keep both model versions deployed. A rollback is a weight change in the traffic router from new: 100%, old: 0% to new: 0%, old: 100%. This takes < 30 seconds, not a re-deploy. Seldon Core, BentoML, and AWS SageMaker all support traffic weight routing natively.
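The routing change itself is tiny; a generic sketch, with `set_traffic_weights` standing in for whatever traffic-split API your serving platform exposes (each of the tools above has its own equivalent):

```python
def set_traffic_weights(endpoint, weights):
    """Hypothetical stand-in for the serving platform's traffic-split call."""
    print(f"[{endpoint}] routing weights -> {weights}")

def rollback(endpoint, champion_version, challenger_version):
    # Rollback is a routing change, not a re-deploy: both versions stay warm,
    # so restoring the champion takes seconds.
    set_traffic_weights(endpoint, {champion_version: 100, challenger_version: 0})

rollback("ranker-prod", champion_version="v41", challenger_version="v42")
```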
The 5-minute rollback SLO: Any ML team should be able to restore the champion model within 5 minutes of a rollback decision. If rollback requires a rebuild or re-deploy, it will take 30+ minutes — during which users are getting worse predictions. Blue-green deployment (keep old model warm) is the prerequisite.
Post-rollback diagnosis: Run a diff between the shadow logs from the new model and the champion model for the period when degradation was detected. Look for: which user segments had the largest prediction divergence? Which features had the highest PSI drift right before degradation? This tells you why the new model failed in production when the offline metrics were fine.
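A sketch of the per-feature PSI used in that diagnosis, assuming a continuous feature; bins are quantiles of the reference (pre-degradation) sample, and PSI above roughly 0.2 is conventionally read as a meaningful shift:

```python
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of one feature between a reference sample
    and a current sample. Assumes a continuous feature (heavy ties can
    produce duplicate quantile edges)."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    # Widen the outer edges so current values outside the reference range
    # still land in the first or last bin.
    edges[0] = min(edges[0], np.min(current)) - 1e-9
    edges[-1] = max(edges[-1], np.max(current)) + 1e-9
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_frac = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Rank features by drift in the window right before degradation was detected:
# worst = sorted(feature_names, key=lambda f: psi(ref[f], bad_window[f]), reverse=True)
```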
Most Common Production ML Evaluation Mistakes
1. Skipping shadow deployment. Going straight from offline evaluation to canary. Shadow deployment catches latency regressions and feature pipeline dependencies that offline evaluation cannot.
2. Stopping the A/B test when p < 0.05. Peeking at results daily and stopping when significance is reached inflates the false positive rate. With daily peeking over a 2-week test, the realized false positive rate at a nominal p=0.05 lands in the 20–30% range, not 5% (see the simulation sketch after this list).
3. No guard rail metrics. Running an A/B where the primary metric (clicks) improves but complaint rate also increases. This is a harmful model that scores well on the wrong metric.
4. Using offline AUC as the A/B primary metric. AUC measures ranking quality on a static holdout set. Online, users react to the ranking, changing future interactions. Always A/B on business metrics, not offline metrics.
5. Decommissioning the old model immediately after rollout. Keep the champion warm for 2 weeks. If something goes wrong, you need a 30-second rollback, not a 30-minute re-deploy.
6. No label-delay-aware monitoring. For fraud, healthcare, or credit risk models, naive accuracy monitoring gives a false "everything is fine" signal for weeks while the model degrades. Always design proxy-label monitoring for high-delay domains.
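Mistake 2 is easy to check empirically; a small simulation under the null (the two arms are identical), peeking once per day for 14 days. The exact inflation depends on sample sizes and how often you peek, but it lands several times above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_days, users_per_day = 1_000, 14, 1_000

false_positives = 0
for _ in range(n_experiments):
    control = rng.normal(0, 1, (n_days, users_per_day))
    treatment = rng.normal(0, 1, (n_days, users_per_day))  # identical distribution
    for day in range(1, n_days + 1):
        # "Peek": test all data accumulated so far and stop at significance.
        _, p = stats.ttest_ind(control[:day].ravel(), treatment[:day].ravel())
        if p < 0.05:
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_experiments:.1%}")
```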
Interview Delivery Summary
When asked "how would you deploy and monitor a new ML model?", use the 4-phase lifecycle: Shadow (zero user impact, validate latency + distribution) → Canary (1-5%, real users, automatic rollback) → A/B test (pre-powered, 50/50, business metric primary, guard rails) → Full rollout + champion-challenger.
Then add the three-layer SLO hierarchy (infra → model health → business) and the label delay problem if relevant to the domain. Finish with the 5-minute rollback SLO and traffic-routing vs re-deploy distinction.
The staff signal: bring up guard rail metrics and the label delay problem before the interviewer asks. These are the gaps that distinguish mid-level from staff candidates.