Scenario Walkthrough: Recommendation Model CTR Dropped 15% Overnight
A structured incident walkthrough for sudden ML metric degradation. Learn how to separate feature drift, label drift, training-serving skew, serving regressions, and upstream schema failures with fast falsification checks and executive-safe communication.
What This Scenario Actually Tests
This scenario tests whether you debug ML as a production system, not as a notebook exercise. A 15% CTR drop can originate from model quality, feature freshness, serving latency, candidate retrieval, UI behavior, or even event instrumentation defects. Jumping to retraining is usually the wrong first move.
Interviewers are looking for a triage order: verify metric integrity, localize impact, classify the failure domain, apply reversible containment, then prove root cause with falsifiable checks. The hard part is making correct decisions under uncertainty while user impact is ongoing.
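As a concrete illustration of the first two steps, the sketch below recomputes CTR directly from raw events and breaks it down by client dimensions, to check whether the dashboard number is trustworthy and whether the drop is global or concentrated in one segment. The event schema, column names, and dimensions are illustrative assumptions, not a prescribed implementation.

```python
# Hypothetical triage sketch: verify metric integrity and localize impact
# before touching the model. Column names are assumptions for illustration.
import pandas as pd

def recompute_ctr(events: pd.DataFrame) -> float:
    """Recompute CTR directly from raw impression/click events,
    independently of the dashboard pipeline."""
    impressions = (events["event_type"] == "impression").sum()
    clicks = (events["event_type"] == "click").sum()
    return clicks / impressions if impressions else float("nan")

def localize_drop(events: pd.DataFrame,
                  dims=("platform", "app_version", "surface")) -> pd.DataFrame:
    """Break CTR down by segment to see whether the drop is global
    or concentrated (one client version, one surface, one placement)."""
    rows = []
    for dim in dims:
        for value, group in events.groupby(dim):
            rows.append({"dimension": dim, "value": value,
                         "ctr": recompute_ctr(group)})
    # Lowest-CTR segments surface first; a single cratered segment points
    # away from "the model got worse" and toward a client or routing issue.
    return pd.DataFrame(rows).sort_values("ctr")
```

If the recomputed global CTR disagrees with the dashboard, the incident may be an instrumentation or pipeline defect rather than a model problem, which changes the containment decision entirely.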
Strong answers distinguish "model got worse" from "model was never invoked correctly." Many real incidents are routing bugs, stale features, schema mismatches, or fallback-path activation. These look like quality degradation but are operational defects.
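One fast falsification check for the "never invoked correctly" hypothesis is to read the serving logs before touching the model. The sketch below is a minimal version, assuming hypothetical log columns such as model_version, served_by, status, and missing_feature_count; a real serving stack will expose equivalents under different names.

```python
# Hypothetical serving-log check: was the intended model actually invoked,
# or did traffic fall through to a fallback path? Column names are assumptions.
import pandas as pd

def invocation_health(serving_logs: pd.DataFrame, expected_model: str) -> dict:
    """Summarize how often requests were served by the expected model version,
    how often the fallback path fired, and how often features were missing."""
    return {
        "requests": len(serving_logs),
        "expected_model_share": (serving_logs["model_version"] == expected_model).mean(),
        "fallback_rate": (serving_logs["served_by"] == "fallback").mean(),
        "timeout_rate": (serving_logs["status"] == "timeout").mean(),
        "missing_feature_rate": serving_logs["missing_feature_count"].gt(0).mean(),
    }
```

A high fallback or missing-feature rate that started at the same time as the CTR drop is strong evidence of an operational defect, and the right containment is to fix routing or feature delivery, not to retrain.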
Staff-level responses include communication discipline and guardrail design: confidence-scored updates, explicit decision checkpoints, and follow-up controls such as feature contract tests, shadow canaries, and a rollback policy tied to quality SLOs.
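A feature contract test is one such follow-up control: a small check that fails loudly when a feature's presence, type, range, or freshness drifts outside the agreed contract. The sketch below assumes a dict-like feature payload; the field names, ranges, and freshness threshold are illustrative assumptions.

```python
# Minimal sketch of a feature contract test. Field names, ranges, and the
# freshness threshold are illustrative assumptions, not a real contract.
from datetime import datetime, timedelta, timezone

FEATURE_CONTRACT = {
    "user_ctr_7d": {"type": float, "min": 0.0, "max": 1.0},
    "item_embedding_version": {"type": str},
}
MAX_FEATURE_AGE = timedelta(hours=6)

def check_feature_contract(payload: dict, computed_at: datetime) -> list[str]:
    """Return a list of contract violations; an empty list means the payload
    passes. `computed_at` is assumed to be timezone-aware."""
    violations = []
    for name, spec in FEATURE_CONTRACT.items():
        if name not in payload:
            violations.append(f"missing feature: {name}")
            continue
        value = payload[name]
        if not isinstance(value, spec["type"]):
            violations.append(
                f"{name}: expected {spec['type'].__name__}, got {type(value).__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            violations.append(f"{name}: {value} below contract minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            violations.append(f"{name}: {value} above contract maximum {spec['max']}")
    if datetime.now(timezone.utc) - computed_at > MAX_FEATURE_AGE:
        violations.append("feature payload is stale")
    return violations
```

Run in CI against pipeline output and continuously against a sample of serving traffic, this kind of check turns silent schema or freshness failures into explicit alerts before they show up as a CTR drop.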