Preview — Pro guide
You are seeing a portion of this guide. Sign in and upgrade to unlock the full article, quizzes, and interview answers.
Sections
Related Guides
On-Call Incident Response: The First 30 Minutes
Production Engineering
A/B Test Critique: Finding Flaws in Experiment Designs
Production Engineering
Writing the Blameless Postmortem: RCAs That Actually Drive Change
Production Engineering
ML Model Evaluation & Production Monitoring: Shadow Mode, A/B Testing & Rollback
ML System Design
A/B Testing & Experimentation at Scale
Machine Learning
Metric Anomaly Triage: Is This a Real Problem or an Instrumentation Bug?
Structured methodology for diagnosing sudden metric drops or spikes before escalating. Covers the validation-first approach, driver-tree decomposition, correlated metric analysis, and the discipline of scoping anomalies before jumping to product explanations. Tested at senior and staff IC levels at every major tech company.
The Scenario That Trips Up Most Engineers
It's 9am. Your PM Slacks you: "DAU dropped 15% overnight. We're freaking out. What happened?"
The wrong response: Open the product dashboards and start generating hypotheses about which feature caused it.
The right response: Before touching the product at all, answer one question — is the measurement trustworthy?
The single most common cause of apparent metric drops at mature companies is not a product problem — it's a data pipeline problem. Logging outages, schema changes, deduplication logic shifts, timezone misalignments, and ETL delays cause the majority of "metric anomalies" that page engineers at 2am. Engineers who don't know this spend hours debugging a product that is working fine.
This matters in interviews because senior engineers are expected to validate the signal before investigating the cause. Candidates who jump straight to product hypotheses reveal they have not been on-call in a production environment.
What Interviewers Are Testing
L4/Mid signal: Can you list what you would check? Do you remember to look at recent deploys?
L5/Senior signal: Do you validate the instrumentation before the product? Do you decompose the metric into components to isolate the break? Do you know which correlated signals to check — e.g., if server requests are flat but DAU dropped, the problem is in measurement, not product?
Staff signal: Can you distinguish a data pipeline latency artifact from a true business metric change? Do you know how the metric is computed at the infrastructure level? Do you proactively communicate to stakeholders while investigating rather than going dark?