Root Cause Analysis Framework: Investigating Metric Drops and Production Incidents
The 5-step RCA framework used by senior data scientists at top tech companies, from the data quality audit through discriminating hypothesis tests to structured PSA (Problem, Signal, Action) communication. Covers the full diagnostic process for DAU drops, engagement declines, and production anomalies, with worked examples from real interview scenarios.
Why RCA Is the Core Data Science Interview Skill
Metric investigation questions appear in nearly every senior DS interview. They're designed to separate candidates who can run SQL from candidates who can think systematically under pressure.
The failure mode isn't wrong answers — it's wrong sequence. Most 6/10 answers jump straight to product hypotheses: "Maybe a feature broke" or "Maybe there was a competitor launch." They skip the most important check in every investigation: is the data even correct?
Before explaining a metric drop, you must confirm the signal is real. Before forming a hypothesis, you must segment to find where the drop is concentrated. Before concluding, you must design a test that rules out alternatives — not one that merely confirms your preferred hypothesis.
The 5-step framework — Confirm → Segment → Hypothesize → Test → Communicate — is the structure that separates a senior DS from a junior analyst in every metric investigation conversation at Meta, Airbnb, Google, and Lyft.
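As a rough illustration of the Confirm step, the sketch below runs two checks before any product hypothesis is formed: whether today's value is unusual relative to comparable prior days, and whether the pipeline's count agrees with an independently collected count. The function names, the z-score threshold of 3, the 2% tolerance, and all numbers are assumptions made for this example, not part of the framework itself.

```python
from statistics import mean, stdev


def is_drop_unusual(history, today, z_threshold=3.0):
    """Return (z_score, flagged) for today's value against a trailing baseline.

    history: metric values from comparable prior days (assumed input, for
    example the same weekday over previous weeks). Needs at least two values.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline gives no scale; flag any change at all.
        return 0.0, today != baseline
    z = (today - baseline) / spread
    return z, abs(z) >= z_threshold


def sources_agree(pipeline_count, independent_count, tolerance=0.02):
    """True when two independently collected counts differ by at most
    `tolerance`, measured relative to the independent source."""
    if independent_count == 0:
        return pipeline_count == 0
    gap = abs(pipeline_count - independent_count) / independent_count
    return gap <= tolerance


# Hypothetical numbers: the dashboard shows 600 against a baseline near 1000.
history = [1000, 1020, 980, 1010, 990]
z, flagged = is_drop_unusual(history, today=600)

# If an independent source (say, server-side logs) still reports about 995,
# the two disagree, which points at instrumentation rather than the product.
pipeline_matches_logs = sources_agree(600, 995)
```

If the drop is flagged but the sources disagree, the investigation stays in the pipeline; only when both checks say the drop is real does segmentation become worthwhile.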
What Interviewers Evaluate in RCA Questions
Four signals that distinguish a 9/10 RCA answer:
- Data quality first: the best candidates say "I'd check the pipeline before assuming it's a product issue." Most candidates skip this entirely.
- MECE segmentation: they decompose the metric into mutually exclusive, collectively exhaustive segments (platform → geography → cohort → surface), with no double-counting and no missing segments.
- Falsifiable hypotheses: "iOS push notification service failed" is falsifiable (check the delivery rate). "Something broke" is not a hypothesis.
- PSA communication: they close with a clear Problem, Signal, Action structure. Not "we're investigating," but "iOS DAU dropped 40%, notification delivery rate is at 0%, we're rolling back the notification service config."
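The segmentation signal can be made concrete with a small sketch that attributes the total change to each segment. It assumes the segments partition the user base, so the per-segment counts sum to the total; the function name `drop_contributions` and the platform counts are hypothetical, chosen to mirror the 40% iOS drop in the example above.

```python
def drop_contributions(before, after):
    """Share of the total change attributable to each segment.

    before/after: dicts mapping segment name -> metric count. The segments
    are assumed to be mutually exclusive and collectively exhaustive, so the
    shares returned here sum to 1.
    """
    if before.keys() != after.keys():
        raise ValueError("segments differ between periods")
    total_change = sum(after.values()) - sum(before.values())
    if total_change == 0:
        raise ValueError("no net change to attribute")
    return {
        seg: (after[seg] - before[seg]) / total_change
        for seg in before
    }


# Hypothetical DAU by platform, before and after the drop.
before = {"ios": 500_000, "android": 400_000, "web": 100_000}
after = {"ios": 300_000, "android": 398_000, "web": 102_000}

shares = drop_contributions(before, after)
worst_segment = max(shares, key=shares.get)
```

With these numbers iOS accounts for essentially the whole decline, which narrows the hypothesis space to something falsifiable, such as checking the iOS notification delivery rate. A negative share (web here) marks a segment that moved against the overall change.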
A candidate who says "I'd check whether this is a data quality issue before forming any product hypothesis" will stand out in most interviews.