Skip to main content

Preview — Pro guide

You are seeing a portion of this guide. Sign in and upgrade to unlock the full article, quizzes, and interview answers.

Metric Anomaly Triage: Is This a Real Problem or an Instrumentation Bug?

Structured methodology for diagnosing sudden metric drops or spikes before escalating. Covers the validation-first approach, driver-tree decomposition, correlated metric analysis, and the discipline of scoping anomalies before jumping to product explanations. Tested at senior and staff IC levels at every major tech company.

40 min read 3 sections 1 interview questions
Metric DebuggingData QualityIncident ResponseDriver Tree AnalysisData PipelinesInstrumentationProduction DebuggingOn-callObservabilityData LiteracyAnomaly DetectionETLRoot Cause Analysis

The Scenario That Trips Up Most Engineers

It's 9am. Your PM Slacks you: "DAU dropped 15% overnight. We're freaking out. What happened?"

The wrong response: Open the product dashboards and start generating hypotheses about which feature caused it.

The right response: Before touching the product at all, answer one question — is the measurement trustworthy?

The single most common cause of apparent metric drops at mature companies is not a product problem — it's a data pipeline problem. Logging outages, schema changes, deduplication logic shifts, timezone misalignments, and ETL delays cause the majority of "metric anomalies" that page engineers at 2am. Engineers who don't know this spend hours debugging a product that is working fine.

This matters in interviews because senior engineers are expected to validate the signal before investigating the cause. Candidates who jump straight to product hypotheses reveal they have not been on-call in a production environment.

TIP

What Interviewers Are Testing

L4/Mid signal: Can you list what you would check? Do you remember to look at recent deploys?

L5/Senior signal: Do you validate the instrumentation before the product? Do you decompose the metric into components to isolate the break? Do you know which correlated signals to check — e.g., if server requests are flat but DAU dropped, the problem is in measurement, not product?

Staff signal: Can you distinguish a data pipeline latency artifact from a true business metric change? Do you know how the metric is computed at the infrastructure level? Do you proactively communicate to stakeholders while investigating rather than going dark?

IMPORTANT

Premium content locked

This guide is premium content. Upgrade to Pro to unlock the full guide, quizzes, and interview Q&A.