On-Call Incident Response: The First 30 Minutes
A scenario-based walkthrough of how senior engineers respond to production incidents, from the first alert through mitigation and communication. It covers blast radius assessment, the investigate-before-fix discipline, escalation decision trees, and the communication cadence that separates engineers who handle incidents well from those who compound them.
The Scenario
It is 2:17am. Your phone buzzes. PagerDuty: "P99 latency on checkout-service spiked from 80ms to 4,200ms. SLO burn rate: 12x. Alert fired for 3 minutes."
You are the on-call engineer. You have never seen this specific failure before. What do you do in the next 30 minutes?
This is one of the most common senior engineering interview scenarios. It comes up not just in SRE interviews but in interviews for any L5+ role that touches production. The interviewer is not looking for the "correct" answer; they are watching your process. Do you panic? Do you fix first and ask questions later? Do you go silent? Or do you move systematically through a framework, communicate proactively, and make decisions with incomplete information?
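To make the "SLO burn rate: 12x" figure in the alert concrete, here is a minimal sketch of the arithmetic behind it. It uses the common definition of burn rate as a multiple of the pace that would exactly exhaust the error budget by the end of the SLO window. The 30-day window and both function names are assumptions for illustration; the scenario does not state the window checkout-service actually uses.

```python
def hours_to_exhaust_budget(burn_rate: float, window_days: float = 30.0) -> float:
    """Hours until the error budget is gone if the burn rate holds steady.

    A burn rate of 1.0 uses up the budget exactly at the end of the window.
    The 30-day default is an assumed SLO window, not taken from the scenario.
    """
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days * 24.0 / burn_rate


def budget_consumed(burn_rate: float, elapsed_minutes: float,
                    window_days: float = 30.0) -> float:
    """Fraction of the window's error budget consumed after elapsed_minutes."""
    window_minutes = window_days * 24.0 * 60.0
    return burn_rate * elapsed_minutes / window_minutes


if __name__ == "__main__":
    # Values from the alert: 12x burn rate, firing for 3 minutes.
    print(hours_to_exhaust_budget(12.0))
    print(f"{budget_consumed(12.0, 3.0):.4%}")
```

Under these assumptions, a sustained 12x burn would exhaust a 30-day budget in 60 hours, while the first 3 minutes have consumed well under 1% of it. That gap is why there is time to investigate before acting, but not time to ignore the page.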