A scenario-based walkthrough of how senior engineers respond to production incidents, from the first alert through mitigation and communication. It covers blast radius assessment, the investigate-before-fix discipline, escalation decision trees, and the communication cadence that separates engineers who handle incidents well from those who compound them.

45 min read · 2 sections · 1 interview question

Incident Response, On-Call, Production Engineering, SRE, Observability, Escalation, Root Cause Analysis, Communication, Runbooks, Postmortem, SLOs, Latency Debugging, P99, Error Rates

The Scenario

It is 2:17am. Your phone buzzes. PagerDuty: "P99 latency on checkout-service spiked from 80ms to 4,200ms. SLO burn rate: 12x. Alert fired for 3 minutes."

You are the on-call engineer. You have never seen this specific failure before. What do you do in the next 30 minutes?

This is one of the most common senior engineering interview scenarios, and it comes up not just in SRE interviews but in interviews for any L5+ role that touches production. The interviewer is not looking for the "correct" answer; they are watching your process. Do you panic? Do you fix first and ask questions later? Do you go silent? Or do you move systematically through a framework, communicate proactively, and make decisions with incomplete information?
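To put the alert's "SLO burn rate: 12x" in perspective, here is a minimal sketch of the arithmetic behind that number. The scenario does not state the SLO target or the budget window, so the 99.9% target, the 30-day window, and the 1.2% bad-request fraction below are assumptions chosen only to reproduce a 12x figure. The function names are hypothetical, not part of any monitoring tool.

```python
def burn_rate(observed_bad_fraction: float, slo_target: float) -> float:
    """Ratio of the observed bad-event rate to the rate the SLO allows.

    A burn rate of 1.0 spends the error budget exactly over the full window.
    """
    error_budget = 1.0 - slo_target  # e.g. 0.1% of requests for a 99.9% SLO
    return observed_bad_fraction / error_budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full window's budget is gone at a constant burn rate."""
    return window_days * 24 / rate


if __name__ == "__main__":
    # Assumed values: 99.9% latency SLO, 1.2% of requests currently too slow.
    rate = burn_rate(observed_bad_fraction=0.012, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget exhausted in: {hours_to_exhaustion(12.0):.0f} hours")
```

Under these assumptions, a sustained 12x burn would consume a 30-day error budget in about 60 hours. That is why the page warrants a prompt, structured response, yet still leaves room to investigate before applying a fix.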
