Production Engineering

Production Engineering · 40 min

6 questions

Code Review Excellence: The Craft That Most Engineers Never Learn

A concrete methodology for giving and receiving code review that improves code quality, accelerates team growth, and does not create adversarial dynamics. Covers the review hierarchy (correctness before style), how to write feedback that gets implemented, the approve-vs-block decision framework, reviewing for future maintainers, and the anti-patterns — LGTM culture, nitpick spirals, ego-driven rejections — that poison engineering team velocity.

Code Review, Engineering Craft, Technical Leadership, Feedback, +10

Production Engineering · 50 min

What Good Code Actually Looks Like: Engineering Craft Beyond the Linter

A practical framework for the six dimensions of code quality that interviewers and senior engineers actually evaluate: correctness, clarity, testability, error handling, performance awareness, and maintainability. Covers naming as design, the rule of three, when to comment and what to say, the hidden cost of cleverness, and the before/after transformations that turn adequate code into production-grade code.

Code Quality, Engineering Craft, Clean Code, Naming, +10

Production Engineering · 34 min

GitHub End-to-End Workflow: From Issue to Safe Production Merge

A production-grade GitHub workflow for software teams: issue scoping, branch strategy, pull request quality gates, CI policy, review dynamics, and merge/release discipline. Learn what interviewers expect when they ask how you ship safely at speed in modern engineering organizations.

GitHub Workflow, Pull Requests, Branching Strategy, GitHub Actions, +8

Production Engineering · 40 min

Metric Anomaly Triage: Is This a Real Problem or an Instrumentation Bug?

Structured methodology for diagnosing sudden metric drops or spikes before escalating. Covers the validation-first approach, driver-tree decomposition, correlated metric analysis, and the discipline of scoping anomalies before jumping to product explanations. Tested at senior and staff IC levels at every major tech company.

Metric Debugging, Data Quality, Incident Response, Driver Tree Analysis, +9

Production Engineering · 55 min

Performance Profiling & Optimization: Python, JVM, Flame Graphs, and APM

End-to-end methodology for profiling production services — from Python cProfile and py-spy to JVM Flight Recorder and async-profiler, reading Brendan Gregg flame graphs, and correlating distributed traces in Datadog and OpenTelemetry. Covers CPU-bound vs I/O-bound identification, memory leak diagnosis, and the non-obvious insights that separate senior engineers from candidates who only know the tools exist.

Performance Profiling, Python cProfile, py-spy, memory_profiler, +10

Feature Flags, Feature Toggles, Kill Switch, Dark Launch, +10

Feature Flags: Safe Rollouts, Kill Switches, and the Dark Launch Pattern

A concrete framework for using feature flags as a production engineering tool — not just a product experimentation mechanism. Covers flag taxonomy (release flags vs. experiment flags vs. permission flags vs. ops flags), gradual rollout mechanics, kill switch design, the dark launch pattern for validating system behavior before user-facing traffic, flag cleanup debt, and the technical architecture of a flag evaluation system that adds under 1ms of latency.

Kubernetes, Production Operations, Canary Deployment, Blue Green, +6

Kubernetes Operations in Production: Safe Rollouts, Resource Controls, and Cluster Guardrails

Day-2 Kubernetes operations for production systems. Learn rollout strategies, readiness/liveness probe design, resource requests vs limits, RBAC boundaries, and PodDisruptionBudget safeguards used by strong platform teams.

Incident Response, On-Call, Production Engineering, SRE, +10

On-Call Incident Response: The First 30 Minutes

Scenario-based walkthrough of how senior engineers respond to production incidents — from the first alert through mitigation and communication. Covers blast radius assessment, the investigate-before-fix discipline, escalation decision trees, and the communication cadence that separates engineers who handle incidents well from those who compound them.

Postmortem, RCA, Root Cause Analysis, Incident Response, +10

Writing the Blameless Postmortem: RCAs That Actually Drive Change

A concrete methodology for writing postmortems that prevent recurrence instead of assigning blame. Covers causal chain construction, distinguishing proximate causes from systemic contributing factors, the 5 Whys failure modes, action item quality standards, and how to run the postmortem meeting without it collapsing into a blame session. The skill that separates engineers who learn from incidents from those who repeat them.

Production Engineering · 60 min

Security for Engineers: OWASP, Secrets, Supply Chain, and Least Privilege

Security fundamentals for engineers: OWASP Top 10, secrets management with HashiCorp Vault and AWS Secrets Manager, supply chain hardening via SBOM and Sigstore/cosign, and least-privilege IAM with IRSA workload identity. Covers SSRF, SQL injection, IDOR, and mTLS.

OWASP Top 10, Supply Chain Security, HashiCorp Vault, AWS Secrets Manager, +10

Production Engineering · 32 min

8 questions

Capacity Planning for Production Systems

A practical framework for forecasting load, setting headroom, and scaling capacity ahead of incidents. Covers demand modeling, uncertainty bands, and cost-reliability tradeoffs.

Capacity Planning, Forecasting, SRE, Headroom, +4

Production Engineering · 48 min

5 questions

Cloud Cost Optimization: From Runaway Bills to Unit Economics

Senior engineer playbook for cloud cost optimization: EC2 spot vs. reserved instances, S3 lifecycle tiers, inter-AZ egress, Karpenter bin packing, and FinOps chargeback. Real numbers from Netflix and Lyft. Essential for FAANG infrastructure interviews.

Cloud Cost Optimization, AWS Reserved Instances, EC2 Spot Instances, AWS Savings Plans, +10

Production Engineering · 30 min

6 questions

Technical Debt Triage: Prioritizing Fixes That Reduce Real Risk

A practical prioritization framework for technical debt: classify debt by risk and business impact, sequence remediation, and avoid roadmap derailment. Includes debt portfolio management patterns.

Technical Debt, Prioritization, Risk Management, Engineering Strategy, +4

Distributed Debugging, Causality, Partial Failure, Clock Skew, +5

Distributed Systems Debugging: Causality, Partial Failures, and Tracing-Driven Root Cause

A practical debugging framework for distributed production incidents. Covers happens-before reasoning, clock skew pitfalls, partition diagnosis, cascading failure patterns, and trace-first root cause workflows.

Production Engineering · 50 min

A/B Test Critique: Finding Flaws in Experiment Designs

Scenario-based exercises for identifying the seven most common A/B experiment design failures: insufficient power, wrong duration, novelty effects, network effects contaminating the control group, multiple testing without correction, surrogate metric selection, and holdout contamination. Tested at senior IC and staff levels at Meta, Google, Netflix, Airbnb, and Booking.com.

A/B Testing, Experimentation, Statistical Power, Novelty Effect, +10

Production Engineering · 50 min

SLO Design: Error Budgets, Burn Rate Alerts, and the Reliability Tradeoff

A concrete methodology for designing Service Level Objectives that balance reliability and velocity. Covers the SLI → SLO → SLA hierarchy, error budget arithmetic, burn rate alerting (the system Google uses in production), multi-window alert design, the reliability vs. feature velocity tradeoff, and the common SLO design mistakes that cause alert fatigue or miss real incidents. A staff-level differentiator in SRE and senior infrastructure interviews.

SLO, SLA, SLI, Error Budget, +11

Cloud Native, 12-Factor, Stateless Services, Multi Region, +6

Cloud-Native Production Patterns: Stateless Services, Regions, and Cost-Aware Resilience

A production engineering guide to cloud-native patterns that matter in interviews and real systems. Covers 12-factor constraints, stateless vs stateful boundaries, active-active vs active-passive, spot strategy, and egress-aware architecture decisions.

Engineering Judgment, Technical Debt, Refactoring, System Migration, +10

Rewrite vs. Refactor: How to Make the Call Without Destroying the Business

Framework for evaluating whether a legacy service should be rewritten from scratch or incrementally improved. Covers second system syndrome, the strangler fig pattern, hidden functionality risk, how to estimate rewrite cost realistically, and the conditions where a rewrite is genuinely the right choice. A classic senior and staff engineering judgment question.

Production Engineering · 38 min

System Migration in Production: Zero-Downtime Strategy and Risk Control

How to migrate critical systems safely in production using dual-write, backfill, shadow reads, and progressive cutover. Focuses on rollback design, data correctness, and organizational execution.

System Migration, Dual Write, Backfill, Cutover, +4