
ML System Design: Real-Time Anomaly Detection at Scale

Design a production real-time anomaly detection system for metrics, logs, and business KPIs — end-to-end. Covers the three gaps in most answers: pointwise z-scores miss *multivariate* failures, unsupervised models cause *alert fatigue* without severity and incident context, and streaming state (Flink keyed windows) must respect *event time* and *exactly-once* semantics for financial or SRE use cases. Includes Isolation Forest + robust baselines, suppression and correlation grouping, and wiring to paging with SLO burn.

55 min read · 3 sections · 1 interview question
Tags: Anomaly Detection, Apache Flink, Kafka, Streaming ML, Isolation Forest, Z-Score, Alert Fatigue, Seasonality, Prometheus, SRE, Change Point, Welford, Multivariate Outlier, Online Evaluation

Why Real-Time Anomaly Detection Is Not 'Train Isolation Forest on CSV'

Volume and velocity — Telemetry scales to millions of time-ordered events per minute. Nightly batch jobs discover outages only after revenue loss and SLO burn. Streaming detection must sit on the hot path of the observability pipeline: Kafka (or Pulsar) → stateful processor (Flink, Spark Structured Streaming, ksqlDB) → alert sink (PagerDuty, Opsgenie), with seconds of detection lag, not hours.
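
To make the hot path concrete, here is a minimal consumer-loop sketch in Python using kafka-python. The topic name, `score()`, and `notify()` are hypothetical placeholders, and a real deployment would run inside a stateful processor like Flink rather than a bare loop.

```python
import json
from kafka import KafkaConsumer

def score(event):
    """Hypothetical detector: return an anomaly score for one event."""
    return 0.0

def notify(event, s):
    """Hypothetical alert sink: route to PagerDuty/Opsgenie with severity."""
    print(f"ALERT score={s:.2f} event={event}")

THRESHOLD = 3.0  # tuned against the team's false-positive budget

consumer = KafkaConsumer(
    "metrics",  # assumed topic of JSON-encoded metric events
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    event = msg.value  # e.g. {"ts": ..., "name": "cpu", "value": 0.93}
    s = score(event)
    if s > THRESHOLD:
        notify(event, s)
```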

Non-stationarity — What counts as normal for a metric changes: daily/weekly seasonality, releases, marketing campaigns, holidays. A static threshold like "CPU > 80%" pages every Black Friday. Production systems separate the baseline (seasonal decomposition or rolling robust statistics) from residual scoring. Change-point methods address mean shifts; spectral methods appear for periodic data.
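
As one sketch of separating baseline from residual scoring: a rolling median/MAD baseline with a robust z-score on the residual. The window length, warm-up count, and 3.5 cutoff are illustrative, and a seasonal variant would baseline against same-time-of-day windows rather than a single trailing one.

```python
from collections import deque
import statistics

class RobustBaseline:
    """Rolling median/MAD baseline; scores the residual as a robust z-score."""

    def __init__(self, window: int = 1440):  # e.g. one day of 1-minute points
        self.buf = deque(maxlen=window)

    def update_and_score(self, x: float) -> float:
        if len(self.buf) < 30:            # warm-up: collect before scoring
            self.buf.append(x)
            return 0.0
        med = statistics.median(self.buf)
        mad = statistics.median(abs(v - med) for v in self.buf) or 1e-9
        self.buf.append(x)
        # 0.6745 rescales MAD so the score is ~N(0,1) for Gaussian data
        return 0.6745 * (x - med) / mad

detector = RobustBaseline()
# flag a residual anomaly when abs(detector.update_and_score(v)) > 3.5
```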

Multivariate structure — A single metric within its own control limit can look fine while the joint pattern is impossible (order rate up, payment success down, cart size flat — classic payment-rail degradation). Univariate z-scores per feature miss this; Isolation Forest, vector-autoregression residuals, or correlation-break detectors address correlated failure modes.
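
A minimal scikit-learn sketch of the payment-rail example: each row is a joint snapshot, so a probe that breaks the orders/success correlation scores as an outlier even though each coordinate stays within its marginal range. The synthetic data and numbers are illustrative, and since Isolation Forest splits on single axes, very subtle correlation breaks may still need residual-based detectors.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Normal regime: payment success moves with order rate; cart size independent.
orders = rng.normal(100, 10, 1000)
success = 0.90 + 0.002 * (orders - 100) + rng.normal(0, 0.005, 1000)
cart = rng.normal(3.2, 0.3, 1000)
X_train = np.column_stack([orders, success, cart])

model = IsolationForest(n_estimators=200, random_state=0).fit(X_train)

# Payment-rail degradation: orders up, success down, cart flat. Each value
# is within ~1.5 sigma of its marginal, but jointly it breaks the correlation.
probe = np.array([[115.0, 0.885, 3.2]])
print(model.decision_function(probe))  # lower than typical inlier scores
```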

Product risk: alert fatigue — Unsupervised detectors are sensitive; without suppression, deduplication, correlation groups, and severity routing, on-call quits. The ML problem is inseparable from incident-workflow design. Industry writeups (e.g. Confluent on ML_DETECT_ANOMALIES in Flink SQL) emphasize continuous, stream-native detection as the antidote to "wait for the warehouse" patterns.
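
A sketch of the workflow side (not any specific product's API): suppress duplicates per dedup key within a cooldown, and collect survivors into correlation groups so on-call gets one page per incident rather than one per metric. Keys and windows are illustrative.

```python
from collections import defaultdict

class AlertGate:
    def __init__(self, cooldown_s: float = 300.0):
        self.cooldown_s = cooldown_s
        self.last_fired = {}               # dedup_key -> last fire time
        self.groups = defaultdict(list)    # group_key -> alerts in incident

    def submit(self, dedup_key: str, group_key: str, alert: dict, now: float) -> bool:
        """Return True if the alert survives suppression and joins a group."""
        last = self.last_fired.get(dedup_key, float("-inf"))
        if now - last < self.cooldown_s:
            return False                   # duplicate within cooldown: drop
        self.last_fired[dedup_key] = now
        self.groups[group_key].append(alert)
        return True

gate = AlertGate()
# dedup per (metric, host); group per failure domain so one page covers a storm
gate.submit("latency|host-17", "checkout-service", {"score": 4.2}, now=1000.0)
gate.submit("latency|host-17", "checkout-service", {"score": 4.5}, now=1060.0)  # dropped
```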

IMPORTANT

What Interviewers Are Evaluating

Mid-level: Knows z-score and Isolation Forest. Proposes 'Kafka + model'.

Senior-level: Event time vs processing time, keyed state, window types, warm-up, false positive control, Isolation Forest retrain cadence, synthetic canaries, links to SLO / error budget.

Staff-level: Multivariate and graph anomaly; propensity for alert storms; cost of a false positive vs. MTTD; feedback from on-call (alert labels) into semi-supervised ranking; coordination with tracing (OpenTelemetry) for root-cause context, not just a score.
