Stream Processing Systems: Flink, Kafka Streams, Windows, and Exactly-Once
A system design deep dive into real-time stream processing architecture. Learn how to choose between Flink and Kafka Streams, design windowed aggregations, handle late data, and implement exactly-once semantics with production-grade failure recovery.
Why Stream Processing Is a Distinct HLD Skill
Queueing systems move events; stream processors compute on infinite event flows while preserving business correctness. That distinction is why this topic appears in senior and staff HLD interviews. If a candidate treats stream processing as "just consumers reading Kafka," they miss the core challenge.
The hard part is not parsing messages. It is answering a harder question: what is the correct count or balance when events arrive late, out of order, or duplicated? A production system that is fast but wrong is usually worse than a slightly slower system that is auditable and correct, especially for billing, fraud detection, and financial reconciliation paths.
Interviewers are looking for explicit reasoning about event time vs processing time, watermark policy, checkpoint cadence, and sink commit semantics. These are not implementation details; they are correctness contracts. Saying "exactly-once" without explaining state recovery and transactional or idempotent sinks is treated as a red flag.
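To make those contracts concrete, here is a minimal Flink DataStream sketch, assuming a hypothetical PaymentEvent type and source. It shows an event-time watermark policy with bounded out-of-orderness, a checkpoint cadence in exactly-once mode, and an explicit late-data allowance; the sink is a stand-in, since end-to-end exactly-once additionally requires a transactional or idempotent sink.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class PaymentAggregationJob {

    // Hypothetical event type; in practice this comes from your schema.
    public static class PaymentEvent {
        public String accountId;
        public long eventTimeMillis; // when the payment happened (event time)
        public double amount;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint cadence: snapshot operator state every 60s in exactly-once
        // mode, so recovery restores state and replays at most ~1 minute of input.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        DataStream<PaymentEvent> events = buildSource(env); // e.g., a Kafka source

        events
            // Watermark policy: tolerate events arriving up to 30s out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<PaymentEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((e, ts) -> e.eventTimeMillis))
            .keyBy(e -> e.accountId)
            // Event-time window: results are attributed to when events happened,
            // not when they were processed.
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            // Late-data policy: re-emit updated results for events up to 1 min late.
            .allowedLateness(Time.minutes(1))
            .sum("amount")
            // Stand-in sink; production exactly-once needs a transactional
            // or idempotent sink on top of checkpointed state.
            .print();

        env.execute("per-account payment totals");
    }

    // Placeholder: wire up a real source (for example, a KafkaSource) here.
    private static DataStream<PaymentEvent> buildSource(StreamExecutionEnvironment env) {
        throw new UnsupportedOperationException("plug in a real source");
    }
}
```

Each commented line maps to one of the contracts above: the watermark strategy is the out-of-orderness policy, the checkpoint interval is the recovery budget, and the window plus allowed lateness define what "the correct count" means for a given time range.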
Staff-level depth adds evolution strategy: how you move from v1 near-real-time dashboards to v2 correctness-critical pipelines, and how you recover safely when schema changes or backfills invalidate prior assumptions.