Data Pipelines for ML: Batch, Streaming, and Event Architecture
How production ML data pipelines are actually built — Kafka for event collection, Spark for batch feature engineering, Flink for real-time aggregations, and the architectural decisions that determine whether your model trains on fresh or stale data.
Why ML Data Pipelines Are Different from Regular ETL
Standard ETL moves data from A to B on a schedule. ML data pipelines do that and must guarantee three additional properties that traditional data engineering ignores:
Point-in-time correctness: Features used for training must reflect the world state at prediction time — not today's state. A standard ETL pipeline overwrites the latest value. An ML pipeline must preserve historical values for retrospective training.
Label alignment: You need to join features to labels, but labels often arrive with delay (fraud chargebacks take 30 days, downstream conversion takes 7 days). The pipeline must handle the temporal gap between when the prediction was made and when the label arrives.
Training-serving consistency: The same transformation logic must run in both the batch training path and the real-time serving path. Different codepaths → training-serving skew.
These requirements force a specific architecture: Kafka for event collection → Flink for real-time aggregations → offline store for historical features → point-in-time join for training data generation.
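To make point-in-time correctness concrete, here is a minimal PySpark sketch of the point-in-time join used to generate training data: for each prediction, keep only feature values computed at or before prediction time, then take the most recent one. Table and column names (labels, features, prediction_ts, feature_ts) are illustrative.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pit-join").getOrCreate()

labels_df = spark.table("labels")      # user_id, prediction_ts, label
features_df = spark.table("features")  # user_id, feature_ts, purchase_count_30d

joined = (
    labels_df.join(features_df, on="user_id", how="left")
    # Only feature values known at prediction time -- no future leakage
    .where(F.col("feature_ts") <= F.col("prediction_ts"))
)

# Keep the latest qualifying feature value per prediction
w = Window.partitionBy("user_id", "prediction_ts").orderBy(F.col("feature_ts").desc())
training_data = (
    joined.withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)
    .drop("rn", "feature_ts")
)
```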
Clarifying Questions Before Designing an ML Data Pipeline
What is the feature freshness requirement?
Batch features (daily Spark jobs) work for low-velocity signals like user demographics or item embeddings. Streaming features (Flink with <1 min lag) are required for fraud, notifications, and real-time ranking. Your answer determines whether you need Kafka+Flink or just Spark.
What is the label delay and what is the training frequency?
Fraud chargebacks take 30 days. Ad conversions take 7 days. Notification clicks arrive within 1 hour. If label delay > training cadence, you need a pending-label queue and a finalization job, not just an ETL pipeline.
What scale? (events/sec and training data volume)
A few hundred events/sec → Kinesis or Kafka with 1-2 Flink jobs. 10K+ events/sec → Kafka cluster with partitioned Flink topology. Training data in terabytes/day → Spark on HDFS/S3 with partitioned parquet. Training data in petabytes → distributed Spark cluster with columnar storage (Delta Lake / Iceberg).
Is training-serving consistency critical for this feature?
For a recommendation system, a serving-time aggregation feature (session_click_count_1h) computed differently from its training-time counterpart causes silent accuracy degradation. Flag features that require byte-identical computation between training and serving, and confirm these will be managed through a shared Feature Store rather than separate code paths.
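One way to guarantee this without a full feature store is to keep a single transformation definition that both paths import. A minimal sketch, with hypothetical module and function names:

```python
# features/session.py -- one definition imported by BOTH the Spark training
# job and the real-time serving path. Names here are illustrative.
from datetime import datetime, timedelta

def session_click_count_1h(click_timestamps: list[datetime], now: datetime) -> int:
    """Number of clicks in the trailing 1-hour window."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for ts in click_timestamps if ts > cutoff)

# Training (batch): applied per row over historical event logs.
# Serving (online): applied to recent events fetched from the online store.
# Because both paths call the same function, the computation is identical
# by construction -- the property a feature store enforces at scale.
```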
What is the tolerance for data loss vs latency?
Kafka's default is at-least-once delivery — you may get duplicate events. Exactly-once delivery requires idempotent producers and transactions, and increases latency by 2-5×. For ML training data, at-least-once with deduplication is almost always the right tradeoff — a small number of duplicate training examples doesn't hurt model quality.
ML Data Pipeline Architecture — From Events to Training Data
Kafka — The Event Bus Everything Connects To
Kafka is the foundational piece of every production ML data pipeline because it decouples producers (apps) from consumers (Flink, Spark, S3 sinks, monitoring) without any of them knowing about each other. This matters for ML because you can add a new consumer (a new model's feature pipeline) without touching the app.
Key configuration decisions:
Topic partitioning: Partition by entity_id (user_id, item_id) so events for the same user always go to the same partition. This is critical for Flink stateful aggregations — Flink reads a partition and maintains state per user. If events for the same user scatter across partitions, each Flink task would need the full user state, which doesn't scale.
Retention period: The default of 7 days is usually insufficient for ML. Use 30–90 days so that if your S3 sink fails, you can replay the raw events from Kafka. This is your disaster recovery for the raw event log.
Schema Registry: Every topic must have a schema (Avro or Protobuf), registered in Confluent Schema Registry. This prevents producers from silently breaking downstream consumers with schema changes. Schema Registry enforces backward/forward compatibility rules before a producer can publish.
Throughput: A single Kafka broker handles ~1M messages/second at 1KB messages. For ML event streams (clicks, views, purchases), a 3-broker cluster is typical for a mid-scale company (100M events/day).
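As a concrete illustration of the partitioning decision, here is a hedged sketch using the confluent-kafka Python client. Broker addresses, the topic name, and the JSON serialization (a real pipeline would use Avro with Schema Registry) are illustrative.

```python
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
    "enable.idempotence": True,  # broker-side dedup of producer retries
    "acks": "all",               # ack only after all replicas have the event
})

event = {"user_id": "u_123", "event_type": "click", "item_id": "i_456"}
producer.produce(
    topic="user_events",
    key=event["user_id"].encode(),     # same key -> same partition, so all of
                                       # a user's events land together
    value=json.dumps(event).encode(),  # in production: Avro + Schema Registry
)
producer.flush()
```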
Spark vs Flink — When to Use Each
The most common confusion in ML data pipeline design: candidates use Spark for everything, including jobs that should be Flink, or vice versa.
Use Spark for:
- Daily/hourly batch feature engineering (compute user's 30-day purchase history, join multiple tables)
- Training dataset generation (point-in-time joins over petabytes)
- Historical backfills (recomputing features for 2 years of data)
- Spark's strength: massive parallelism for bounded datasets, rich SQL support, DataFrames, MLlib integration
Use Flink for:
- Continuous real-time aggregations (transactions in last 5 minutes, session-level features)
- Features that must be fresh within 30 seconds to 1 minute
- Exactly-once semantics for stateful computations (a first-class Flink guarantee; Spark Streaming defaults to at-least-once)
- Event-time processing with out-of-order event handling (Watermarks)
The key difference: Flink is truly streaming — it processes each event as it arrives and maintains persistent state. Spark Structured Streaming processes micro-batches (100ms–30s windows). For features needing sub-minute freshness with exactly-once guarantees, Flink is the correct choice. For everything else, Spark is simpler to operate and has better tooling for batch workloads.
What most candidates get wrong: treating Flink as 'faster Spark.' Flink maintains state between events (e.g., 'number of transactions this hour' updates on every transaction). Spark micro-batch recomputes from scratch every batch interval. For large state, Flink is far more efficient.
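For contrast with the Flink pipeline shown later, here is a minimal PySpark sketch of the daily batch feature job described above. Table names, columns, and the output path are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-user-features").getOrCreate()

purchases = spark.table("purchases")  # user_id, amount, purchase_date

user_features = (
    purchases
    .where(F.col("purchase_date") >= F.date_sub(F.current_date(), 30))
    .groupBy("user_id")
    .agg(
        F.count("*").alias("purchase_count_30d"),
        F.sum("amount").alias("purchase_amount_30d"),
        F.avg("amount").alias("avg_order_value_30d"),
    )
    .withColumn("ds", F.current_date())  # snapshot date for point-in-time joins
)

# Append today's snapshot -- preserving history, never overwriting it
user_features.write.mode("append").partitionBy("ds").parquet(
    "s3://features/user_purchase_30d/"
)
```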
Batch vs Streaming Trade-offs for ML Feature Pipelines
| Dimension | Batch (Spark) | Micro-batch (Spark Streaming) | True Streaming (Flink) |
|---|---|---|---|
| Feature freshness | Hours to days | Minutes (100ms–30s batches) | Seconds to sub-minute |
| Consistency guarantee | Exactly-once (bounded data) | At-least-once (default) | Exactly-once (with checkpointing) |
| State management | Stateless per batch | Limited (stateful with caveats) | Rich stateful (RocksDB backend) |
| Operational complexity | Low — batch jobs are simple | Medium | High — checkpoint tuning, backpressure management |
| Cost profile | Cheap during off-peak, spiky | Steady, moderate | Steady, moderate to high |
| Best for | Training dataset generation, daily features | Near-real-time features, 1–5 min freshness | Velocity features, sub-minute freshness, fraud detection |
The Label Delay Problem — Why Your Training Data Is Always Stale
Most ML pipelines generate training data immediately after an event occurs. But ground truth labels almost always arrive with a delay:
- Fraud: chargeback takes 30–90 days to resolve
- Ad conversion: purchase may happen 7 days after click
- Content recommendation: explicit ratings are sparse; watch time serves as an implicit label collected within the session, but the full engagement signal develops over hours
If you train on data from the last 3 days and your labels take 7 days to arrive, you're training on mostly unlabeled or incorrectly labeled data. Fix: implement a 'label collection delay' in your training pipeline. Don't use examples until labels have had time to arrive. For fraud: train on events from 90+ days ago, whose labels are now fully resolved. For ad conversion: use a 7-day attribution window — only count conversions within 7 days of the click as positive labels.
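A minimal PySpark sketch of the attribution-window approach for ad conversions: train only on clicks at least 7 days old, labeling each by whether a conversion occurred within 7 days. Table and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("label-join").getOrCreate()

clicks = spark.table("ad_clicks")         # user_id, ad_id, click_ts
conversions = spark.table("conversions")  # user_id, ad_id, conversion_ts

# Only clicks whose attribution window has fully closed -- labels are mature
mature_clicks = clicks.where(
    F.col("click_ts") < F.current_timestamp() - F.expr("INTERVAL 7 DAYS")
)

labeled = (
    mature_clicks.join(conversions, on=["user_id", "ad_id"], how="left")
    .withColumn(
        "label",
        (
            F.col("conversion_ts").isNotNull()
            & (F.col("conversion_ts") <= F.col("click_ts") + F.expr("INTERVAL 7 DAYS"))
        ).cast("int"),
    )
    # If a user can convert repeatedly, dedupe on (user_id, ad_id, click_ts) here
    .drop("conversion_ts")
)
```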
Alternatively: use surrogate labels that arrive immediately (click, add-to-cart) as proxies for the delayed label (purchase). But be careful — optimizing click rate can hurt purchase rate (clickbait problem).
Data Quality — The Unglamorous Problem That Kills Models
Bad data is the most common cause of production ML failures, but it's the topic candidates skip in interviews because it's not architecturally glamorous.
Silent schema changes: A mobile app release changes an event field from string to integer. The Schema Registry rejects it if you've configured backward compatibility rules. Without Schema Registry: the Flink job silently reads null for all future events from that field, and a feature that was 99% populated becomes 0% populated overnight. Your model's recall drops. No error in your logs.
Null rates: Monitor the null rate of every feature. Sudden jumps in null rate (e.g., user_age goes from 2% null to 30% null) indicate upstream data quality issues. Set alerts: if null rate for a feature changes by > 5 percentage points vs 7-day rolling average, trigger an investigation.
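A minimal sketch of that alert in pandas, assuming a daily_null_rates DataFrame with one row per (date, feature) and a null_rate column in [0, 1]:

```python
import pandas as pd

def null_rate_alerts(daily_null_rates: pd.DataFrame, threshold_pp: float = 5.0) -> pd.DataFrame:
    """Return rows where today's null rate jumps >threshold_pp percentage
    points vs the trailing 7-day average (excluding the current day)."""
    df = daily_null_rates.sort_values(["feature", "date"]).copy()
    df["rolling_avg"] = df.groupby("feature")["null_rate"].transform(
        lambda s: s.rolling(7, min_periods=7).mean().shift(1)
    )
    df["delta_pp"] = (df["null_rate"] - df["rolling_avg"]) * 100
    return df[df["delta_pp"].abs() > threshold_pp]
```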
Event deduplication: Mobile events are sent with retry logic. The same event may arrive 2–3 times. Without deduplication, velocity features are inflated. Solution: every event has a unique event_id. Flink deduplicates by tracking seen event_ids from the last 24 hours in keyed state (e.g., a bloom filter or a TTL'd set). Alternatively, Kafka's idempotent producer, combined with transactions, deduplicates retries at the broker level.
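If you'd rather not run dedup inside Flink, a common alternative is an atomic Redis SET NX EX check keyed by event_id. A sketch, with the downstream handler left hypothetical:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def is_duplicate(event_id: str, ttl_seconds: int = 86400) -> bool:
    # set(..., nx=True) succeeds only if the key did NOT already exist;
    # ex= gives the 24-hour dedup window
    first_time = r.set(f"seen:{event_id}", 1, nx=True, ex=ttl_seconds)
    return not first_time

def process(event: dict) -> None:
    if is_duplicate(event["event_id"]):
        return  # drop the retry so velocity features aren't inflated
    update_velocity_features(event)  # hypothetical downstream handler
```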
Distribution drift in training data: Features computed on January data may have very different distributions in July (seasonality, product changes). Monitor training data distribution before kicking off a training run. If P90 of transaction_amount shifts by > 20%, investigate before training.
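A minimal sketch of the P90 check, assuming reference and current are pandas Series holding the feature's values in the two windows:

```python
import pandas as pd

def p90_shift_ok(reference: pd.Series, current: pd.Series, max_shift: float = 0.20) -> bool:
    """True if the P90 of the current window is within max_shift (relative)
    of the reference window's P90."""
    ref_p90 = reference.quantile(0.90)
    cur_p90 = current.quantile(0.90)
    return abs(cur_p90 - ref_p90) / abs(ref_p90) <= max_shift

# Usage gate before a training run (DataFrames jan_df/jul_df are illustrative):
# assert p90_shift_ok(jan_df["transaction_amount"], jul_df["transaction_amount"]), \
#     "P90 of transaction_amount shifted >20% -- investigate before training"
```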
Flink Real-Time Velocity Feature Pipeline
```python
from pyflink.common import Duration, Time, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ReduceFunction
from pyflink.datastream.window import SlidingEventTimeWindows


class TransactionCountReducer(ReduceFunction):
    """Count transactions and sum amounts in a sliding window, keyed by card_id."""

    def reduce(self, value1, value2):
        return {
            "card_id": value1["card_id"],
            "count": value1["count"] + value2["count"],
            "total_amount": value1["total_amount"] + value2["total_amount"],
        }


def build_velocity_pipeline(env: StreamExecutionEnvironment):
    # Kafka source: partition by card_id so state is co-located with events.
    # KafkaSource here is a simplified placeholder for the real connector
    # (KafkaSource.builder()...build() + env.from_source in recent PyFlink).
    transaction_stream = env.add_source(
        KafkaSource("transactions_topic", partition_key="card_id")
    )

    # 5-minute sliding window advancing every 30 seconds: emits updated
    # features every 30s, each covering the last 5 minutes of events
    velocity_features = (
        transaction_stream
        # Tolerate events arriving up to 10s out of order; a real job would
        # also attach .with_timestamp_assigner(...) to extract event time
        .assign_timestamps_and_watermarks(
            WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(10))
        )
        .map(lambda e: {"card_id": e["card_id"], "count": 1, "total_amount": e["amount"]})
        .key_by(lambda e: e["card_id"])
        .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(30)))
        .reduce(TransactionCountReducer())
    )

    # Sink to Redis (online store) and S3 (offline store for point-in-time
    # joins). RedisSink and S3ParquetSink are placeholders for real connectors.
    velocity_features.add_sink(RedisSink(key_prefix="velocity:"))
    velocity_features.add_sink(S3ParquetSink("s3://features/velocity/"))
    return velocity_features

# NOTE: Use a 1-5 min checkpoint interval with the RocksDB state backend for
# exactly-once. At-least-once is NOT acceptable for deduplication use cases.
```
Interview Signal: What Separates Good Candidates
Most candidates describe a generic 'data pipeline.' Strong candidates address: (1) How does the pipeline handle late-arriving events? (Flink watermarks with a bounded out-of-orderness of 10–30 seconds). (2) How do you ensure exactly-once processing for velocity features so fraud scores aren't inflated by duplicate events? (Flink checkpointing + Kafka transactions). (3) How do you generate training data without leaking future information? (Point-in-time joins using the offline store's historical values). (4) How does the serving feature computation stay consistent with the training feature computation? (Same feature definition registered in the feature store, executed by the same framework). Mention these four points and you've covered what 95% of candidates miss.