
Data Pipelines for ML: Batch, Streaming, and Event Architecture

How production ML data pipelines are actually built — Kafka for event collection, Spark for batch feature engineering, Flink for real-time aggregations, and the architectural decisions that determine whether your model trains on fresh or stale data.


Why ML Data Pipelines Are Different from Regular ETL

Standard ETL moves data from A to B on a schedule. ML data pipelines do that too, but must also guarantee three properties that traditional data engineering ignores:

Point-in-time correctness: Features used for training must reflect the world state at prediction time — not today's state. A standard ETL pipeline overwrites the latest value. An ML pipeline must preserve historical values for retrospective training.

Label alignment: You need to join features to labels, but labels often arrive with delay (fraud chargebacks take 30 days, downstream conversion takes 7 days). The pipeline must handle the temporal gap between when the prediction was made and when the label arrives.

Training-serving consistency: The same transformation logic must run in both the batch training path and the real-time serving path. Different codepaths → training-serving skew.

These requirements force a specific architecture: Kafka for event collection → Flink for real-time aggregations → offline store for historical features → point-in-time join for training data generation.
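Point-in-time correctness is easiest to see in code. Here is a minimal sketch of a point-in-time join using pandas' `merge_asof` (the column names and values are illustrative, not from any particular feature store): for each labeled prediction event, it picks the most recent feature value at or before the prediction timestamp, never a future one.

```python
import pandas as pd

# Feature snapshots over time -- the offline store's historical values
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "click_count_7d": [3, 9, 4],
}).sort_values("ts")

# Labeled prediction events we want to turn into training rows
labels = pd.DataFrame({
    "user_id": [1, 2],
    "ts": pd.to_datetime(["2024-01-04", "2024-01-03"]),
    "label": [1, 0],
}).sort_values("ts")

# merge_asof matches each label row to the latest feature value at or
# before the prediction timestamp -- never a future value, so no leakage.
train = pd.merge_asof(labels, features, on="ts", by="user_id")
print(train[["user_id", "label", "click_count_7d"]])
```

Note that user 1's training row gets the feature value from Jan 1 (3), not the later Jan 5 value (9), even though the Jan 5 value is what a naive "latest value" join would return.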

Clarifying Questions Before Designing an ML Data Pipeline

01

What is the feature freshness requirement?

Batch features (daily Spark jobs) work for low-velocity signals like user demographics or item embeddings. Streaming features (Flink with <1 min lag) are required for fraud, notifications, and real-time ranking. Your answer determines whether you need Kafka+Flink or just Spark.

02

What is the label delay and what is the training frequency?

Fraud chargebacks take 30 days. Ad conversions take 7 days. Notification clicks arrive within 1 hour. If label delay > training cadence, you need a pending-label queue and a finalization job, not just an ETL pipeline.
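A pending-label queue can be sketched in a few lines. This is an illustrative finalization pass (the event IDs, window, and negative-by-default convention are assumptions, not from the article): predictions wait in a pending set, and a periodic job emits only those whose label window has fully elapsed.

```python
from datetime import datetime, timedelta

LABEL_DELAY = timedelta(days=7)  # assumed attribution window
now = datetime(2024, 1, 20)

# Pending predictions: event_id -> (prediction timestamp, features)
pending = {
    "e1": (datetime(2024, 1, 5), {"amount": 120.0}),
    "e2": (datetime(2024, 1, 18), {"amount": 35.0}),  # too recent to finalize
}
# Labels that have arrived so far, keyed by event_id
labels = {"e1": 1}

finalized = []
for event_id, (ts, feats) in pending.items():
    if now - ts >= LABEL_DELAY:  # label window has fully elapsed
        # Missing label after the window closes => negative, by convention
        finalized.append({"event_id": event_id, **feats,
                          "label": labels.get(event_id, 0)})

print(finalized)  # only e1 is old enough to finalize
```

The key property: an example is never emitted before its label has had the full window to arrive, so a missing label genuinely means "negative" rather than "not yet known".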

03

What scale? (events/sec and training data volume)

A few hundred events/sec → Kinesis or Kafka with 1-2 Flink jobs. 10K+ events/sec → Kafka cluster with partitioned Flink topology. Training data in terabytes/day → Spark on HDFS/S3 with partitioned parquet. Training data in petabytes → distributed Spark cluster with columnar storage (Delta Lake / Iceberg).

04

Is training-serving consistency critical for this feature?

For a recommendation system, serving features computed from aggregation windows (session_click_count_1h) that differ from training computation will cause silent accuracy degradation. Flag features that require byte-identical computation between training and serving, and confirm these will be managed through a shared Feature Store rather than separate code paths.
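The feature-store approach reduces to a simple discipline: one feature definition, called from both paths. A minimal sketch (the function name mirrors the feature above; the calling convention is an assumption):

```python
from datetime import datetime, timedelta

def session_click_count_1h(click_times, now):
    """Single feature definition shared by training and serving.
    The counting logic lives in exactly one place, so both paths agree."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in click_times if t > cutoff)

now = datetime(2024, 1, 1, 12, 0)
clicks = [now - timedelta(minutes=m) for m in (5, 20, 90)]

# The batch (training) path and the online (serving) path both call the
# same function -- byte-identical computation by construction.
print(session_click_count_1h(clicks, now))  # clicks at -5m and -20m count
```

In a real feature store the definition is registered once and executed by the store's own engine in both contexts; the sketch just shows the invariant that makes skew impossible.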

05

What is the tolerance for data loss vs latency?

Kafka default is at-least-once delivery — you may get duplicate events. Exactly-once delivery requires idempotent consumers and increases latency by 2-5×. For ML training data, at-least-once with deduplication is almost always the right tradeoff — a small number of duplicate training examples doesn't hurt model quality.

ML Data Pipeline Architecture — From Events to Training Data

(Architecture diagram: events flow from application producers into Kafka, fan out to Flink for streaming aggregations and Spark for batch feature engineering, land in the offline store, and are joined point-in-time into training data.)

Kafka — The Event Bus Everything Connects To

Kafka is the foundational piece of every production ML data pipeline because it decouples producers (apps) from consumers (Flink, Spark, S3 sinks, monitoring) without any of them knowing about each other. This matters for ML because you can add a new consumer (a new model's feature pipeline) without touching the app.

Key configuration decisions:

Topic partitioning: Partition by entity_id (user_id, item_id) so events for the same user always go to the same partition. This is critical for Flink stateful aggregations — Flink reads a partition and maintains state per user. If events for the same user scatter across partitions, each Flink task would need the full user state, which doesn't scale.
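The "same key, same partition" property comes from deterministic hashing of the message key. Kafka's default partitioner uses murmur2 on the key bytes; the sketch below uses md5 purely to illustrate the property (partition count and key are assumptions):

```python
import hashlib

NUM_PARTITIONS = 12  # assumed topic partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic key -> partition mapping, illustrating what Kafka's
    default partitioner does (it uses murmur2; md5 here is a stand-in)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by user_42 lands on one partition, so the Flink task
# reading that partition can keep all of user_42's state locally.
p1 = partition_for("user_42")
p2 = partition_for("user_42")
assert p1 == p2
```

If the producer sent events with a null key instead, Kafka would spread them round-robin across partitions and the per-user locality that Flink's keyed state relies on would be lost.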

Retention period: The default 7 days is usually insufficient for ML. Use 30–90 days so that if your S3 sink fails, you can replay the topic. This is your disaster recovery for the raw event log.

Schema Registry: Every topic must have a schema (Avro or Protobuf), registered in Confluent Schema Registry. This prevents producers from silently breaking downstream consumers with schema changes. Schema Registry enforces backward/forward compatibility rules before a producer can publish.
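A simplified sketch of the backward-compatibility rule the registry enforces (real Avro schema resolution also checks type promotions and aliases; this covers only the common "added field" case, and the field dictionaries are illustrative):

```python
def backward_compatible(old_fields, new_fields):
    """Simplified Avro backward-compatibility check: a reader using the
    new schema can decode data written with the old schema only if every
    field the new schema adds carries a default value."""
    old_names = {f["name"] for f in old_fields}
    return all("default" in f for f in new_fields
               if f["name"] not in old_names)

old = [{"name": "user_id", "type": "string"}]
ok  = old + [{"name": "age", "type": "int", "default": 0}]  # has default
bad = old + [{"name": "age", "type": "int"}]                # no default

assert backward_compatible(old, ok) is True
assert backward_compatible(old, bad) is False
```

This is why a producer adding a required field with no default gets rejected at publish time instead of breaking every downstream consumer at read time.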

Throughput: A single Kafka broker handles ~1M messages/second at 1KB messages. For ML event streams (clicks, views, purchases), a 3-broker cluster is typical for a mid-scale company (100M events/day).

Batch vs Streaming Trade-offs for ML Feature Pipelines

| Dimension | Batch (Spark) | Micro-batch (Spark Streaming) | True Streaming (Flink) |
|---|---|---|---|
| Feature freshness | Hours to days | Minutes (100ms–30s batches) | Seconds to sub-minute |
| Consistency guarantee | Exactly-once (bounded data) | At-least-once (default) | Exactly-once (with checkpointing) |
| State management | Stateless per batch | Limited (stateful with caveats) | Rich stateful (RocksDB backend) |
| Operational complexity | Low — batch jobs are simple | Medium | High — checkpoint tuning, backpressure management |
| Cost profile | Cheap during off-peak, spiky | Steady, moderate | Steady, moderate to high |
| Best for | Training dataset generation, daily features | Near-real-time features, 1–5 min freshness | Velocity features, sub-minute freshness, fraud detection |
⚠ WARNING

The Label Delay Problem — Why Your Training Data Is Always Stale

Most ML pipelines generate training data immediately after an event occurs. But ground truth labels almost always arrive with a delay:

  • Fraud: chargeback takes 30–90 days to resolve
  • Ad conversion: purchase may happen 7 days after click
  • Content recommendation: explicit ratings are sparse; implicit watch-time labels are collected within a session, but the full engagement signal develops over hours

If you train on data from the last 3 days and your labels take 7 days to arrive, you're training on mostly unlabeled or incorrectly labeled data. Fix: implement a 'label collection delay' in your training pipeline. Don't use examples until labels have had time to arrive. For fraud: train on events from 90+ days ago, whose labels are now fully resolved. For ad conversion: use a 7-day attribution window — only count conversions within 7 days of the click as positive labels.
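The attribution-window rule is a one-liner once you have the click and conversion timestamps. A minimal sketch (the function and window constant are illustrative):

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(days=7)  # the 7-day window described above

def label_click(click_ts, conversion_ts):
    """Positive only if a conversion landed inside the attribution window
    after the click; a late or missing conversion is a negative."""
    if conversion_ts is None:
        return 0
    return int(timedelta(0) <= conversion_ts - click_ts <= ATTRIBUTION_WINDOW)

click = datetime(2024, 1, 1)
assert label_click(click, datetime(2024, 1, 3)) == 1   # converted on day 2
assert label_click(click, datetime(2024, 1, 15)) == 0  # past the window
assert label_click(click, None) == 0                   # never converted
```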

Alternatively: use surrogate labels that arrive immediately (click, add-to-cart) as proxies for the delayed label (purchase). But be careful — optimizing click rate can hurt purchase rate (clickbait problem).

Data Quality — The Unglamorous Problem That Kills Models

Bad data is the most common cause of production ML failures, but it's the topic candidates skip in interviews because it's not architecturally glamorous.

Silent schema changes: A mobile app release changes an event field from string to integer. The Schema Registry rejects it if you've configured backward compatibility rules. Without Schema Registry: the Flink job silently reads null for all future events from that field, and a feature that was 99% populated becomes 0% populated overnight. Your model's recall drops. No error in your logs.

Null rates: Monitor the null rate of every feature. Sudden jumps in null rate (e.g., user_age goes from 2% null to 30% null) indicate upstream data quality issues. Set alerts: if null rate for a feature changes by > 5 percentage points vs 7-day rolling average, trigger an investigation.
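The alert logic above is simple enough to sketch directly (the feature values and 5-percentage-point threshold mirror the example; the function names are illustrative):

```python
def null_rate(values):
    """Fraction of missing values in a feature column."""
    return sum(v is None for v in values) / len(values)

def should_alert(current_rate, rolling_7d_rate, threshold_pp=5.0):
    """Alert when the null rate moves more than `threshold_pp` percentage
    points from the 7-day rolling average, in either direction."""
    return abs(current_rate - rolling_7d_rate) * 100 > threshold_pp

todays_user_age = [34, None, 29, None, None, 41, None, None, 25, None]
rate = null_rate(todays_user_age)         # 6 of 10 missing -> 0.6
assert should_alert(rate, 0.02) is True   # 2% baseline -> 60% today: alert
assert should_alert(0.04, 0.02) is False  # a 2pp move is within tolerance
```

Checking against a rolling average rather than a fixed threshold matters: some features are legitimately 30% null, and what you want to catch is the sudden change, not the level.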

Event deduplication: Mobile events are sent with retry logic, so the same event may arrive 2–3 times. Without deduplication, velocity features are inflated. Solution: every event carries a unique event_id, and the Flink job deduplicates by keeping a filter (typically a bloom filter in keyed state) of event_ids seen in the last 24 hours. Alternatively, Kafka's idempotent producer prevents duplicates at the broker level, and transactions extend this to exactly-once semantics across the pipeline.
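A toy version of the dedup state makes the behavior concrete. This sketch uses an exact set with a 24-hour eviction window in place of the bloom filter a real Flink job would keep (the class and method names are illustrative):

```python
from datetime import datetime, timedelta

class Deduplicator:
    """Exact-set stand-in for a bloom filter of recently seen event_ids.
    Events seen within the TTL window are dropped as duplicates."""
    def __init__(self, ttl=timedelta(hours=24)):
        self.ttl = ttl
        self.seen = {}  # event_id -> first-seen timestamp

    def accept(self, event_id, now):
        # Evict ids older than the window so state stays bounded
        self.seen = {e: t for e, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return False  # duplicate delivery (e.g. a mobile retry): drop
        self.seen[event_id] = now
        return True

d = Deduplicator()
t0 = datetime(2024, 1, 1)
assert d.accept("evt-1", t0) is True
assert d.accept("evt-1", t0 + timedelta(minutes=5)) is False  # retry dropped
assert d.accept("evt-1", t0 + timedelta(hours=25)) is True    # window expired
```

The bloom-filter trade-off in production: it bounds memory at the cost of occasional false positives, meaning a tiny fraction of genuine events get dropped as "duplicates" — usually acceptable for training data, where the article's at-least-once-plus-dedup reasoning applies in reverse.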

Distribution drift in training data: Features computed on January data may have very different distributions in July (seasonality, product changes). Monitor training data distribution before kicking off a training run. If P90 of transaction_amount shifts by > 20%, investigate before training.
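The P90 gate can be sketched in a few lines. This uses a nearest-rank percentile approximation and synthetic data; the 20% threshold mirrors the example above, and the function names are illustrative:

```python
def p90(values):
    """Nearest-rank approximation of the 90th percentile."""
    s = sorted(values)
    return s[int(0.9 * (len(s) - 1))]

def drift_detected(train_values, reference_values, threshold=0.20):
    """Flag the training run if P90 of the feature moved more than
    `threshold` (relative) versus the reference window."""
    ref = p90(reference_values)
    return abs(p90(train_values) - ref) / ref > threshold

january = list(range(10, 110))       # reference window, P90 = 99
july = [v * 1.5 for v in january]    # a +50% seasonal shift

assert drift_detected(july, january) is True
assert drift_detected(january, january) is False
```

In practice you would run this check per feature as a pre-training validation step and fail the pipeline (or page someone) rather than silently train on shifted data.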

TIP

Interview Signal: What Separates Good Candidates

Most candidates describe a generic 'data pipeline.' Strong candidates address: (1) How does the pipeline handle late-arriving events? (Flink watermarks with a bounded out-of-orderness of 10–30 seconds). (2) How do you ensure exactly-once processing for velocity features so fraud scores aren't inflated by duplicate events? (Flink checkpointing + Kafka transactions). (3) How do you generate training data without leaking future information? (Point-in-time joins using the offline store's historical values). (4) How does the serving feature computation stay consistent with the training feature computation? (Same feature definition registered in the feature store, executed by the same framework). Mention these four points and you've covered what 95% of candidates miss.
