
Data Engineering Pipelines: Reliability, Quality, and Evolution

How strong engineers design data pipelines that are observable, backfillable, and resilient to schema drift, late data, and silent data quality failures. Covers batch vs. streaming tradeoffs, Lambda vs. Kappa architecture, exactly-once semantics, Airflow DAG design, Flink checkpointing, and the monitoring patterns used by production data teams at Airbnb, LinkedIn, and Netflix.


Pipelines Are Products, Not Glue Code

Data pipelines fail when treated as one-off ETL scripts. Production failures are almost never caused by wrong SQL — they are caused by late data that arrives after a window closes, schema drift from an upstream team that added a nullable column without announcing it, silent data quality failures (row counts look fine, but 10% of records have NULL user_ids), and unrecoverable state when a streaming job has checkpointed corrupted state and cannot replay from it.

Strong engineers treat pipelines as products with explicit contracts, SLAs, and operational runbooks. The key questions to answer before writing code: Who are the downstream consumers? What is the freshness SLA (15 minutes? 24 hours?)? What is the completeness SLA (99.9% of events, or all-or-nothing)? How do we detect a silent failure? How do we backfill 30 days of historical data if we change the schema?

The candidate gap: A junior engineer writes a Spark job that reads from S3 and writes to a warehouse. A senior engineer defines a data contract, adds quality gates at ingestion, designs for idempotent replay, and instruments freshness and row-count metrics. A staff engineer discusses schema registry integration, exactly-once semantics in Flink, late-data handling with event-time watermarks, and the architectural trade-offs between Lambda and Kappa.

IMPORTANT

What Interviewers Evaluate

6/10: Can describe batch vs. streaming trade-offs. Mentions Spark and Kafka. Does not address data quality, schema evolution, or backfill strategy.

8/10: Defines data contracts. Adds quality checks (completeness, validity). Designs for idempotent writes. Discusses Airflow for orchestration and Kafka for streaming. Can explain why exactly-once is hard.

10/10: Addresses schema evolution via a schema registry (Confluent Schema Registry with Avro). Explains Flink's checkpointing for exactly-once. Discusses late-data handling with event-time watermarks. Knows the trade-offs between Lambda and Kappa architectures and when each is appropriate. Mentions specific observability patterns: data freshness SLOs, row-count anomaly detection, and per-column null rate monitoring. Names production examples: Airbnb's Minerva, LinkedIn's DataHub, Netflix's Metacat.

Pipeline Design Checklist — Before Writing Any Code

01

Define the data contract

Ownership, schema, freshness SLA, completeness SLA, and downstream consumers. A contract without an SLA is not a contract — it is a hope.

02

Choose the execution model

Ask: what is the freshness SLA? >1 hour → batch (Spark). 1–60 minutes → micro-batch (Spark Structured Streaming). <1 minute → streaming (Flink). Most 'real-time' requirements are micro-batch once you push on the actual business need.

03

Register the schema

Register the schema in Confluent Schema Registry with BACKWARD compatibility before any producer deploys. Any BACKWARD-incompatible change is blocked in CI — not caught at 3am by a deserialization exception.

04

Add quality gates

Completeness (row count vs. expected), validity (null rate per critical column < 0.1%), uniqueness (dedup on event_id), timeliness (data arrived within SLA). Fail the pipeline before writing to downstream tables if checks fail.

05

Design for idempotent replay

Every pipeline must be safely re-runnable. Use INSERT OVERWRITE PARTITION, not INSERT INTO. For streaming, use idempotent sinks or exactly-once semantics. If re-running produces duplicates, the pipeline is not production-ready.

06

Instrument freshness and anomaly alerts

Freshness lag (current_time - max(event_time)), row count anomaly (±10% vs. 7-day median), consumer group lag (for streaming). Alert before downstream consumers run their morning jobs.

Batch vs. Streaming vs. Micro-Batch — Choosing the Right Model

The choice of execution model is the first and most consequential pipeline design decision. Most teams default to batch out of familiarity, then retrofit streaming when freshness requirements tighten — which is the wrong order.

Batch (Spark, Hive, dbt): Processes a bounded dataset (yesterday's events, the last hour of logs) on a schedule. Strengths: simple to reason about, cheap to operate, easy to backfill (re-run the job with a different date partition), and Spark's optimizer produces efficient execution plans for large joins. Weaknesses: data is stale by definition. A nightly batch job produces metrics 24 hours late. For ML feature pipelines serving real-time inference, this is unacceptable.

Streaming (Flink, Kafka Streams, Spark Structured Streaming): Processes an unbounded event stream with millisecond-to-second latency. Strengths: low end-to-end latency, natural fit for event-driven architectures. Weaknesses: operationally complex (checkpointing, exactly-once semantics, state management), harder to backfill (requires replaying from Kafka's retention window or a separate historical source), and joins across streams require windowing with associated late-data risk.

Micro-batch (Spark Structured Streaming in trigger interval mode): Processes small batches every 1–5 minutes. The sweet spot for teams that need near-real-time freshness but want Spark's familiar APIs and SQL support. Latency is higher than true streaming (1–5 min vs. milliseconds) but dramatically simpler to operate. Uber's analytics platform uses micro-batch for most ML feature pipelines where 5-minute freshness is sufficient.
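For concreteness, here is a minimal micro-batch sketch in PySpark. The Kafka brokers, topic, schema, and S3 paths are illustrative, and the trigger(processingTime="5 minutes") call is what makes this micro-batch rather than continuous processing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("orders_microbatch").getOrCreate()

# Illustrative event schema for the Kafka payload
schema = StructType([
    StructField("order_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# trigger(processingTime=...) plans and runs a small batch every 5 minutes
query = (
    events.writeStream
    .trigger(processingTime="5 minutes")
    .option("checkpointLocation", "s3://data/checkpoints/orders_microbatch/")
    .format("parquet")
    .option("path", "s3://data/orders_microbatch/")
    .start()
)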

Decision rule: Ask "what is the freshness SLA?" If > 1 hour: batch. If 1–60 minutes: micro-batch. If < 1 minute: streaming. The majority of "we need real-time" requirements turn out to be micro-batch once you push on the underlying business need.

Execution Model Comparison

| Model | Latency | Complexity | Backfill | Best For | Real Systems |
| --- | --- | --- | --- | --- | --- |
| Batch (Spark) | Hours | Low | Easy — rerun with date partition | Daily analytics, feature backfill, dbt models | Airbnb Minerva, Netflix Keystone batch |
| Micro-batch (Spark SS) | 1–5 min | Medium | Medium — replay from Kafka | Near-real-time features, dashboards | Uber platform, Databricks Delta Live Tables |
| Streaming (Flink) | Milliseconds | High | Hard — replay + state rebuild | Fraud detection, real-time recommendations, alerting | LinkedIn, Lyft, DoorDash stream processing |
| Kafka Streams | Milliseconds | Medium | Hard — consumer group reset | Stateful enrichment co-located with Kafka | Confluent-native pipelines, microservice event processing |

Lambda Architecture vs. Kappa Architecture

Many production data platforms are built on one of two architectural patterns for serving both real-time and historical queries.

Lambda architecture maintains two separate processing paths: a batch layer (Spark jobs over the full historical dataset, producing accurate but stale metrics) and a speed layer (Flink or Kafka Streams over the live event stream, producing fast but potentially approximate metrics). A serving layer merges results from both. The batch layer periodically overwrites the speed layer's output, correcting any approximations.

Lambda's strength: the batch layer produces accurate numbers from complete data; the speed layer provides freshness. Its weakness: two codebases for the same business logic. When the definition of "daily active users" changes, both the Spark job and the Flink job must be updated, tested, and deployed in sync. LinkedIn's early data platform suffered from this — business logic drifted between the two layers, producing different numbers from the "same" metric depending on which layer you queried.

Kappa architecture eliminates the batch layer. All processing goes through the stream processor (Flink). Historical reprocessing is done by replaying from Kafka (requires long retention — 30–90 days — or a separate event store like S3). The same pipeline code processes both live events and replayed historical events.

Kappa's strength: one codebase, one definition of truth. Its weakness: Kafka retention for 30+ days is expensive, and reprocessing 30 days of data through Flink takes significant cluster time. For exploratory analytics (ad-hoc SQL on historical data), Kappa requires materializing results to a query-friendly store (Iceberg, Delta Lake).

In practice: Most large organizations use a hybrid. Flink for real-time signals, Spark for historical backfill and complex joins, unified business logic in a shared library, and Iceberg/Delta as the unifying table format that both can write and read. This is what Airbnb's Minerva metric platform does.
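A sketch of the shared-business-logic piece of that hybrid, assuming Spark DataFrames on both the batch and streaming paths (metric and column names are illustrative). The same function is imported by the nightly backfill job and the Structured Streaming job, so the metric has exactly one definition:

from pyspark.sql import DataFrame
from pyspark.sql.functions import approx_count_distinct, col, to_date


def daily_active_users(events: DataFrame) -> DataFrame:
    """One definition of the DAU metric, shared by the batch and streaming jobs."""
    return (
        events
        .filter(col("event_type") == "session_start")
        .withColumn("event_date", to_date(col("event_time")))
        .groupBy("event_date")
        .agg(approx_count_distinct("user_id").alias("dau"))
    )


# Batch path:     daily_active_users(spark.read.table("events_historical"))
# Streaming path: daily_active_users(kafka_stream_df), with a watermark on event_time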

Production Pipeline Architecture — Batch + Streaming with Quality Gates

(Architecture diagram: parallel batch and streaming paths with quality gates, as described in the sections below.)

Data Quality — Validation at Ingestion

Silent data quality failures are the most dangerous pipeline failure mode because they don't page anyone. The row count looks fine; the latency is normal; but 15% of user_id fields are NULL because a mobile SDK update sent malformed events. Downstream ML models silently degrade.

Four mandatory quality dimensions at ingestion:

1. Completeness: Are all expected records present? For event pipelines: compare the event count in Kafka vs. the count written to the warehouse. A >2% gap triggers an alert. For batch pipelines: compare today's partition row count against the 7-day moving average. A >5% deviation is anomalous.

2. Validity: Do values conform to the schema and business rules? Null rate per critical column (null user_id should be <0.1%), enum value coverage (all event_type values must be in the known set), referential integrity (every restaurant_id in order events must exist in the restaurant table).

3. Uniqueness: Are there duplicate records? Event deduplication requires a unique event ID. COUNT(*) vs COUNT(DISTINCT event_id) should differ by <0.01%. Duplicates cause double-counting in metrics and can corrupt ML training labels.

4. Timeliness: Did the data arrive within the SLA? For a batch job that should produce output by 6am, a late-arrival monitor alerts at 6:15am if the Iceberg partition is not yet written.

Implementation — Great Expectations or custom: LinkedIn's DataHub and Airbnb's Minerva both use a quality framework layered over Spark jobs. Each quality check is a function that returns (passed: bool, metric_value: float, threshold: float). Failing checks surface in a quality dashboard and, for critical pipelines, fail the Airflow DAG before writing to the downstream table — catching bad data before it contaminates downstream consumers.
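A minimal sketch of that pattern without a framework, assuming PySpark with illustrative table, column, and threshold names. Each check returns the (passed, metric_value, threshold) triple described above, and the gate raises so the orchestrator task fails before promotion:

from dataclasses import dataclass
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col


@dataclass
class CheckResult:
    name: str
    passed: bool
    metric_value: float
    threshold: float


def null_rate_check(df: DataFrame, column: str, threshold: float = 0.001) -> CheckResult:
    total = df.count()
    nulls = df.filter(col(column).isNull()).count()
    rate = nulls / total if total else 1.0
    return CheckResult(f"null_rate[{column}]", rate <= threshold, rate, threshold)


def uniqueness_check(df: DataFrame, key: str, threshold: float = 0.0001) -> CheckResult:
    total = df.count()
    distinct = df.select(key).distinct().count()
    dup_rate = (total - distinct) / total if total else 1.0
    return CheckResult(f"duplicates[{key}]", dup_rate <= threshold, dup_rate, threshold)


def run_quality_gate(spark: SparkSession, date: str) -> None:
    staged = spark.read.parquet(f"s3://data/staging/orders_daily/event_date={date}")
    results = [
        null_rate_check(staged, "user_id"),
        uniqueness_check(staged, "event_id"),
    ]
    failed = [r for r in results if not r.passed]
    if failed:
        # Fail the task so the promote step never runs on bad data
        raise ValueError(f"Quality gate failed: {failed}")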

Schema Evolution — Schema Registry and Compatibility Modes

Schema drift is the second most common cause of pipeline failures after late data. An upstream service adds a new required field to an Avro schema. The Flink consumer that expects the old schema cannot deserialize the new message. The job throws SchemaDeserializationException at 3am. The on-call engineer wakes up to a 4-hour backlog.

A schema registry solves this by versioning schemas and enforcing compatibility at publish time. Confluent Schema Registry stores schemas by subject (typically <topic>-value). Every Kafka producer registers its schema before publishing; every consumer fetches the schema by the schema ID embedded in the message payload (the serializer prepends a magic byte and a 4-byte schema ID to each record).

Three compatibility modes:

  • BACKWARD (recommended): Consumers using the new schema can read data written with the old schema. Consumers upgrade first, then producers. Safe for deleting fields and for adding optional fields with defaults. Does NOT allow adding a field without a default.
  • FORWARD: Consumers using the old schema can read data written with the new schema. Producers can upgrade first. Safe for adding fields and for deleting optional fields; deleting a required field breaks old consumers.
  • FULL: Both BACKWARD and FORWARD — only adding or deleting optional fields with defaults is allowed. Most restrictive, and the safest default for Avro schemas shared across teams with independent release cycles.

Schema evolution rules for Avro under FULL compatibility:

  • ✅ Add a field with a default value ("default": null)
  • ✅ Remove a field that had a default value
  • ❌ Add or remove a field that has no default (breaks compatibility in one direction)
  • ❌ Rename a field (treated as remove + add)
  • ❌ Change a field's type (e.g., int → string); Avro's numeric promotions such as int → long are backward- but not forward-compatible

CI enforcement: Add a schema compatibility check to the CI pipeline. Before merging a schema change, the CI job registers the proposed schema against the registry in dry-run mode. If it fails the compatibility check, the merge is blocked. This prevents 3am incidents from schema drift — a pattern used by LinkedIn for their internal Avro schemas.
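A sketch of that CI gate using the Schema Registry's documented REST compatibility endpoint. The registry URL, subject name, and schema file path are placeholders for your setup:

import json
import sys

import requests

REGISTRY_URL = "https://schema-registry.internal:8081"   # placeholder
SUBJECT = "orders-value"                                  # placeholder


def check_compatibility(schema_path: str) -> bool:
    """Test a proposed Avro schema against the latest registered version."""
    with open(schema_path) as f:
        proposed_schema = f.read()
    resp = requests.post(
        f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": proposed_schema}),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("is_compatible", False)


if __name__ == "__main__":
    if not check_compatibility(sys.argv[1]):
        print(f"Schema change is incompatible with the latest {SUBJECT} schema")
        sys.exit(1)   # non-zero exit blocks the merge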

Backfill Strategies — Replaying Historical Data Safely

Backfills are the most dangerous operation in a data engineering pipeline because they write to partitions that downstream consumers are already reading. A poorly designed backfill corrupts production data.

Partition-based backfill (batch): For Spark pipelines writing to Iceberg or Hive, backfill by reprocessing one partition (typically one day) at a time. The Airflow DAG accepts start_date and end_date parameters and loops over partitions sequentially. Write to a staging location first (s3://data/staging/table/event_date=2025-01-01), validate quality, then atomically swap it into production (ALTER TABLE ... PARTITION ... SET LOCATION for an existing Hive partition, ADD PARTITION for a new one, or a snapshot commit for Iceberg).

Shadow write pattern: For critical tables, run the new pipeline in parallel with the old pipeline for 7–14 days. Compare row counts, key metrics, and sample records between the two outputs. Only cut over to the new pipeline when you have high confidence in its correctness. Netflix uses shadow writes for all major pipeline migrations.

Idempotency is the prerequisite: A backfill job must be safely re-runnable. If re-running the job writes duplicate rows, the backfill causes more damage than it fixes. Design for idempotency: use INSERT OVERWRITE PARTITION (not INSERT INTO) so re-running the job replaces the partition rather than appending to it.
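A sketch of the idempotent write in PySpark, using dynamic partition overwrite so a re-run replaces only the partition it recomputes. Table, column, and database names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders_backfill").getOrCreate()

# Only partitions present in the incoming DataFrame are overwritten
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")


def write_partition(date: str) -> None:
    df = spark.table("raw.orders").where(f"event_date = '{date}'")
    daily = (
        df.groupBy("event_date", "restaurant_id")
        .count()
        .withColumnRenamed("count", "order_count")
        .select("restaurant_id", "order_count", "event_date")  # partition column last
    )
    # insertInto with overwrite=True replaces the matching partition, not the table
    daily.write.insertInto("analytics.orders_daily", overwrite=True)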

Streaming backfill from Kafka: If Kafka retention is sufficient (7–30 days), replay the consumer group from the earliest offset. For historical data older than Kafka retention, replay from S3 archive using the same Flink job with a FileSource connector — the pipeline code is identical, only the source changes. This is the Kappa architecture advantage.

Coordinating with downstream consumers: Notify downstream consumers before starting a backfill on a partition they depend on. If a downstream Airflow DAG reads the same Iceberg table partition, it may read a partially-written partition during backfill. Use Iceberg's snapshot isolation: the downstream consumer reads the snapshot that existed before the backfill started, while the backfill writes a new snapshot. The downstream consumer upgrades to the new snapshot only after the backfill is complete and validated.
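A sketch of the snapshot-pinning read, assuming an Iceberg-enabled Spark session (catalog configured, Iceberg runtime on the classpath); the snapshot ID is a placeholder taken from the table's history before the backfill began:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("downstream_reader").getOrCreate()

# Read the snapshot that existed before the backfill started, unaffected by
# any snapshots the backfill commits while this job runs.
pinned = (
    spark.read
    .format("iceberg")
    .option("snapshot-id", 4358109269316404521)   # placeholder snapshot ID
    .load("analytics.orders_daily")
)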

Monitoring — Pipeline SLAs, Freshness, and Anomaly Detection

Freshness lag is the most important pipeline health metric. Define it as: freshness_lag = current_time - max(event_time) in latest partition. For a pipeline with a 30-minute SLA, alert if freshness_lag > 45 minutes. This catches: pipeline failures, source slowdowns, and Kafka consumer lag accumulation.

Row count anomaly detection: Day-over-day row count should be within a defined band (typically ±10% for stable data, ±30% for event-driven data). Compute the 7-day median row count for each partition and alert if the current partition is outside [median × 0.7, median × 1.3]. This catches silent failures where a pipeline ran but produced 10% of expected output (e.g., a filter bug).

Per-column null rate monitoring: Track null_rate for every non-nullable business key (user_id, order_id, event_type) in every partition. Alert if null_rate rises above 0.1%. A sudden jump from 0.01% to 5% null user_ids indicates a source schema change or a bug in the enrichment step.
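These three monitors can run as one scheduled check (for example, a small Airflow task) against the warehouse. The sketch below assumes PySpark, illustrative table names and thresholds, and an alert callable wired to your paging system:

def check_orders_daily(spark, alert) -> None:
    today = spark.sql("""
        SELECT count(*) AS row_count,
               avg(CASE WHEN user_id IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate,
               (unix_timestamp(current_timestamp())
                  - unix_timestamp(max(event_time))) / 60.0 AS lag_min
        FROM analytics.orders_daily
        WHERE event_date = current_date()
    """).first()

    # 1. Freshness lag: alert at 45 minutes on a 30-minute SLA
    if today.lag_min is None or today.lag_min > 45:
        alert(f"Freshness lag {today.lag_min} min exceeds threshold")

    # 2. Row-count anomaly vs. the 7-day median partition size (±30% band)
    median = spark.sql("""
        SELECT percentile_approx(cnt, 0.5) AS median
        FROM (SELECT event_date, count(*) AS cnt
              FROM analytics.orders_daily
              WHERE event_date >= date_sub(current_date(), 7)
              GROUP BY event_date) AS daily
    """).first().median
    if not (0.7 * median <= today.row_count <= 1.3 * median):
        alert(f"Row count {today.row_count} outside ±30% of 7-day median {median}")

    # 3. Null rate on a critical business key (< 0.1% allowed)
    if today.null_rate and today.null_rate > 0.001:
        alert(f"user_id null rate {today.null_rate:.2%} exceeds 0.1%")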

Airflow SLA miss callbacks: Set sla on the task and sla_miss_callback on the DAG — Airflow invokes the callback when a task has not completed within the SLA window after the scheduled run time. Use this to page the on-call engineer before downstream consumers start their morning jobs:

with DAG(dag_id="daily_orders", sla_miss_callback=pagerduty_alert, ...) as dag:
    t = SparkSubmitOperator(
        task_id="daily_orders_agg",
        sla=timedelta(hours=2),
    )

Kafka consumer group lag: For streaming pipelines, monitor consumer group lag (messages in Kafka not yet consumed). A growing lag means the Flink job is falling behind. Alert if lag exceeds 5 × average_throughput_per_minute (i.e., more than 5 minutes of unprocessed events).

Common Pipeline Failures and Mitigations

| Failure Mode | Detection Signal | Mitigation | Prevention |
| --- | --- | --- | --- |
| Late data (events arrive after window closes) | Freshness lag spike; row count drop in window | Extend watermark tolerance; reprocess with a wider window | Set watermark = P99 event delay, not P50 |
| Duplicate data | COUNT(*) >> COUNT(DISTINCT event_id) | Dedup on event_id in Spark/Flink; idempotent sink writes | Embed unique event_id at source; validate uniqueness at ingestion |
| Schema drift (upstream adds/removes field) | Deserialization exception; null_rate spike on new field | Schema registry compatibility check blocks invalid schema | Enforce BACKWARD compatibility in CI; consumer reads unknown fields as null |
| Partition read during backfill write | Downstream query returns partial results | Iceberg snapshot isolation; write to new snapshot, commit atomically | Shadow write pattern; notify consumers before backfill |
| Corrupted Flink checkpoint | Job restarts loop; cannot restore from checkpoint | Fall back to earlier checkpoint; replay from Kafka offset | Retain last N checkpoints; validate checkpoint integrity post-write |
| Airflow task upstream failure | Downstream tasks run on stale data | Airflow DAG dependencies + SLA miss alerts | Use `trigger_rule=all_success` (default); never use `all_done` |

Airflow DAG — Idempotent Daily Batch Pipeline

orders_daily_dag.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def send_pagerduty_alert(message: str) -> None:
    """Placeholder alert hook — wire this to PagerDuty/Opsgenie/Slack in practice."""
    ...


def run_quality_checks(date: str, table: str) -> None:
    """Fail fast if data quality checks don't pass — before promotion."""
    from great_expectations import get_context

    context = get_context()
    result = context.run_checkpoint(
        checkpoint_name=f"{table}_quality",
        batch_request={"datasource_name": "staging", "data_asset_name": date},
    )
    if not result.success:
        raise ValueError(f"Quality checks failed for {table} on {date}: {result}")


def promote_to_production(date: str) -> None:
    """Atomically swap the validated staging partition into the production table
    (e.g. an Iceberg snapshot commit or ALTER TABLE ... PARTITION ... SET LOCATION)."""
    ...


# Idempotent batch pipeline: safe to backfill by setting start_date + end_date
default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
}

with DAG(
    dag_id="orders_daily_aggregate",
    default_args=default_args,
    schedule_interval="0 3 * * *",    # 3am UTC, after upstream events close
    start_date=datetime(2024, 1, 1),
    catchup=True,                      # enables backfill via start_date/end_date
    max_active_runs=3,                 # allow 3 parallel backfill runs
    sla_miss_callback=lambda *args: send_pagerduty_alert("orders_daily SLA missed"),
) as dag:

    # Step 1: Run Spark job — writes to a staging partition, promoted only after the gate
    # Partition pruning: WHERE event_date = '{{ ds }}' touches exactly one partition
    spark_job = SparkSubmitOperator(
        task_id="compute_daily_orders",
        application="s3://pipelines/orders_aggregate.py",
        application_args=[
            "--date", "{{ ds }}",           # Airflow's ds macro makes reruns deterministic
            "--output", "s3://data/staging/orders_daily/event_date={{ ds }}",
        ],
        conf={
            "spark.sql.sources.partitionOverwriteMode": "dynamic",  # overwrite single partition
            "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        },
        sla=timedelta(hours=2),
    )

    # Step 2: Quality gate — fail DAG before promoting if checks fail
    quality_check = PythonOperator(
        task_id="quality_gate",
        python_callable=run_quality_checks,
        op_kwargs={"date": "{{ ds }}", "table": "orders_daily"},
    )

    # Step 3: Promote staging → production (atomic Iceberg snapshot commit)
    promote = PythonOperator(
        task_id="promote_partition",
        python_callable=promote_to_production,
        op_kwargs={"date": "{{ ds }}"},
        trigger_rule=TriggerRule.ALL_SUCCESS,  # never promote on quality failure
    )

    spark_job >> quality_check >> promote
⚠ WARNING

Production Pitfalls

1. Not setting event-time watermarks. Processing-time windowing produces wrong results when events arrive out of order. Always use event time plus a watermark set to the P99 event delay (not P50) — a sketch follows this list. A watermark that's too tight drops late events; one that's too loose delays output. LinkedIn found that most of their "missing data" incidents came from watermarks set at the P50 delay, which dropped 10% of events.

2. Using INSERT INTO instead of INSERT OVERWRITE for batch backfills. Re-running a batch job with INSERT INTO appends rows, causing duplicates. Always use INSERT OVERWRITE PARTITION or Iceberg's MERGE INTO for idempotent writes. A backfill that creates duplicates is worse than no backfill.

3. No quality gate before downstream tables. Writing bad data to a warehouse table is recoverable. Writing it to a feature store that serves ML models causes silent model degradation that may not be noticed for days. Always interpose a quality gate that blocks the write if null rates or row counts are anomalous.

4. Checkpoint interval too long. A 10-minute Flink checkpoint interval means up to 10 minutes of replay on failure. During replay, no new output is produced — the Kafka consumer lag grows. On a high-throughput job (500K events/sec), 10 minutes of replay means replaying 300M events. Use 60–120 second intervals for production jobs.

5. Treating Airflow failures as one-off incidents. Each Airflow task failure should trigger a post-mortem question: "Could this have been caught by a quality check?" Most data engineering incidents are preventable by better monitoring, not by heroic incident response.
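For pitfall 1, this is what event-time watermarking looks like in Spark Structured Streaming — a minimal sketch where the 2-hour watermark stands in for a measured P99 event delay and the rate source stands in for your Kafka stream:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("event_time_windows").getOrCreate()

# Placeholder streaming source; in production this is the Kafka source
events = (
    spark.readStream.format("rate").load()
    .withColumnRenamed("timestamp", "event_time")
)

windowed = (
    events
    .withWatermark("event_time", "2 hours")      # tolerance = measured P99 event delay
    .groupBy(window(col("event_time"), "15 minutes"))
    .count()
)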

TIP

Interview Delivery Summary

Lead with the freshness SLA question: "Before choosing batch or streaming, I need to know the freshness SLA. If it's >1 hour, batch is simpler and cheaper. If it's 1–60 minutes, micro-batch with Spark Structured Streaming. If it's <1 minute, we need Flink. Most 'real-time' requirements are actually micro-batch once you push on the business need."

State the quality gate requirement early: "I always add quality checks before writing to any downstream table — completeness, validity, uniqueness, and timeliness. A pipeline that produces data on time but with 10% NULL user_ids is worse than no pipeline, because it silently corrupts downstream consumers."

Schema evolution signal: "I register all schemas in Confluent Schema Registry with BACKWARD compatibility. Any schema change that breaks backward compatibility is blocked in CI — not caught at 3am by a deserialization exception in Flink."

Staff signal: "For exactly-once semantics, I'd use Flink's two-phase commit with a transactional Kafka sink. But I'd first ask whether at-least-once with an idempotent sink is sufficient — it almost always is and avoids the 10–20% throughput penalty of two-phase commit. The key question is: does re-processing an event produce the same output? If yes, exactly-once is unnecessary complexity."
