ML System Design · Intermediate

ML Pipelines & Orchestration: Airflow, Kubeflow, and CI/CD for Models

How production ML teams automate the full model lifecycle — from data ingestion through training, evaluation, and deployment. Covers Airflow vs Kubeflow Pipelines, containerized training steps, automated model validation gates, and the CI/CD practices that separate mature ML teams from ad-hoc ones.

Tags: ML Pipelines · Airflow · Kubeflow · Orchestration · CI/CD for ML · MLOps · Pipeline Automation · DAG · Containerization · Model Validation · Data Validation · Model Deployment · Retraining Pipeline

What ML Pipelines Are — And Why You Need Them

An ML pipeline is the automated, reproducible sequence of steps that takes raw data and produces a deployed model (see Google's MLOps whitepaper: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning). Without a pipeline, each model training is a manual, unreproducible process: a data scientist runs a Jupyter notebook locally, uploads the model file to a shared drive, and files a ticket for an engineer to deploy it. This process is: slow (days to weeks), error-prone (environment differences, manual steps), unreproducible (which version of the data? which code?), and unscalable (one data scientist is a bottleneck).

With a pipeline: a Git commit triggers automated training. The pipeline fetches the latest data, validates it, trains the model, runs automated evaluation, and either deploys or creates a review task if evaluation fails. End-to-end: 4 hours instead of 2 weeks. Reproducible: every run is versioned. Scalable: multiple models train in parallel.

What every ML pipeline must include:

  • Data ingestion and validation
  • Feature computation (may call the feature store)
  • Model training (potentially distributed)
  • Offline evaluation with automated thresholds
  • Model registration in the model registry
  • Deployment (to shadow, canary, or production)
  • Rollback mechanism if deployed model degrades

The pipeline is not optional. It is the infrastructure that makes 'move fast' in ML actually possible.

Designing an ML Pipeline — Key Decisions and Gates

01

Define the trigger: scheduled vs event-driven vs manual

Scheduled: retraining runs on a calendar cadence (weekly, daily, hourly). Right for stable signals with predictable staleness. Event-driven: retraining triggers on a drift alert, a new data volume threshold, or a business metric drop. Right for production systems where you want to respond to real-world changes. Manual: a team member approves each retrain. Right for high-stakes models (medical, financial decisions) where automated deployment requires extra oversight.
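
For concreteness, here is a minimal sketch of the first two trigger styles in Airflow. The DAG ids, cron expression, and the Airflow 2.4+ `schedule` argument are assumptions, and the retrain callable is a placeholder for whatever actually launches training.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain(**context):
    # Placeholder: kick off the actual training pipeline (e.g., a Kubeflow run).
    ...

# Scheduled trigger: retrain every Monday at 03:00 UTC.
with DAG(
    dag_id="retrain_weekly",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="0 3 * * 1",               # cadence chosen per staleness tolerance
    catchup=False,
):
    PythonOperator(task_id="retrain", python_callable=retrain)

# Event-driven trigger: no schedule; a drift monitor or data-volume check starts a
# run via `airflow dags trigger retrain_on_drift` or the Airflow REST API.
with DAG(
    dag_id="retrain_on_drift",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    PythonOperator(task_id="retrain", python_callable=retrain)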

02

Define mandatory data validation gates before training

Every pipeline must validate: (1) schema conformance — do all expected features exist with correct dtypes? (2) completeness — are any critical features missing >5% of values? (3) statistical drift — has any feature distribution shifted beyond threshold from the last training run? (4) volume check — is there enough new data to justify a retrain? If any gate fails, the pipeline halts and pages on-call. Never train on silently corrupted data.
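
A minimal plain-pandas sketch of gates (1), (2), and (4) follows; the expected schema, null budgets, and row-count floor are illustrative, and the drift gate (3) is covered in the data validation section later.

import pandas as pd

# Illustrative expectations; tune per model.
EXPECTED_DTYPES = {"user_id": "int64", "user_age": "float64", "label": "int64"}
MAX_NULL_RATE = {"user_age": 0.05}
MIN_ROWS = 100_000

def validate_before_training(df: pd.DataFrame) -> None:
    """Raise on any failed gate so the pipeline halts and pages on-call."""
    # (1) Schema conformance: expected columns exist with expected dtypes.
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns or str(df[col].dtype) != dtype:
            raise ValueError(f"Schema check failed for column {col!r}")
    # (2) Completeness: critical features stay within their null budget.
    for col, max_rate in MAX_NULL_RATE.items():
        null_rate = df[col].isna().mean()
        if null_rate > max_rate:
            raise ValueError(f"{col}: null rate {null_rate:.1%} exceeds {max_rate:.0%}")
    # (4) Volume: enough new data to justify a retrain.
    if len(df) < MIN_ROWS:
        raise ValueError(f"Only {len(df)} rows; expected at least {MIN_ROWS}")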

03

Define evaluation gates before deployment

Set minimum offline eval thresholds: 'do not deploy if val_auc drops below 0.85' or 'do not deploy if NDCG@10 is more than 2% worse than the current production model.' These thresholds must be versioned with the pipeline, and the bar should be relative to the current production baseline rather than a single static number. Include a comparison against a simple baseline model (most-popular items, a linear model) to catch regressions.
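
One way to make the bar relative rather than static is to read the production baseline from the model registry at evaluation time. A sketch assuming MLflow holds the production model and its logged NDCG@10; the model name and metric key are hypothetical.

from mlflow.tracking import MlflowClient

def passes_eval_gate(new_ndcg10: float, model_name: str = "ranker") -> bool:
    client = MlflowClient()
    prod = client.get_latest_versions(model_name, stages=["Production"])
    if not prod:
        # No production model yet: compare against a simple popularity baseline instead.
        return new_ndcg10 > 0.30  # illustrative baseline NDCG@10
    prod_metrics = client.get_run(prod[0].run_id).data.metrics
    # Block deployment if more than 2% worse than the current production model.
    return new_ndcg10 >= prod_metrics["ndcg_at_10"] * 0.98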

04

Design deployment stages: shadow → canary → production

Never deploy directly to 100% traffic. Shadow: new model runs in parallel with production, logs predictions but doesn't serve users — validates latency and infrastructure. Canary: 5–10% of traffic sees the new model, measure business metrics for 48 hours. Production ramp: 25% → 50% → 100% with automated metric checks at each step. Rollback: automated if key metric degrades >2% relative to the control slice.
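
A sketch of the ramp-controller logic, assuming the shadow stage has already passed; route_traffic and metric_delta_vs_control are hypothetical helpers wrapping the serving layer and the metrics store.

import time

RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]   # canary, then production ramp
SOAK_SECONDS = 48 * 3600                # observation window per step
MAX_DEGRADATION = 0.02                  # rollback if key metric drops > 2%

def ramp_new_model(model_version: str) -> bool:
    for fraction in RAMP_STEPS:
        route_traffic(model_version, fraction)            # hypothetical helper
        time.sleep(SOAK_SECONDS)
        delta = metric_delta_vs_control(model_version)    # hypothetical helper
        if delta < -MAX_DEGRADATION:
            route_traffic(model_version, 0.0)             # automated rollback
            return False
    return True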

05

Define rollback mechanism and target recovery time

What is your RTO (Recovery Time Objective) if the deployed model produces bad predictions? Model rollback must be: (1) a single command or automated trigger, (2) instant (serving layer switches the model endpoint, not a retrain), (3) reversible (the previous model artifact is retained for at least 30 days). Document the rollback path before deploying any model to production.
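
If the serving layer always loads whichever registry version is marked Production, rollback can be a single registry transition. A sketch assuming MLflow stages and sequential version numbers (both assumptions; the model name is hypothetical):

from mlflow.tracking import MlflowClient

def rollback(model_name: str = "ranker") -> None:
    client = MlflowClient()
    current = client.get_latest_versions(model_name, stages=["Production"])[0]
    previous_version = str(int(current.version) - 1)  # assumes sequential versions
    # Instant: the serving layer re-resolves "Production" to the previous artifact;
    # no retraining is involved, and the old artifact is still in the registry.
    client.transition_model_version_stage(
        model_name, previous_version, "Production", archive_existing_versions=True
    )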

Airflow vs Kubeflow Pipelines — When to Use Each

The two dominant orchestration systems for ML pipelines have fundamentally different design philosophies.

Apache Airflow: DAG-based Python orchestration. Each node in the DAG is an 'operator' — a Python function, Bash command, or Spark job. Airflow handles scheduling (cron-based or trigger-based), retries, dependency management, and monitoring. Execution happens on Airflow workers (Celery or Kubernetes executor); by default every task shares the same environment and image rather than a purpose-built container per step.

Use Airflow for:

  • Complex data pipelines with many interdependent steps
  • Mixed workloads (some ML steps, some SQL transformations, some API calls)
  • Teams already using Airflow for data engineering — unified orchestration
  • When you need rich scheduling semantics (SLA alerts, data sensing, backfills)

Limitation: Airflow steps are not isolated containers. Dependency conflicts between steps (step A needs TensorFlow 2.x, step B needs PyTorch) require workarounds (Docker operators, virtual environments).
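
The Docker-operator workaround might look like the following sketch, which gives each step its own image at the cost of extra boilerplate. The image names are hypothetical, and the tasks are assumed to sit inside an existing `with DAG(...)` block.

from airflow.providers.docker.operators.docker import DockerOperator

# Step A needs TensorFlow, step B needs PyTorch; each runs in its own image.
train_tf = DockerOperator(
    task_id="train_tf_model",
    image="mycompany/trainer-tf:2.15",     # hypothetical image
    command="python train_tf.py",
)
train_torch = DockerOperator(
    task_id="train_torch_model",
    image="mycompany/trainer-torch:2.1",   # hypothetical image
    command="python train_torch.py",
)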

Kubeflow Pipelines: Kubernetes-native ML-specific orchestration. Each step runs in its own Docker container. Steps are defined as Python functions decorated with @component. The pipeline is compiled to Kubernetes YAML and runs entirely on K8s.

Use Kubeflow for:

  • ML-specific workflows where each step has different dependency requirements
  • GPU training steps (native K8s GPU scheduling)
  • Teams already running on Kubernetes
  • When full containerization and reproducibility are required
  • Targeting Google Vertex AI Pipelines, which uses the Kubeflow Pipelines SDK

The practical reality at most companies: Airflow orchestrates data engineering pipelines (dbt runs, Spark jobs, feature store updates). A Kubeflow/Vertex Pipeline handles the ML-specific steps (training, evaluation, deployment) and is triggered by an Airflow DAG when data is ready. The two systems complement each other.
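
The hand-off can be as simple as an Airflow task that submits the compiled Kubeflow pipeline once the data tasks succeed. A sketch using the KFP SDK client; the host, package path, and arguments are hypothetical.

from airflow.operators.python import PythonOperator

def trigger_kfp_run(**context):
    import kfp
    client = kfp.Client(host="https://kubeflow.mycompany.com")    # hypothetical host
    client.create_run_from_pipeline_package(
        "training_pipeline.yaml",                                 # compiled pipeline spec
        arguments={"dataset_path": "s3://bucket/train.parquet"},  # hypothetical
    )

trigger_training = PythonOperator(
    task_id="trigger_kfp_training",
    python_callable=trigger_kfp_run,
)
# feature_store_update >> trigger_training   # runs only after data is ready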

End-to-End ML Pipeline — Automated Training to Deployment


Kubeflow Pipeline Definition

ml_pipeline.py (Python)
from kfp import dsl, compiler
from kfp.dsl import component, pipeline, Input, Output, Dataset, Model, Metrics

@component(
    base_image="python:3.11",
    packages_to_install=["pandas", "great-expectations", "boto3"]
)
def validate_data(
    dataset_path: str,
    validation_report: Output[Dataset],
) -> bool:
    """Validates training data quality. Fails pipeline if critical checks fail."""
    import great_expectations as ge
    df = ge.read_parquet(dataset_path)

    # Run each expectation; all must pass before training is allowed to start.
    checks = [
        df.expect_column_values_to_not_be_null("user_id"),
        df.expect_column_values_to_be_between("label", 0, 1),
        df.expect_column_pair_values_a_to_be_greater_than_b(
            "event_timestamp", "feature_timestamp"  # no future leakage
        ),
    ]
    # Persist the raw results as the validation report artifact.
    with open(validation_report.path, "w") as f:
        f.write("\n".join(str(c) for c in checks))

    failed = [c for c in checks if not c.success]
    if failed:
        raise ValueError(f"Data validation failed: {failed}")
    return True

@component(
    base_image="pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime",
    packages_to_install=["mlflow", "scikit-learn"]
)
def train_model(
    training_data: str,
    model_output: Output[Model],
    metrics_output: Output[Metrics],
    learning_rate: float = 0.001,
    epochs: int = 5,
):
    import mlflow

    with mlflow.start_run():
        mlflow.log_param("learning_rate", learning_rate)
        mlflow.log_param("epochs", epochs)
        # Training loop elided: load training_data, fit the model, compute `val_auc_pr`,
        # and save the trained `model` to model_output.path for downstream steps.
        mlflow.log_metric("val_auc_pr", val_auc_pr)
        mlflow.pytorch.log_model(model, "model")
        metrics_output.log_metric("val_auc_pr", val_auc_pr)

@component(base_image="python:3.11")
def evaluate_and_gate(
    model: Input[Model],
    test_data: str,
    production_auc_pr: float,
    new_metrics: Output[Metrics],
) -> bool:
    """Blocks promotion if the new model is worse than production by > 1% absolute."""
    # compute_auc_pr is a placeholder: load the model artifact and score test_data.
    new_auc = compute_auc_pr(model.path, test_data)
    new_metrics.log_metric("auc_pr", new_auc)
    if new_auc < production_auc_pr - 0.01:  # 1% absolute regression threshold
        raise ValueError(
            f"New model AUC-PR {new_auc:.3f} < production {production_auc_pr:.3f} - 0.01"
        )
    return True

@pipeline(name="recommendation-model-training-pipeline")
def training_pipeline(dataset_path: str, production_auc_pr: float = 0.82):
    data_task = validate_data(dataset_path=dataset_path)
    train_task = train_model(
        training_data=dataset_path,
    ).after(data_task)
    eval_task = evaluate_and_gate(
        model=train_task.outputs["model_output"],
        test_data=dataset_path,
        production_auc_pr=production_auc_pr,
    ).after(train_task)
    # deploy_task would follow eval_task
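
To run the definition above, the pipeline is compiled to a YAML spec (which can be versioned in Git) and submitted to a Kubeflow endpoint. A usage sketch assuming the KFP v2 SDK and a hypothetical host:

from kfp import compiler
import kfp

compiler.Compiler().compile(
    pipeline_func=training_pipeline,
    package_path="training_pipeline.yaml",     # versioned alongside the code
)
kfp.Client(host="https://kubeflow.mycompany.com").create_run_from_pipeline_package(
    "training_pipeline.yaml",
    arguments={"dataset_path": "s3://bucket/train.parquet", "production_auc_pr": 0.82},
)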

Automated Data Validation — The Underinvested Gate

The most common ML pipeline failure is not a bad model — it's a bad training dataset that the pipeline doesn't catch. An upstream data quality issue propagates through training and produces a degraded model that passes offline evaluation (because the evaluation data has the same issue) but fails in production.

What to validate before training:

  1. Schema: all expected columns present, correct types, no new unexpected columns
  2. Null rates: each column's null rate is within expected range (e.g., user_age should never be > 5% null). Alert on > 5 percentage point change from last run.
  3. Value ranges: numeric features within historical min/max (with some buffer). Labels are {0, 1} (or whatever the valid set is). No negative ages, impossible timestamps.
  4. Label distribution: positive rate is within expected range. If the fraud rate drops from 1% to 0.1% overnight, either fraud genuinely decreased (unlikely) or the label pipeline broke.
  5. Future leakage check: feature_timestamp < event_timestamp for all training examples (features computed strictly before the label event). Point-in-time integrity; a sketch of checks 4 and 5 follows this list.
  6. Dataset size: training set is at least N examples. If far below expected, upstream data pipeline failed.
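
A plain-pandas sketch of checks 4 and 5; the expected positive-rate band is illustrative.

import pandas as pd

EXPECTED_POSITIVE_RATE = (0.005, 0.02)   # illustrative historical band

def check_labels_and_leakage(df: pd.DataFrame) -> None:
    # Check 4: label distribution within the expected band.
    pos_rate = df["label"].mean()
    lo, hi = EXPECTED_POSITIVE_RATE
    if not lo <= pos_rate <= hi:
        raise ValueError(
            f"Positive rate {pos_rate:.3%} outside [{lo:.1%}, {hi:.1%}]; "
            "the label pipeline may be broken"
        )
    # Check 5: point-in-time integrity; features computed strictly before the event.
    leaked = int((df["feature_timestamp"] >= df["event_timestamp"]).sum())
    if leaked:
        raise ValueError(f"{leaked} rows have feature_timestamp >= event_timestamp")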

Tools: Great Expectations (Python, declarative expectations with HTML reports), dbt tests (for SQL-based transformations), Evidently AI (drift-specific data validation).

Fail loudly: if data validation fails, stop the pipeline immediately. Alert the on-call team. Do NOT train a model on bad data 'and see what happens.' A model trained on corrupted data is worse than no retrain — it actively degrades production.

ML Pipeline Orchestration — Airflow vs Kubeflow vs Vertex AI Pipelines

Step isolation
  • Apache Airflow: shared worker environment; dependency conflicts common
  • Kubeflow Pipelines: each step runs in its own Docker container; full isolation
  • Vertex AI Pipelines: each step in its own container on managed K8s

GPU training support
  • Apache Airflow: requires Docker operator or external job submission; awkward
  • Kubeflow Pipelines: native K8s GPU scheduling; first-class GPU support
  • Vertex AI Pipelines: managed GPU training jobs with automatic scaling

Mixed workloads
  • Apache Airflow: excellent — SQL operators, Bash, Spark, Python in one DAG
  • Kubeflow Pipelines: ML-focused; mixing with non-ML steps is verbose
  • Vertex AI Pipelines: ML-focused; non-ML steps use custom container components

Scheduling complexity
  • Apache Airflow: full cron + data sensing + backfills; most mature
  • Kubeflow Pipelines: trigger-based (no native cron); requires an external trigger
  • Vertex AI Pipelines: trigger-based; Cloud Scheduler for cron

Reproducibility
  • Apache Airflow: medium — DAG code is versioned; environment varies
  • Kubeflow Pipelines: high — container images are versioned and immutable
  • Vertex AI Pipelines: high — managed images + artifact tracking

Learning curve
  • Apache Airflow: high — operators, XComs, providers
  • Kubeflow Pipelines: medium — Python component decorators, YAML spec
  • Vertex AI Pipelines: low if already on GCP; some learning curve for artifacts

Best for
  • Apache Airflow: data engineering teams with mixed ML + ETL pipelines
  • Kubeflow Pipelines: ML teams on Kubernetes needing container isolation
  • Vertex AI Pipelines: GCP-native ML teams wanting managed infrastructure

Production examples
  • Apache Airflow: Airbnb, Lyft, LinkedIn data engineering stacks
  • Kubeflow Pipelines: Spotify model training, Kubeflow on GKE
  • Vertex AI Pipelines: Google internal teams, Spotify on GCP
TIP

What Interviewers Want to Hear About Pipelines

Most candidates describe training pipelines as 'training the model and deploying it.' Strong candidates describe: (1) automated data validation as a gate before training; (2) a model evaluation gate comparing the new model to the production baseline with a minimum improvement threshold; (3) automated shadow deployment before any canary; (4) rollback capability — keeping the previous model hot and switching traffic back in < 2 minutes if online metrics degrade; (5) the model registry as the gatekeeper between Staging and Production states, with required metadata (training dataset version, evaluation metrics, feature schema version) before a model can be promoted.

Bonus points: mention that the pipeline itself is versioned (Kubeflow pipeline YAML is in Git). Any change to the pipeline code goes through code review, just like model code.
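
A sketch of such a registry gate, assuming the required metadata is attached as MLflow model-version tags; the tag names are hypothetical conventions.

from mlflow.tracking import MlflowClient

REQUIRED_TAGS = {"training_dataset_version", "eval_auc_pr", "feature_schema_version"}

def promote_to_production(model_name: str, version: str) -> None:
    client = MlflowClient()
    tags = client.get_model_version(model_name, version).tags
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        raise ValueError(
            f"Cannot promote {model_name} v{version}; missing tags: {sorted(missing)}"
        )
    client.transition_model_version_stage(
        model_name, version, "Production", archive_existing_versions=True
    )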
