Engineering Craft · Intermediate

CI/CD Pipelines: Designing Safe, Fast Delivery for ML and SDE Systems

How to design a CI/CD pipeline from scratch — test pyramid structure, artifact promotion strategies, deployment patterns (blue-green, canary, feature flags), and the specific tradeoffs that apply to ML model serving. Tested in platform engineering, SRE, and senior SDE interviews. Distinct from debugging a broken pipeline in production — this guide covers proactive pipeline architecture.

40 min read · 9 sections · 6 interview questions
CI/CD · Continuous Integration · Continuous Deployment · Pipeline Design · Blue-Green Deployment · Canary Release · Feature Flags · Test Pyramid · Trunk-Based Development · ML Deployment · DevOps · Platform Engineering · Zero-Downtime Deploy

The Gap Between Using CI/CD and Designing It

Most engineers use a CI/CD system every day and can't explain how to build one. They know how to push code and watch the green checkmark — they don't know why the pipeline is structured the way it is, what the tradeoffs were, or how to fix it when it becomes a bottleneck. This gap is what platform engineering and senior SDE interviews expose.

The scope of this guide: How to design a CI/CD system architecture — choosing stages, structuring artifact promotion, selecting deployment strategies, and handling the ML-specific complications (model versioning, data validation, shadow deployment). The emphasis is on design decisions and their tradeoffs, not on syntax of any specific tool (GitHub Actions, Jenkins, ArgoCD).

The boundary with Production Engineering (the prod track on this platform): Prod covers what to do when a deploy breaks a metric in production — how to roll back, how to triage, how to structure an incident. This guide covers how to design the pipeline so that the blast radius of a bad deploy is small and rollbacks are fast. Pre-production architecture, not post-deployment incident response.

IMPORTANT

What Interviewers Test in CI/CD Questions

A 6/10 answer: "We'd have a build stage, test stage, and deploy stage." Describes the obvious pipeline shape without any design reasoning.

A 9/10 answer: Justifies the test pyramid structure, explains the parallelism strategy for fast feedback, names a specific deployment strategy with its tradeoffs, addresses the stateful service problem (DB migrations), and — if the context is ML — explains how model validation gates differ from code test gates. Shows you've debugged a slow pipeline and fixed it.

CI/CD Pipeline Design: The Five Decisions

01

Test Pyramid Structure

Decide the ratio of unit : integration : E2E tests. A common industry benchmark is 70:20:10. Unit tests run in milliseconds, parallelized across cores. Integration tests run in seconds, constrained by test database setup. E2E tests run in minutes and are often the pipeline bottleneck. The design question: where do you draw the gate? Failing fast on unit tests before integration and E2E stages rejects most bad commits before the expensive stages run, saving the bulk of otherwise wasted CI compute.

02

Artifact Promotion Strategy

Code is built once per commit and the resulting artifact (container image, wheel, binary) is promoted through environments — dev → staging → production. Never rebuild the same commit for each environment. Promotion means the same artifact passes gates at each stage. This is the discipline that prevents 'works in staging, breaks in prod' failures caused by build environment differences.
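A minimal sketch of what promotion (as opposed to rebuilding) can look like, assuming a Docker-compatible registry; the registry path, tag convention, and promote() helper are illustrative, not a specific platform's API.

```python
import subprocess

def promote(digest: str, repo: str, env: str) -> None:
    """Retag an already-built image for the next environment.

    The image is addressed by its immutable digest, so staging and
    production run byte-identical artifacts. The registry and repo
    names below are placeholders for illustration.
    """
    src = f"{repo}@{digest}"        # e.g. registry.example.com/api@sha256:abc...
    dst = f"{repo}:{env}-release"   # environment-scoped tag, never rebuilt
    subprocess.run(["docker", "pull", src], check=True)
    subprocess.run(["docker", "tag", src, dst], check=True)
    subprocess.run(["docker", "push", dst], check=True)

# Usage: promote("sha256:abc123...", "registry.example.com/api", "production")
```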

03

Environment Parity and Ephemeral Environments

Staging drifts from production over time. The fix: infrastructure-as-code (Terraform/Pulumi) for both, same container images, same secrets management. Ephemeral environments (spin up a full stack per PR, tear down on merge) eliminate staging drift entirely for integration tests but increase pipeline cost ~3–5x.
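One way to sketch per-PR ephemeral environments, assuming Terraform with one workspace per pull request; the env_suffix variable and the workspace naming convention are assumptions for illustration.

```python
import subprocess

def ephemeral_env(pr_number: int, action: str = "up") -> None:
    """Spin up (or tear down) a full stack for a single pull request.

    One Terraform workspace per PR keeps state isolated; `env_suffix`
    is a hypothetical variable the Terraform config would use to
    namespace resources.
    """
    ws = f"pr-{pr_number}"
    # Switch to the PR's workspace, creating it on first use.
    if subprocess.run(["terraform", "workspace", "select", ws]).returncode != 0:
        subprocess.run(["terraform", "workspace", "new", ws], check=True)
    if action == "up":
        subprocess.run(["terraform", "apply", "-auto-approve",
                        f"-var=env_suffix={ws}"], check=True)
    else:  # called from the merge/close hook
        subprocess.run(["terraform", "destroy", "-auto-approve",
                        f"-var=env_suffix={ws}"], check=True)
```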

04

Deployment Strategy

Choose based on risk tolerance and infrastructure complexity: Blue-Green (full parallel environment, instant switch, 2x infrastructure cost), Canary (gradual traffic shift, requires traffic splitting infra, catches issues with small blast radius), Feature Flags (code deployed everywhere, visibility controlled at runtime, most flexible but adds code complexity). These are not mutually exclusive — most mature systems use all three for different risk tiers.

05

Rollback and Recovery Design

Rollback must be designed before deployment, not after an incident. Questions to answer upfront: Can you roll back a database migration? (usually no — design forward-only migrations). Can you roll back a feature flag? (yes, always). What is your target MTTR for a bad deploy? (if <5 min, you need automated rollback on error rate spike).
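A hedged sketch of automated rollback on an error-rate spike, assuming a Prometheus-style query API and a Kubernetes Deployment; the endpoint URL, metric names, threshold, and deployment name are placeholders.

```python
import subprocess
import time
import requests

PROM_URL = "http://prometheus.internal/api/v1/query"   # placeholder endpoint
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5..",app="api"}[2m]))'
    ' / sum(rate(http_requests_total{app="api"}[2m]))'
)

def watch_and_rollback(deployment: str, threshold: float = 0.02,
                       window_s: int = 300) -> bool:
    """Poll the post-deploy error rate; roll back if it spikes.

    Returns True if the deploy survived the watch window.
    """
    deadline = time.time() + window_s
    while time.time() < deadline:
        resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=5)
        result = resp.json()["data"]["result"]
        error_rate = float(result[0]["value"][1]) if result else 0.0
        if error_rate > threshold:
            # `kubectl rollout undo` reverts to the previous ReplicaSet revision.
            subprocess.run(["kubectl", "rollout", "undo",
                            f"deployment/{deployment}"], check=True)
            return False
        time.sleep(15)
    return True
```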

CI/CD Pipeline Architecture: Service + Model Deployment


The Test Pyramid in Practice: Why It Breaks and How to Fix It

Most CI pipelines invert the test pyramid over time — they accumulate integration and E2E tests because they're easier to write for complex systems, and unit test coverage atrophies. The result: a 45-minute pipeline that fails nondeterministically and gives engineers no fast feedback loop.

Symptoms of an inverted pyramid:

  • Pipeline takes >20 minutes to fail on a syntax error (unit gate is too late in the pipeline)
  • Flaky test rate >5% (E2E tests are brittle because they depend on real infrastructure)
  • Engineers disable tests locally to iterate faster (the pipeline has become an obstacle)

The fix involves three changes:

First, front-load fast gates. Lint, type checking, and unit tests should be the first thing that runs, in parallel, before anything else. If they finish in under 3 minutes and catch the failures in 30% of PRs, you've skipped the expensive downstream stages for that 30%.

Second, parallelize ruthlessly. Integration tests can be sharded by test file across multiple runners. GitHub Actions supports matrix strategies. A test suite that takes 12 minutes sequentially often runs in 4 minutes across 3 runners. The cost is marginal on CI platforms — the bottleneck is usually that teams haven't configured sharding.
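A small sketch of file-level sharding, assuming pytest and that the CI matrix injects SHARD_INDEX / SHARD_TOTAL environment variables (both names are assumptions, not a standard):

```python
import os
import subprocess
from pathlib import Path

def shard_and_run() -> int:
    """Run only this runner's slice of the integration test files.

    SHARD_INDEX / SHARD_TOTAL are assumed to be set by the CI matrix,
    e.g. 0..2 for three parallel runners.
    """
    index = int(os.environ.get("SHARD_INDEX", "0"))
    total = int(os.environ.get("SHARD_TOTAL", "1"))
    # Sort for a stable assignment, then take every `total`-th file.
    files = sorted(str(p) for p in Path("tests/integration").rglob("test_*.py"))
    my_files = [f for i, f in enumerate(files) if i % total == index]
    if not my_files:
        return 0
    return subprocess.run(["pytest", "-q", *my_files]).returncode

if __name__ == "__main__":
    raise SystemExit(shard_and_run())
```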

Third, quarantine flaky tests. A flaky test is not a test — it's noise. Create a @flaky tag and route those tests to a separate job that doesn't block deployment. Measure flakiness rate per test file (flaky rate = failures on green code / total runs). Any test above 2% flakiness gets quarantined within one sprint.
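One possible implementation of the quarantine, assuming pytest; the flaky marker name and the RUN_FLAKY environment variable are conventions invented for this sketch.

```python
# conftest.py — a minimal quarantine sketch, assuming pytest.
# Tests tagged @pytest.mark.flaky are skipped in the blocking CI job and
# only execute when RUN_FLAKY=1 (a separate, non-blocking job).
import os
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "flaky: known-nondeterministic test, quarantined from the blocking job"
    )

def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_FLAKY") == "1":
        return  # quarantine job: run everything, including flaky tests
    skip_flaky = pytest.mark.skip(reason="quarantined flaky test; runs in the non-blocking job")
    for item in items:
        if "flaky" in item.keywords:
            item.add_marker(skip_flaky)
```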

Deployment Strategy Decision Matrix

Blue-Green
  • How it works: Two identical environments; switch the load balancer from old (blue) to new (green).
  • Best for: Stateless services, fast-rollback requirements, tight MTTR SLOs.
  • Requires: 2x infrastructure; a load balancer with instant cutover.
  • Key risk: DB migration compatibility between blue and green must be maintained simultaneously.

Canary
  • How it works: Route N% of traffic to the new version, monitor, gradually increase.
  • Best for: High-traffic services where blast radius matters; ML model updates.
  • Requires: Traffic-splitting infrastructure (Nginx, Envoy, or a service mesh).
  • Key risk: Harder to debug; logs and metrics are split across two versions during rollout.

Feature Flags
  • How it works: Code is deployed everywhere; feature visibility is controlled by a flag service.
  • Best for: Risky features, A/B testing at deploy time, decoupling deploy from release.
  • Requires: A flag management service (LaunchDarkly, Unleash, or homegrown) plus an SDK in app code.
  • Key risk: Flag technical debt accumulates; stale flags add code complexity.

Rolling
  • How it works: Replace instances one by one (the Kubernetes default); old and new versions live simultaneously.
  • Best for: Kubernetes-native services where briefly mixed versions are acceptable.
  • Requires: K8s deployment config and readiness probes that actually work.
  • Key risk: Mixed-version state during rollout can expose API compatibility bugs.
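To make the feature-flag row concrete, here is a minimal homegrown-style flag check with a deterministic percentage rollout; the flag store, flag name, and rollout field are illustrative, not any particular vendor's SDK.

```python
import hashlib

FLAGS = {  # in practice this comes from a flag service, not a module constant
    "new_ranker": {"enabled": True, "rollout_pct": 10},
}

def flag_on(flag_name: str, user_id: str) -> bool:
    """Deterministic percentage rollout: the same user always gets the same
    answer, so the experience stays stable while the flag ramps up."""
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_pct"]

# In request-handling code, deploy and release stay decoupled:
# result = new_ranker(query) if flag_on("new_ranker", user_id) else old_ranker(query)
```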

The Stateful Service Problem: Database Migrations

Stateless services are straightforward to deploy with any of the above strategies. Stateful services — anything with a database — are hard because you cannot roll back a destructive schema migration.

The rule that solves most problems: design migrations as expand-then-contract, never as atomic changes.

A migration that renames a column cannot be done safely in one deploy. The sequence is:

  1. Expand: Add the new column (new_name), write to both old and new in code, read from old. Deploy.
  2. Migrate: Backfill new_name from old_name for all existing rows (can be done live with batched updates).
  3. Switch: Update reads to use new_name. Deploy.
  4. Contract: Drop old_name. Deploy.

Four deploys for one rename. This is the cost of safe migrations in a live system. Any tool or engineer who tells you it can be done in one deploy is either running a write-locked migration (downtime) or gambling that no still-running instance reads the old column while the new code rolls out.
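As a sketch of what step 1 (expand) looks like in application code, assuming a DB-API-style cursor and the old_name/new_name columns from the rename example; the table and column names are illustrative.

```python
def save_profile(cursor, user_id: int, value: str) -> None:
    """Expand phase (step 1): write to both columns, keep reading the old one.

    `old_name` / `new_name` follow the rename example above; the users table
    and the cursor object are placeholders for whatever data layer you use.
    """
    cursor.execute(
        "UPDATE users SET old_name = %s, new_name = %s WHERE id = %s",
        (value, value, user_id),
    )

def load_profile(cursor, user_id: int) -> str:
    # Still reading old_name; reads flip to new_name only in step 3, after backfill.
    cursor.execute("SELECT old_name FROM users WHERE id = %s", (user_id,))
    return cursor.fetchone()[0]
```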

For ML systems specifically: Model schema migrations (changing feature vector schema, changing output format) are even harder because the model artifact and serving code must be co-deployed. Use API versioning on the serving endpoint (/v1/predict, /v2/predict) so old model and new model can serve simultaneously during canary.
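A minimal sketch of side-by-side versioned endpoints, here using FastAPI as an assumed serving framework; the request schemas and stub models are placeholders for real artifacts loaded from a registry.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequestV1(BaseModel):
    features: list[float]          # old feature-vector schema

class PredictRequestV2(BaseModel):
    features: dict[str, float]     # new named-feature schema

# Stand-ins for models pulled from a registry at startup; real loading is elided.
class _StubModel:
    def predict(self, features):
        return 0.0

model_v1, model_v2 = _StubModel(), _StubModel()

@app.post("/v1/predict")
def predict_v1(req: PredictRequestV1) -> dict:
    return {"score": model_v1.predict(req.features), "model_version": "1"}

@app.post("/v2/predict")
def predict_v2(req: PredictRequestV2) -> dict:
    # Both schemas stay live during the canary, so traffic can shift gradually
    # instead of forcing a lockstep client/model upgrade.
    return {"score": model_v2.predict(req.features), "model_version": "2"}
```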

ML-Specific CI/CD: What's Different

Standard software CI/CD validates code correctness — does the function return the right answer? ML CI/CD must additionally validate model quality — does the model produce better predictions than the current production model? These are fundamentally different gate types.

Code gates (apply to both software and ML): unit tests, linting, type checking, contract tests.

ML-specific gates:

Offline validation: The new model must beat the current production model on a held-out evaluation set by at least X% on the primary metric (AUC, NDCG, RMSE depending on task). The threshold X is set conservatively enough that you're confident the improvement is real, not noise. Typical: 0.5–1% for mature models where baseline is already high.
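A minimal offline-gate sketch, assuming scikit-learn's roc_auc_score and an AUC-style primary metric; the 0.5% threshold mirrors the figure above, and the hard-coded toy data exists only to keep the example self-contained.

```python
import sys
from sklearn.metrics import roc_auc_score

def offline_gate(y_true, prod_scores, candidate_scores,
                 min_improvement: float = 0.005) -> bool:
    """Promotion gate: the candidate must beat production by at least
    `min_improvement` absolute AUC on the same held-out evaluation set."""
    prod_auc = roc_auc_score(y_true, prod_scores)
    cand_auc = roc_auc_score(y_true, candidate_scores)
    print(f"prod AUC={prod_auc:.4f}  candidate AUC={cand_auc:.4f}")
    return cand_auc - prod_auc >= min_improvement

if __name__ == "__main__":
    # In CI this would load the eval set and both models' scores from the
    # artifact store; toy values here just keep the sketch runnable.
    y_true = [0, 1, 1, 0, 1, 0, 1, 1]
    prod = [0.2, 0.7, 0.35, 0.4, 0.8, 0.3, 0.5, 0.9]
    cand = [0.1, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6, 0.95]
    sys.exit(0 if offline_gate(y_true, prod, cand) else 1)  # nonzero fails the pipeline
```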

Shadow mode: The new model is deployed and receives real traffic, but its outputs are not served to users — they're logged alongside the live model's outputs. Compare the distributions. Does the new model behave differently on tail cases? Does latency profile change? Shadow mode runs for 24–72 hours before canary begins.
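A sketch of the shadow-mode pattern at the serving layer, assuming both models expose a predict() method; in a real service the shadow call would run off the request path so it cannot add user-facing latency.

```python
import json
import logging
import time

shadow_log = logging.getLogger("shadow_comparison")

def predict_with_shadow(prod_model, shadow_model, features: dict) -> float:
    """Serve the production model's answer; log both outputs so the score and
    latency distributions can be compared offline before the canary starts."""
    start = time.monotonic()
    prod_out = prod_model.predict(features)
    prod_ms = (time.monotonic() - start) * 1000

    start = time.monotonic()
    shadow_out = shadow_model.predict(features)   # never returned to the user
    shadow_ms = (time.monotonic() - start) * 1000

    shadow_log.info(json.dumps({
        "features": features,
        "prod": {"score": prod_out, "latency_ms": round(prod_ms, 2)},
        "shadow": {"score": shadow_out, "latency_ms": round(shadow_ms, 2)},
    }))
    return prod_out   # only the production output is served
```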

Canary validation criteria: For ML, canary success is not just error rate and latency — it also includes online business metrics (CTR, conversion, session length depending on product). These move slower than error rates, so canary windows for ML models are typically 24–48 hours, versus 1–4 hours for pure software deploys.

TIP

Interview Summary: What to Say and What Impresses

The answer that passes: Describes a three-stage pipeline (build, test, deploy) with blue-green or canary. Gets to the next round.

The answer that impresses: (1) Justifies test pyramid ratio and explains how to fix an inverted pyramid, (2) distinguishes deployment strategies by when you'd choose each and why, (3) addresses the stateful service / DB migration problem with expand-then-contract, (4) if ML context: adds shadow mode as a gate before canary and explains why offline metrics alone are insufficient for production promotion.

The boundary to draw clearly: "This covers how I'd design the pipeline architecture proactively. Debugging what went wrong when a specific deploy broke production metrics is a different process — that's incident triage, which involves log correlation, metric decomposition, and rollback decision logic."

Interview Questions
