Model Serving Architectures: Batch vs Real-Time, Shadow Deployments & Latency Budgets
How to design the serving layer for ML models in production — when to use batch pre-computation vs real-time inference, how to safely deploy new models via shadow and canary patterns, and how to structure a multi-stage serving pipeline within a latency budget.
Batch vs Real-Time Serving — The First Decision
The single most important architectural decision in ML serving: does the user's experience tolerate pre-computed predictions, or does the model need to score fresh inputs at request time? (See Chip Huyen's ML Systems Design: https://huyenchip.com/machine-learning-systems-design/toc.html)
Batch serving (pre-computation): Run the model offline over all entities (users, items) and store predictions in a database. At request time, just do a key-value lookup. Latency: < 1ms (DB lookup). Freshness: stale by hours or days.
Use batch serving when:
- Predictions don't depend on the specific request context ('top-10 items for user X' is the same whether they request at 9am or 9:01am)
- Model is expensive (transformer-based) but candidates are bounded
- Latency budget is tight and you can't afford real-time inference
- Example: email recommendation, weekly playlist generation, pre-computed similar items
Real-time serving (on-demand inference): Score inputs at the moment of the request, using the live request context. Latency: 10–200ms. Freshness: perfect — uses the user's current state.
Use real-time when:
- Predictions depend on live context (current query, current cart state, live fraud signals)
- Personalization must reflect very recent actions (user just clicked on sneakers → real-time recommendation should surface sneakers)
- Example: search ranking, fraud scoring, real-time ad targeting
The hybrid pattern (most production systems): Pre-compute user and item embeddings offline (batch). At serving time, run the ANN search and ranking stages in real-time using those embeddings. You get freshness where it matters (ranking uses live signals) without running the expensive encoder at request time.
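A minimal sketch of the hybrid pattern below — all names (encoder, embedding_store, serve_candidates) are illustrative, and a brute-force dot product stands in for a real ANN index:

```python
import numpy as np

# --- Offline (nightly batch job): run the expensive encoder once per user ---
def batch_compute_embeddings(encoder, user_ids, embedding_store):
    for user_id in user_ids:
        embedding_store[user_id] = encoder(user_id)  # heavy transformer pass

# --- Online (per request): cheap lookup + similarity search, no encoder call ---
def serve_candidates(user_id, embedding_store, item_embeddings, k=10):
    query = embedding_store[user_id]       # O(1) lookup, no GPU needed
    scores = item_embeddings @ query       # brute-force stand-in for ANN search
    return np.argsort(-scores)[:k]         # top-k item indices

# Toy usage: a random "encoder" and 1000 items with 64-dim embeddings
rng = np.random.default_rng(0)
store = {}
batch_compute_embeddings(lambda uid: rng.standard_normal(64), ["u1"], store)
items = rng.standard_normal((1000, 64))
print(serve_candidates("u1", store, items))
```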
Batch vs Real-Time Serving Decision and Architecture
Serving Architecture Decision — Key Questions to Ask First
What is the required latency (P99 SLA)?
- < 50ms: pre-computed results cached in Redis, or a very small model running in-process.
- 50–200ms: real-time inference with feature fetch + a small model is feasible.
- 200ms–2s: larger models on GPU, with room for precomputed components.
- > 2s: batch serving or async (compute results in the background, surface on next page load) is acceptable and dramatically cheaper.
How often do predictions need to be updated?
Predictions valid for hours (newsletter recommendations, weekly content digest): batch inference, compute offline, store in Redis/DynamoDB, read at serve time. Predictions that depend on real-time context (fraud, live ranking with session features, search): only real-time serving works — you cannot precompute because the input changes per request.
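A minimal sketch of that batch-write / online-read split, assuming redis-py and a reachable Redis instance; `model.score_all_items` is a hypothetical model call:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def batch_job(model, user_ids, top_k=10, ttl_hours=24):
    """Offline: score every user, store top-k with a TTL matching the staleness tolerance."""
    for user_id in user_ids:
        recs = model.score_all_items(user_id)[:top_k]  # hypothetical model API
        r.setex(f"recs:{user_id}", ttl_hours * 3600, json.dumps(recs))

def serve(user_id, fallback):
    """Online: a sub-millisecond key-value read; fall back if the batch job hasn't run yet."""
    cached = r.get(f"recs:{user_id}")
    return json.loads(cached) if cached else fallback
```

The TTL encodes the freshness contract explicitly: if the batch job fails, stale predictions expire rather than being served indefinitely.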
What is the model size and computational cost?
Lightweight models (LightGBM, small MLP): CPU serving at 5–20ms, no GPU needed, scale horizontally. Large models (transformer ranker, LLM): GPU serving required, higher infrastructure cost, must use batching (static or continuous) to amortize GPU overhead. Hybrid: run light model on CPU for latency-critical path; heavy model offline for batch scoring.
What deployment risk are you willing to accept at rollout?
Zero risk tolerance: shadow deployment first (parallel scoring, no user impact). Low risk: canary to 1–5% traffic, monitor for 48 hours. Standard: 10% canary for 24 hours, then ramp. The shadow → canary → ramp sequence is industry standard. Blue-green is for systems where instantaneous traffic switch is preferred over gradual ramp (rare in ML).
Deployment Strategies — How to Ship Without Breaking Production
Deploying a new model version is the highest-risk operation in ML engineering. The model has passed offline evaluation but might behave differently in production due to distribution shift, feature differences, or unexpected user behavior patterns. Three strategies manage this risk:
Shadow deployment (zero risk): The new model runs in parallel with the production model, receiving the same traffic. Both models score every request, but only the production model's predictions are served to users. Shadow model logs its predictions for comparison. Use this to: verify prediction consistency, catch feature encoding bugs, measure latency of new model under real load, compare prediction distributions. Duration: 24–72 hours. If shadow model predictions are statistically similar to production and latency is within budget → promote to canary.
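A sketch of the shadow-scoring wrapper — the names (prod_model, shadow_model, log_fn) are illustrative. The key property is that shadow work stays off the request's critical path and shadow failures never reach users:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def predict_with_shadow(prod_model, shadow_model, features, log_fn):
    prod_pred = prod_model.predict(features)  # user-facing path, unchanged

    def _shadow():
        try:
            shadow_pred = shadow_model.predict(features)
            log_fn({"prod": prod_pred, "shadow": shadow_pred})
        except Exception as exc:  # a shadow failure must never hurt users
            log_fn({"shadow_error": str(exc)})

    _pool.submit(_shadow)  # runs off the request's critical path
    return prod_pred       # users only ever see the production prediction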
Canary deployment (low risk): Route a small fraction of traffic (1–5%) to the new model. Real users are affected. Monitor online metrics (CTR, conversion, latency, error rate) for the canary group vs control. If metrics are neutral or positive after 48–72 hours → expand to 10% → 25% → 50% → 100%. Rollback trigger: any guard-rail metric (revenue, error rate, p99 latency) degrades beyond pre-defined threshold.
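A sketch of deterministic canary routing with a guard-rail check; the hash-based bucketing is a standard technique, but the thresholds and the metrics dictionaries are illustrative assumptions:

```python
import hashlib

def in_canary(user_id: str, canary_pct: float) -> bool:
    """Deterministic bucketing: the same user always hits the same model."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_pct * 100

def route(user_id, prod_model, canary_model, features, canary_pct=0.05):
    model = canary_model if in_canary(user_id, canary_pct) else prod_model
    return model.predict(features)

def should_rollback(canary_metrics, control_metrics):
    """Pre-defined guard-rails: any single breach triggers an automatic rollback."""
    return (canary_metrics["error_rate"] > control_metrics["error_rate"] * 1.5
            or canary_metrics["p99_ms"] > control_metrics["p99_ms"] * 1.2
            or canary_metrics["ctr"] < control_metrics["ctr"] * 0.98)
```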
Blue-green deployment (clean rollback): Two identical serving environments. 'Blue' serves production traffic. 'Green' runs the new model. Switch traffic from blue to green instantaneously at the load balancer. Rollback: switch back to blue (< 1 minute). Downside: requires 2× serving infrastructure during transition.
The shadow → canary → full rollout sequence is standard at mature ML teams. Never go directly to full rollout without at least a canary.
Serving Architecture Patterns — When to Use Each
| Pattern | Latency | Freshness | Cost | Best Use Case | Production Example |
|---|---|---|---|---|---|
| Batch precomputation (cache + read) | < 5ms read | Hours to days | Low | Recommendations that don't depend on real-time context: email digest, weekly playlists | Netflix 'Top Picks' row, Spotify Discover Weekly |
| Real-time CPU inference | 10–50ms | Per-request | Medium | Lightweight models with real-time context: fraud scoring, personalized ranking with fresh features | Stripe fraud (GBDT), Twitter light ranker |
| Real-time GPU inference | 20–100ms | Per-request | High | Large neural rankers, embedding generation, LLM completion | YouTube deep ranker, GPT-4.1 API |
| Async/nearline inference | 1–60 sec (buffered) | Near-real-time | Medium | Notification ranking where slight delay is acceptable; batch triggered by events | LinkedIn notification pipeline, Airbnb pricing updates |
| Disaggregated prefill-decode (LLM) | TTFT < 500ms, streaming | Per-request | Very high | High-throughput LLM with mixed short/long prompts where TTFT and TPOT need separate optimization | vLLM disaggregated, Google Gemini serving |
The Silent Serving Bug: Feature Mismatch at Deployment
The most common failure when deploying a new model: the serving code uses a different feature encoding than the training code. For example, the model was trained with user_age = current_year - birth_year, but serving computes user_age = current_year - birth_year - 1 (an off-by-one in the birthday calculation). The model silently serves degraded predictions — no error, no exception.
Shadow deployment catches this because you can compare the shadow model's prediction distribution to your offline evaluation predictions on the same inputs. If they diverge, there's a serving bug.
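One way to automate that comparison is a two-sample Kolmogorov–Smirnov test over the two prediction distributions (scipy assumed available; the alpha threshold is illustrative):

```python
from scipy.stats import ks_2samp

def predictions_diverge(shadow_preds, offline_preds, alpha=0.01):
    """Flag a likely serving bug if the shadow model's live predictions differ
    statistically from offline predictions on the same inputs."""
    statistic, p_value = ks_2samp(shadow_preds, offline_preds)
    return p_value < alpha  # reject "same distribution" => audit the feature pipeline
```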
Permanent fix: feature store with registered feature definitions. The model was trained using features from the store. The serving pipeline fetches features from the same store using the same definitions. The encoding is never re-implemented independently.
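A toy sketch of the idea — not any specific feature store's API (Feast, Tecton, and others each differ) — showing a single registered definition consumed by both paths:

```python
from datetime import date

FEATURE_REGISTRY = {
    # Single source of truth: the encoding lives in exactly one place.
    "user_age": lambda row: date.today().year - row["birth_year"],
}

def build_training_row(raw_row):
    return {name: fn(raw_row) for name, fn in FEATURE_REGISTRY.items()}

def build_serving_row(raw_row):
    # Identical code path: the off-by-one re-implementation bug cannot occur.
    return {name: fn(raw_row) for name, fn in FEATURE_REGISTRY.items()}
```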
Latency Budget Decomposition — Designing for End-to-End SLAs
Every production ML serving system has a latency budget — the maximum time the user will wait before seeing a result. The serving architecture must be designed to fit within it.
A real-time recommendation system with a 200ms user-visible budget:
| Stage | Component | Typical Latency | Notes |
|---|---|---|---|
| Network | Client → CDN → Load balancer | 10–20ms | Geography-dependent |
| Feature fetch | Redis online store (10 features) | 2–5ms | Batch fetch, not serial |
| ANN retrieval | FAISS IVF on 500M items → 1000 candidates | ~25–35ms | Second-largest stage |
| Pre-ranker | LightGBM on 1000 candidates, CPU | 10–20ms | CPU parallelism |
| Neural ranker | Transformer on 200 candidates, GPU | 40–80ms | Dominant stage |
| Post-processing | Diversity, freshness rules | 2–5ms | |
| Network return | Service → client | 10–20ms | |
Total: ~100–185ms. Within 200ms budget with headroom.
Common mistake: designing each stage in isolation without summing. A 'reasonable' 50ms per stage × 5 stages = 250ms — over budget before accounting for network.
The cascade rule: each stage in a multi-stage pipeline should shrink the candidate set so that the next, more expensive stage runs on fewer items. Skipping the pre-ranker and sending all 1000 retrieval candidates straight to the neural ranker means running the most expensive stage at full scale — acceptable only if load tests confirm the ranker stays within budget at that volume.
Multi-Stage Serving Pipeline with Latency Tracking
```python
import time
from dataclasses import dataclass
from typing import List


@dataclass
class ServingMetrics:
    feature_fetch_ms: float = 0.0
    retrieval_ms: float = 0.0
    pre_rank_ms: float = 0.0
    rank_ms: float = 0.0
    total_ms: float = 0.0


def log_latency_violation(user_id: str, metrics: ServingMetrics) -> None:
    # Stand-in for your monitoring hook (emit a metric / alert on p99 breaches)
    print(f"latency budget exceeded for {user_id}: {metrics.total_ms:.1f}ms")


class RecommendationPipeline:
    LATENCY_BUDGET_MS = 150  # Internal budget (200ms user-visible - 50ms network)
    RETRIEVAL_CANDIDATES = 1000
    PRE_RANK_CANDIDATES = 200
    FINAL_K = 20

    def __init__(self, feature_store, ann_index, pre_ranker, neural_ranker):
        self.feature_store = feature_store
        self.ann_index = ann_index
        self.pre_ranker = pre_ranker
        self.neural_ranker = neural_ranker

    def serve(self, user_id: str, context: dict) -> tuple[List[str], ServingMetrics]:
        metrics = ServingMetrics()
        start = time.perf_counter()

        # Stage 1: Feature fetch (batch call, not serial per feature)
        t0 = time.perf_counter()
        user_features = self.feature_store.get_online_features(
            entity_id=user_id,
            feature_names=["user_embedding", "user_history_30d", "user_session"],
        )
        metrics.feature_fetch_ms = (time.perf_counter() - t0) * 1000

        # Stage 2: ANN retrieval — 500M items → 1000 candidates
        t0 = time.perf_counter()
        candidates = self.ann_index.search(
            query=user_features["user_embedding"],
            k=self.RETRIEVAL_CANDIDATES,
        )
        metrics.retrieval_ms = (time.perf_counter() - t0) * 1000

        # Stage 3: Pre-ranker — 1000 → 200 (CPU, fast)
        t0 = time.perf_counter()
        candidates = self.pre_ranker.rank(
            user_features, candidates, top_k=self.PRE_RANK_CANDIDATES
        )
        metrics.pre_rank_ms = (time.perf_counter() - t0) * 1000

        # Stage 4: Neural ranker — 200 → 20 (GPU)
        t0 = time.perf_counter()
        results = self.neural_ranker.rank(
            user_features, candidates, top_k=self.FINAL_K
        )
        metrics.rank_ms = (time.perf_counter() - t0) * 1000

        metrics.total_ms = (time.perf_counter() - start) * 1000

        # Log the latency breakdown for monitoring; alert on budget breaches
        if metrics.total_ms > self.LATENCY_BUDGET_MS:
            log_latency_violation(user_id, metrics)

        return results, metrics
```
Interview Checklist: Model Serving
When designing the serving layer in an ML system design interview, cover these four points:
- Batch vs real-time decision with justification: 'This model uses the user's current session context, so we need real-time inference. If it only used daily-aggregated features, batch pre-computation would be sufficient.'
- Deployment safety: 'New model versions go through shadow deployment first (24–48hr) to verify prediction consistency and latency, then canary at 1–5% traffic with online metric monitoring before full rollout.'
- Latency budget breakdown: 'Our 200ms budget allocates: 5ms feature fetch, 30ms ANN retrieval, 15ms pre-ranker, 60ms neural ranker, 10ms post-processing. That leaves an 80ms buffer for network and tail latency.'
- Rollback mechanism: 'The model registry has both the new and previous model version active. If online metrics degrade during canary, we re-route traffic to the previous version in < 2 minutes via a load balancer weight change.'