Model Serving Architectures: Batch vs Real-Time, Shadow Deployments & Latency Budgets
How to design the serving layer for ML models in production — when to use batch pre-computation vs real-time inference, how to safely deploy new models via shadow and canary patterns, and how to structure a multi-stage serving pipeline within a latency budget.
Batch vs Real-Time Serving — The First Decision
The single most important architectural decision in ML serving: does the user's experience tolerate pre-computed predictions, or does the model need to score fresh inputs at request time? (See Chip Huyen's ML Systems Design: https://huyenchip.com/machine-learning-systems-design/toc.html)
Batch serving (pre-computation): Run the model offline over all entities (users, items) and store predictions in a database. At request time, just do a key-value lookup. Latency: < 1ms (DB lookup). Freshness: stale by hours or days.
Use batch serving when:
- Predictions don't depend on the specific request context ('top-10 items for user X' is the same whether they request at 9am or 9:01am)
- Model is expensive (transformer-based) but candidates are bounded
- Latency budget is tight and you can't afford real-time inference
- Example: email recommendation, weekly playlist generation, pre-computed similar items
Real-time serving (on-demand inference): Score inputs at the moment of the request, using the live request context. Latency: 10–200ms. Freshness: perfect — uses the user's current state.
Use real-time when:
- Predictions depend on live context (current query, current cart state, live fraud signals)
- Personalization must reflect very recent actions (user just clicked on sneakers → real-time recommendation should surface sneakers)
- Example: search ranking, fraud scoring, real-time ad targeting
The hybrid pattern (most production systems): Pre-compute user and item embeddings offline (batch). At serving time, run the ANN search and ranking stages in real-time using those embeddings. You get freshness where it matters (ranking uses live signals) without running the expensive encoder at request time.
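A minimal sketch of the hybrid pattern below — all names (encoder, embedding_store, serve_candidates) are illustrative, and a brute-force dot product stands in for a real ANN index:

```python
import numpy as np

# --- Offline (nightly batch job): run the expensive encoder once per user ---
def batch_compute_embeddings(encoder, user_ids, embedding_store):
    for user_id in user_ids:
        embedding_store[user_id] = encoder(user_id)  # heavy transformer pass

# --- Online (per request): cheap lookup + similarity search, no encoder call ---
def serve_candidates(user_id, embedding_store, item_embeddings, k=10):
    query = embedding_store[user_id]       # O(1) lookup, no GPU needed
    scores = item_embeddings @ query       # brute-force stand-in for ANN search
    return np.argsort(-scores)[:k]         # top-k item indices

# Toy usage: a random "encoder" and 1000 items with 64-dim embeddings
rng = np.random.default_rng(0)
store = {}
batch_compute_embeddings(lambda uid: rng.standard_normal(64), ["u1"], store)
items = rng.standard_normal((1000, 64))
print(serve_candidates("u1", store, items))
```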
Batch vs Real-Time Serving Decision and Architecture
Serving Architecture Decision — Key Questions to Ask First
What is the required latency (P99 SLA)?
- < 50ms: pre-computed results cached in Redis, or a very small model running in-process.
- 50–200ms: real-time inference with feature fetch + a small model is feasible.
- 200ms–2s: larger models on GPU, with room for precomputed components.
- > 2s: batch serving or async (compute results in the background, surface on next page load) is acceptable and dramatically cheaper.
How often do predictions need to be updated?
Predictions valid for hours (newsletter recommendations, weekly content digest): batch inference, compute offline, store in Redis/DynamoDB, read at serve time. Predictions that depend on real-time context (fraud, live ranking with session features, search): only real-time serving works — you cannot precompute because the input changes per request.
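A minimal sketch of that batch-write / online-read split, assuming redis-py and a reachable Redis instance; `model.score_all_items` is a hypothetical model call:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def batch_job(model, user_ids, top_k=10, ttl_hours=24):
    """Offline: score every user, store top-k with a TTL matching the staleness tolerance."""
    for user_id in user_ids:
        recs = model.score_all_items(user_id)[:top_k]  # hypothetical model API
        r.setex(f"recs:{user_id}", ttl_hours * 3600, json.dumps(recs))

def serve(user_id, fallback):
    """Online: a sub-millisecond key-value read; fall back if the batch job hasn't run yet."""
    cached = r.get(f"recs:{user_id}")
    return json.loads(cached) if cached else fallback
```

The TTL encodes the freshness contract explicitly: if the batch job fails, stale predictions expire rather than being served indefinitely.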
What is the model size and computational cost?
Lightweight models (LightGBM, small MLP): CPU serving at 5–20ms, no GPU needed, scale horizontally. Large models (transformer ranker, LLM): GPU serving required, higher infrastructure cost, must use batching (static or continuous) to amortize GPU overhead. Hybrid: run light model on CPU for latency-critical path; heavy model offline for batch scoring.
What deployment risk are you willing to accept at rollout?
Zero risk tolerance: shadow deployment first (parallel scoring, no user impact). Low risk: canary to 1–5% traffic, monitor for 48 hours. Standard: 10% canary for 24 hours, then ramp. The shadow → canary → ramp sequence is industry standard. Blue-green is for systems where instantaneous traffic switch is preferred over gradual ramp (rare in ML).
Deployment Strategies — How to Ship Without Breaking Production
Deploying a new model version is the highest-risk operation in ML engineering. The model has passed offline evaluation but might behave differently in production due to distribution shift, feature differences, or unexpected user behavior patterns. Three strategies manage this risk:
Shadow deployment (zero risk): The new model runs in parallel with the production model, receiving the same traffic. Both models score every request, but only the production model's predictions are served to users. Shadow model logs its predictions for comparison. Use this to: verify prediction consistency, catch feature encoding bugs, measure latency of new model under real load, compare prediction distributions. Duration: 24–72 hours. If shadow model predictions are statistically similar to production and latency is within budget → promote to canary.
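A sketch of the shadow-scoring wrapper — the names (prod_model, shadow_model, log_fn) are illustrative. The key property is that shadow work stays off the request's critical path and shadow failures never reach users:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def predict_with_shadow(prod_model, shadow_model, features, log_fn):
    prod_pred = prod_model.predict(features)  # user-facing path, unchanged

    def _shadow():
        try:
            shadow_pred = shadow_model.predict(features)
            log_fn({"prod": prod_pred, "shadow": shadow_pred})
        except Exception as exc:  # a shadow failure must never hurt users
            log_fn({"shadow_error": str(exc)})

    _pool.submit(_shadow)  # runs off the request's critical path
    return prod_pred       # users only ever see the production prediction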
Canary deployment (low risk): Route a small fraction of traffic (1–5%) to the new model. Real users are affected. Monitor online metrics (CTR, conversion, latency, error rate) for the canary group vs control. If metrics are neutral or positive after 48–72 hours → expand to 10% → 25% → 50% → 100%. Rollback trigger: any guard-rail metric (revenue, error rate, p99 latency) degrades beyond pre-defined threshold.
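A sketch of deterministic canary routing with a guard-rail check; the hash-based bucketing is a standard technique, but the thresholds and the metrics dictionaries are illustrative assumptions:

```python
import hashlib

def in_canary(user_id: str, canary_pct: float) -> bool:
    """Deterministic bucketing: the same user always hits the same model."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_pct * 100

def route(user_id, prod_model, canary_model, features, canary_pct=0.05):
    model = canary_model if in_canary(user_id, canary_pct) else prod_model
    return model.predict(features)

def should_rollback(canary_metrics, control_metrics):
    """Pre-defined guard-rails: any single breach triggers an automatic rollback."""
    return (canary_metrics["error_rate"] > control_metrics["error_rate"] * 1.5
            or canary_metrics["p99_ms"] > control_metrics["p99_ms"] * 1.2
            or canary_metrics["ctr"] < control_metrics["ctr"] * 0.98)
```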
Blue-green deployment (clean rollback): Two identical serving environments. 'Blue' serves production traffic. 'Green' runs the new model. Switch traffic from blue to green instantaneously at the load balancer. Rollback: switch back to blue (< 1 minute). Downside: requires 2× serving infrastructure during transition.
The shadow → canary → full rollout sequence is standard at mature ML teams. Never go directly to full rollout without at least a canary.
Serving Architecture Patterns — When to Use Each
| Pattern | Latency | Freshness | Cost | Best Use Case | Production Example |
|---|---|---|---|---|---|
| Batch precomputation (cache + read) | < 5ms read | Hours to days | Low | Recommendations that don't depend on real-time context: email digest, weekly playlists | Netflix 'Top Picks' row, Spotify Discover Weekly |
| Real-time CPU inference | 10–50ms | Per-request | Medium | Lightweight models with real-time context: fraud scoring, personalized ranking with fresh features | Stripe fraud (GBDT), Twitter light ranker |
| Real-time GPU inference | 20–100ms | Per-request | High | Large neural rankers, embedding generation, LLM completion | YouTube deep ranker, GPT-4.1 API |
| Async/nearline inference | 1–60 sec (buffered) | Near-real-time | Medium | Notification ranking where slight delay is acceptable; batch triggered by events | LinkedIn notification pipeline, Airbnb pricing updates |
| Disaggregated prefill-decode (LLM) | TTFT < 500ms, streaming | Per-request | Very high | High-throughput LLM with mixed short/long prompts where TTFT and TPOT need separate optimization | vLLM disaggregated, Google Gemini serving |
The Silent Serving Bug: Feature Mismatch at Deployment
The most common failure when deploying a new model: the serving code uses a different feature encoding than the training code. For example, the model was trained with user_age = current_year - birth_year, but serving computes user_age = current_year - birth_year - 1 (an off-by-one in the birthday calculation). The model silently serves degraded predictions — no error, no exception.
Shadow deployment catches this because you can compare the shadow model's prediction distribution to your offline evaluation predictions on the same inputs. If they diverge, there's a serving bug.
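One way to automate that comparison is a two-sample Kolmogorov–Smirnov test over the two prediction distributions (scipy assumed available; the alpha threshold is illustrative):

```python
from scipy.stats import ks_2samp

def predictions_diverge(shadow_preds, offline_preds, alpha=0.01):
    """Flag a likely serving bug if the shadow model's live predictions differ
    statistically from offline predictions on the same inputs."""
    statistic, p_value = ks_2samp(shadow_preds, offline_preds)
    return p_value < alpha  # reject "same distribution" => audit the feature pipeline
```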
Permanent fix: feature store with registered feature definitions. The model was trained using features from the store. The serving pipeline fetches features from the same store using the same definitions. The encoding is never re-implemented independently.
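A toy sketch of the idea — not any specific feature store's API (Feast, Tecton, and others each differ) — showing a single registered definition consumed by both paths:

```python
from datetime import date

FEATURE_REGISTRY = {
    # Single source of truth: the encoding lives in exactly one place.
    "user_age": lambda row: date.today().year - row["birth_year"],
}

def build_training_row(raw_row):
    return {name: fn(raw_row) for name, fn in FEATURE_REGISTRY.items()}

def build_serving_row(raw_row):
    # Identical code path: the off-by-one re-implementation bug cannot occur.
    return {name: fn(raw_row) for name, fn in FEATURE_REGISTRY.items()}
```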
Latency Budget Decomposition — Designing for End-to-End SLAs
Every production ML serving system has a latency budget — the maximum time the user will wait before seeing a result. The serving architecture must be designed to fit within it.
A real-time recommendation system with a 200ms user-visible budget:
| Stage | Component | Typical Latency | Notes |
|---|---|---|---|
| Network | Client → CDN → Load balancer | 10–20ms | Geography-dependent |
| Feature fetch | Redis online store (10 features) | 2–5ms | Batch fetch, not serial |
| ANN retrieval | FAISS IVF on 500M items → 1000 candidates | ~25–35ms | Second-largest stage |
| Pre-ranker | LightGBM on 1000 candidates, CPU | 10–20ms | CPU parallelism |
| Neural ranker | Transformer on 200 candidates, GPU | 40–80ms | Dominant stage |
| Post-processing | Diversity, freshness rules | 2–5ms | |
| Network return | Service → client | 10–20ms | |
Total: ~100–185ms. Within 200ms budget with headroom.
Common mistake: designing each stage in isolation without summing. A 'reasonable' 50ms per stage × 5 stages = 250ms — over budget before accounting for network.
The cascade rule: each stage in a multi-stage pipeline should shrink the candidate set so that the next, more expensive stage runs on fewer items. Skipping the pre-ranker and sending all 1000 retrieval candidates straight to the neural ranker means running the most expensive stage at full scale — acceptable only if load tests confirm the ranker stays within budget at that volume.
Multi-Stage Serving Pipeline with Latency Tracking
```python
import time
from dataclasses import dataclass
from typing import List


@dataclass
class ServingMetrics:
    feature_fetch_ms: float = 0.0
    retrieval_ms: float = 0.0
    pre_rank_ms: float = 0.0
    rank_ms: float = 0.0
    total_ms: float = 0.0


def log_latency_violation(user_id: str, metrics: ServingMetrics) -> None:
    # Stand-in for your monitoring hook (emit a metric / alert on p99 breaches)
    print(f"latency budget exceeded for {user_id}: {metrics.total_ms:.1f}ms")


class RecommendationPipeline:
    LATENCY_BUDGET_MS = 150  # Internal budget (200ms user-visible - 50ms network)
    RETRIEVAL_CANDIDATES = 1000
    PRE_RANK_CANDIDATES = 200
    FINAL_K = 20

    def __init__(self, feature_store, ann_index, pre_ranker, neural_ranker):
        self.feature_store = feature_store
        self.ann_index = ann_index
        self.pre_ranker = pre_ranker
        self.neural_ranker = neural_ranker

    def serve(self, user_id: str, context: dict) -> tuple[List[str], ServingMetrics]:
        metrics = ServingMetrics()
        start = time.perf_counter()

        # Stage 1: Feature fetch (batch call, not serial per feature)
        t0 = time.perf_counter()
        user_features = self.feature_store.get_online_features(
            entity_id=user_id,
            feature_names=["user_embedding", "user_history_30d", "user_session"],
        )
        metrics.feature_fetch_ms = (time.perf_counter() - t0) * 1000

        # Stage 2: ANN retrieval — 500M items → 1000 candidates
        t0 = time.perf_counter()
        candidates = self.ann_index.search(
            query=user_features["user_embedding"],
            k=self.RETRIEVAL_CANDIDATES,
        )
        metrics.retrieval_ms = (time.perf_counter() - t0) * 1000

        # Stage 3: Pre-ranker — 1000 → 200 (CPU, fast)
        t0 = time.perf_counter()
        candidates = self.pre_ranker.rank(
            user_features, candidates, top_k=self.PRE_RANK_CANDIDATES
        )
        metrics.pre_rank_ms = (time.perf_counter() - t0) * 1000

        # Stage 4: Neural ranker — 200 → 20 (GPU)
        t0 = time.perf_counter()
        results = self.neural_ranker.rank(
            user_features, candidates, top_k=self.FINAL_K
        )
        metrics.rank_ms = (time.perf_counter() - t0) * 1000

        metrics.total_ms = (time.perf_counter() - start) * 1000

        # Log the latency breakdown for monitoring; alert on budget breaches
        if metrics.total_ms > self.LATENCY_BUDGET_MS:
            log_latency_violation(user_id, metrics)

        return results, metrics
```
Interview Checklist: Model Serving
When designing the serving layer in an ML system design interview, cover these four points:
- Batch vs real-time decision with justification: 'This model uses the user's current session context, so we need real-time inference. If it only used daily-aggregated features, batch pre-computation would be sufficient.'
- Deployment safety: 'New model versions go through shadow deployment first (24–48hr) to verify prediction consistency and latency, then canary at 1–5% traffic with online metric monitoring before full rollout.'
- Latency budget breakdown: 'Our 200ms budget allocates: 5ms feature fetch, 30ms ANN retrieval, 15ms pre-ranker, 60ms neural ranker, 10ms post-processing. That leaves an 80ms buffer for network and tail latency.'
- Rollback mechanism: 'The model registry has both the new and previous model version active. If online metrics degrade during canary, we re-route traffic to the previous version in < 2 minutes via a load balancer weight change.'