LLM Serving at Scale: vLLM, KV Cache, Batching, and LLMOps
The engineering behind serving large language models at high throughput and low latency. Covers prefill/decode distinction, KV cache memory math, PagedAttention, continuous batching, speculative decoding, Flash Attention, MQA/GQA, quantization, and the LLMOps discipline (versioning, release gates, canary, rollback) needed to deploy them safely.
Why LLM Serving is Hard
LLMs have two distinct computational phases with radically different resource profiles.
Prefill phase — processes all input tokens simultaneously in a single forward pass. Resource profile: compute-bound — GPU utilization near 100%.
Decode phase — generates one token at a time, each token requiring a full forward pass through the entire model. Resource profile: memory-bandwidth-bound — GPU utilization only 30–40% naively.
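A minimal sketch of the two phases, written against the Hugging Face transformers API (the model, prompt, and generation length are illustrative assumptions; any causal LM with `use_cache=True` behaves the same way). The prompt is processed in one batched forward pass, then every new token costs a full forward pass that reuses the KV cache:

```python
# Prefill vs. decode with a Hugging Face-style causal LM (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Explain KV caching:", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over ALL prompt tokens at once (compute-bound).
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values                          # the KV cache
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    generated = [next_id]
    # Decode: one forward pass PER token, reusing the cache (bandwidth-bound).
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```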
The root cause: each decode step must read all model weights (~140 GB for a 70B model in FP16) from HBM just to compute one output token. The hardware can compute far faster than it can move data.
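A back-of-envelope calculation makes the bottleneck concrete. Assuming FP16 weights and roughly 3.35 TB/s of HBM bandwidth (an H100-class figure; the exact numbers are assumptions for illustration), the weight reads alone cap single-stream decode speed:

```python
# Back-of-envelope decode ceiling for a 70B model in FP16 on one GPU
# with an assumed ~3.35 TB/s of HBM bandwidth (H100-class figure).
params = 70e9
bytes_per_param = 2                       # FP16
weight_bytes = params * bytes_per_param   # ~140 GB read per decode step

hbm_bandwidth = 3.35e12                   # bytes/s, assumed

seconds_per_token = weight_bytes / hbm_bandwidth
print(f"{seconds_per_token * 1000:.1f} ms/token, "
      f"~{1 / seconds_per_token:.0f} tokens/s per decode stream")
# -> roughly 42 ms/token, ~24 tokens/s: the bandwidth ceiling at batch size 1,
#    ignoring the (smaller) KV-cache reads. Batching amortizes the weight reads
#    across requests, which is why continuous batching matters so much.
```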
Two key metrics define a serving system:
- TTFT (Time To First Token) — dominated by prefill speed
- TBT (Time Between Tokens) — dominated by decode memory bandwidth
Every major optimization in LLM serving targets one of these two metrics, or both.
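Both metrics are easy to measure from any streaming endpoint. A sketch, assuming a hypothetical `stream_tokens` iterator that yields tokens as they arrive (for example from an OpenAI-compatible streaming API or a local decode loop):

```python
# Measure TTFT and average TBT from any streaming token iterator.
import time

def measure_latency(stream_tokens):
    start = time.perf_counter()
    ttft = None
    gaps = []
    prev = start
    for _ in stream_tokens:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start        # Time To First Token: queueing + prefill
        else:
            gaps.append(now - prev)   # Time Between Tokens: decode speed
        prev = now
    tbt = sum(gaps) / len(gaps) if gaps else float("nan")
    return ttft, tbt
```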