LLM Serving at Scale: vLLM, KV Cache, Batching, and LLMOps
The engineering behind serving large language models at high throughput and low latency. Covers prefill/decode distinction, KV cache memory math, PagedAttention, continuous batching, speculative decoding, Flash Attention, MQA/GQA, quantization, and the LLMOps discipline (versioning, release gates, canary, rollback) needed to deploy them safely.
Why LLM Serving is Hard
LLMs have two distinct computational phases with radically different resource profiles.
Prefill phase — processes all input tokens simultaneously in a single forward pass. Resource profile: compute-bound — GPU utilization near 100%.
Decode phase — generates one token at a time, each token requiring a full forward pass through the entire model. Resource profile: memory-bandwidth-bound — GPU utilization only 30–40% naively.
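A minimal sketch of the two phases, written against the Hugging Face transformers API (the model, prompt, and generation length are illustrative assumptions; any causal LM with `use_cache=True` behaves the same way). The prompt is processed in one batched forward pass, then every new token costs a full forward pass that reuses the KV cache:

```python
# Prefill vs. decode with a Hugging Face-style causal LM (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Explain KV caching:", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over ALL prompt tokens at once (compute-bound).
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values                          # the KV cache
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    generated = [next_id]
    # Decode: one forward pass PER token, reusing the cache (bandwidth-bound).
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```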
The root cause: each decode step must read all model weights (~140 GB for a 70B model in FP16) from HBM just to compute one output token. The hardware can compute far faster than it can move data.
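A back-of-envelope calculation makes the bottleneck concrete. Assuming FP16 weights and roughly 3.35 TB/s of HBM bandwidth (an H100-class figure; the exact numbers are assumptions for illustration), the weight reads alone cap single-stream decode speed:

```python
# Back-of-envelope decode ceiling for a 70B model in FP16 on one GPU
# with an assumed ~3.35 TB/s of HBM bandwidth (H100-class figure).
params = 70e9
bytes_per_param = 2                       # FP16
weight_bytes = params * bytes_per_param   # ~140 GB read per decode step

hbm_bandwidth = 3.35e12                   # bytes/s, assumed

seconds_per_token = weight_bytes / hbm_bandwidth
print(f"{seconds_per_token * 1000:.1f} ms/token, "
      f"~{1 / seconds_per_token:.0f} tokens/s per decode stream")
# -> roughly 42 ms/token, ~24 tokens/s: the bandwidth ceiling at batch size 1,
#    ignoring the (smaller) KV-cache reads. Batching amortizes the weight reads
#    across requests, which is why continuous batching matters so much.
```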
Two key metrics define a serving system:
- TTFT (Time To First Token) — dominated by prefill speed
- TBT (Time Between Tokens) — dominated by decode memory bandwidth
Every major optimization in LLM serving targets one of these two metrics, or both.
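Both metrics are easy to measure from any streaming endpoint. A sketch, assuming a hypothetical `stream_tokens` iterator that yields tokens as they arrive (for example from an OpenAI-compatible streaming API or a local decode loop):

```python
# Measure TTFT and average TBT from any streaming token iterator.
import time

def measure_latency(stream_tokens):
    start = time.perf_counter()
    ttft = None
    gaps = []
    prev = start
    for _ in stream_tokens:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start        # Time To First Token: queueing + prefill
        else:
            gaps.append(now - prev)   # Time Between Tokens: decode speed
        prev = now
    tbt = sum(gaps) / len(gaps) if gaps else float("nan")
    return ttft, tbt
```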