ML System Design: LLM Serving Systems
Design a production LLM serving system from first principles — covering PagedAttention and KV cache management, continuous batching for 2-5× throughput gains, multi-LoRA serving with S-LoRA and InfiniLoRA, the full RLHF pipeline (SFT → reward model → PPO vs DPO vs GRPO), and cost-per-token engineering. Includes break-even analysis for self-hosting vs cloud, failure mode catalog, and what each interview level must cover.
The Core Serving Problem — KV Cache and Memory Fragmentation
Building an LLM is a research problem. Serving an LLM profitably is an engineering problem. They require completely different skills, and most interview prep has not caught up.
The standard MLSD guidance on LLM serving: "use a GPU cluster, batch requests, return completions." That answer is as useful as saying "use a database" when asked to design a payment system.
LLMs generate text token-by-token in an autoregressive loop. Each forward pass generates exactly one token. To generate 200 tokens, you run 200 forward passes. This creates constraints that don't exist in other ML serving contexts.
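A minimal sketch of that loop, using Hugging Face transformers with GPT-2 as a small stand-in (a 7B model runs the exact same loop): the prompt is processed once, then every additional token costs one full forward pass, and the KV cache (`past_key_values`) grows by one token per step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a small stand-in; a 7B model runs the exact same loop.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The KV cache exists because", return_tensors="pt").input_ids
past_key_values = None   # the KV cache, empty before the first forward pass
generated = []

with torch.no_grad():
    for _ in range(200):                       # 200 output tokens -> 200 forward passes
        out = model(input_ids=input_ids,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values  # cache grows by one token per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token.item())
        input_ids = next_token                 # only the new token is fed back in

print(tok.decode(generated))
```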
The KV cache problem:
During autoregressive generation, each forward pass attends to all previously generated tokens. The key-value matrices for all previous tokens are cached in GPU memory — this is the KV cache. Without it, you'd recompute attention over the entire sequence on every step (quadratically expensive).
The KV cache grows linearly with sequence length. For a 7B model with 4096-token context:
- Cache per token: 2 × num_layers × num_heads × head_dim × sizeof(float16) = 2 × 32 × 32 × 128 × 2 bytes ≈ 524 KB
- Maximum KV cache per request: 4096 × 524 KB ≈ 2 GB
- A100 80GB: 13 GB model weights + 50 concurrent requests × 2 GB = 113 GB — doesn't fit
The memory math is brutal. A naively implemented serving system holds only 30-40 concurrent requests before running out of memory.
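A quick back-of-envelope script for the numbers above (shapes assumed to be Llama-7B-like: 32 layers, 32 heads, head dim 128, fp16):

```python
# KV cache sizing for a 7B-class model (Llama-7B-like shapes assumed).
num_layers, num_heads, head_dim = 32, 32, 128
bytes_per_elem = 2                      # float16
context_len = 4096

kv_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_elem   # K and V
kv_per_request = kv_per_token * context_len

print(f"KV cache per token:   {kv_per_token / 1e3:.0f} KB")    # ~524 KB
print(f"KV cache per request: {kv_per_request / 1e9:.1f} GB")  # ~2.1 GB

weights_gb = 13                         # 7B params in fp16, roughly
gpu_gb = 80                             # A100 80GB
max_requests = int((gpu_gb - weights_gb) * 1e9 // kv_per_request)
print(f"Max concurrent requests at full context: {max_requests}")  # ~31
```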
The memory fragmentation problem:
KV cache memory must be reserved when a request arrives, but the final sequence length is unknown. Pre-allocate for the maximum (4096 tokens) → waste memory on short requests. Allocate dynamically → memory fragments as requests of different lengths complete at different times. Traditional allocators waste 60-80% of available KV cache memory to fragmentation. On a $3/hour A100, that wasted memory means fewer concurrent requests and idle GPU compute you are still paying for.
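This is the problem PagedAttention solves: carve KV memory into fixed-size blocks and hand them out on demand, so per-request waste is bounded by one partially filled block. The toy allocator below sketches the idea; the class name, the 16-token block size, and the interface are illustrative, not vLLM's actual API.

```python
class PagedKVAllocator:
    """Toy block allocator: KV memory is carved into fixed-size blocks and
    handed out one block at a time, so per-request waste is at most one
    partially filled block instead of (max_len - actual_len) tokens."""

    def __init__(self, total_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free = list(range(total_blocks))
        self.block_tables = {}   # request_id -> list of physical block ids
        self.token_counts = {}   # request_id -> tokens cached so far

    def append_token(self, request_id: str) -> int:
        """Record one generated token; allocate a new block only when needed."""
        count = self.token_counts.get(request_id, 0)
        if count % self.block_tokens == 0:    # first token, or current block is full
            if not self.free:
                raise MemoryError("KV cache exhausted; scheduler must preempt or queue")
            self.block_tables.setdefault(request_id, []).append(self.free.pop())
        self.token_counts[request_id] = count + 1
        return self.block_tables[request_id][-1]

    def release(self, request_id: str) -> None:
        """On completion, blocks go straight back to the pool, so nothing fragments."""
        self.free.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```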
What Interviewers Are Evaluating at Each Level
Mid-level: Can you describe the autoregressive generation process and why it's different from standard neural network inference? Do you understand why batching matters? Can you name vLLM as a concrete tool?
Senior-level: Can you explain PagedAttention and why it reduces memory waste to <5%? Do you understand continuous batching vs static batching and the 2-5× throughput difference? Can you describe S-LoRA for multi-tenant adapter serving? Do you have a framework for cost-per-token calculation?
Staff-level: Can you design the full RLHF pipeline with engineering tradeoffs between PPO/DPO/GRPO? Do you understand disaggregated prefill-decode? Can you compute the break-even analysis for self-hosting vs cloud at a given token volume? Do you articulate failure modes around KV cache thrashing, reward hacking, and adapter cold starts?
Clarifying Questions — Ask These First
What model scale and family?
7B models fit on a single A100 80GB. 70B models require 4-8 A100 80GB with tensor parallelism. 400B+ (GPT-4 scale) requires pipeline parallelism across nodes. The serving architecture changes significantly by model size.
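A rough sizing check behind those thresholds, counting fp16 weights only and assuming roughly half of each A100 80GB is left free for KV cache and activations:

```python
import math

# fp16 weights only; assume ~half of each A100 80GB stays free
# for KV cache, activations, and fragmentation headroom.
for params_b in (7, 70, 400):
    weights_gb = params_b * 2                 # 2 bytes per parameter
    gpus = math.ceil(weights_gb / 40)
    print(f"{params_b}B: {weights_gb} GB weights -> at least {gpus} x A100 80GB")
# 7B -> 1 GPU, 70B -> 4+ GPUs (tensor parallel), 400B -> 20+ GPUs (pipeline parallel across nodes)
```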
What's the latency profile?
TTFT (Time to First Token) — how long before the user sees the first token. TPOT (Time Per Output Token) — how quickly tokens stream after that. For interactive chat: TTFT <500ms, TPOT <50ms. For batch inference, latency matters far less than throughput. The two profiles optimize for different things.
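Concretely, for a 200-token streamed reply at those interactive targets:

```python
ttft_ms = 500            # prefill + first decode step
tpot_ms = 50             # per-token streaming speed
output_tokens = 200

total_ms = ttft_ms + (output_tokens - 1) * tpot_ms
print(f"Full response in {total_ms / 1000:.1f} s")
# ~10.5 s end to end, but the user starts reading after 0.5 s;
# that is why TTFT and TPOT are budgeted and optimized separately.
```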
Single model or multi-tenant LoRA?
Serving one base model is straightforward. Serving 100 fine-tuned LoRA adapters from the same base model is the multi-tenant problem that S-LoRA / InfiniLoRA solve. Ask explicitly — it changes the memory and scheduling architecture.
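Why this is a scheduling problem rather than a raw capacity problem: adapter weights are tiny next to the base model. Rough numbers, assuming Llama-7B-like shapes and rank-16 LoRA on the q/k/v/o projections:

```python
# Adapter size vs. base model size (shapes and rank are assumptions).
hidden, rank, layers, target_matrices = 4096, 16, 32, 4
adapter_params = layers * target_matrices * 2 * rank * hidden    # A and B matrices
adapter_mb = adapter_params * 2 / 1e6                            # fp16
print(f"One adapter:  {adapter_mb:.0f} MB")                      # ~34 MB
print(f"100 adapters: {adapter_mb * 100 / 1e3:.1f} GB")          # ~3.4 GB of adapters
# The hard part is paging adapters in/out and batching requests across
# adapters without stalling the base model; that is what S-LoRA addresses.
```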
What alignment requirement?
Pre-trained base model only vs instruction-tuned vs RLHF-aligned. Each stage has different training infrastructure requirements. PPO requires 4× model memory. DPO requires 2×. This shapes the training cluster design.
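Where those multipliers come from: PPO commonly keeps four models resident (policy, frozen reference, reward model, value/critic), while DPO needs only the policy and the frozen reference. Using the ~13 GB fp16 figure for a 7B model from earlier:

```python
# Weight-only memory for alignment stages on a 7B model; optimizer states,
# gradients, and activations add substantially more on top of this.
weights_gb = 13
ppo_models = 4   # policy, reference, reward model, value/critic
dpo_models = 2   # policy, reference
print(f"PPO: {ppo_models * weights_gb} GB of weights resident")   # 52 GB
print(f"DPO: {dpo_models * weights_gb} GB of weights resident")   # 26 GB
```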
What cost sensitivity?
High-volume, cost-sensitive applications (>10M tokens/day) benefit from self-hosting or hybrid. Low-volume, quality-sensitive applications should use cloud APIs. The break-even calculation is an explicit staff-level question.
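A toy version of that break-even, with all prices and throughput figures as stated assumptions rather than quotes (the $3/hour A100 rate is the figure used earlier in this guide):

```python
import math

tokens_per_day = 50_000_000
api_price_per_1k = 0.002              # $/1K tokens, assumed blended input/output rate
gpu_hour_cost = 3.00                  # $/A100-hour, figure used earlier in this guide
gpu_tokens_per_sec = 2_000            # per-GPU throughput with continuous batching (assumed)

api_cost = tokens_per_day / 1_000 * api_price_per_1k
gpus = math.ceil(tokens_per_day / (gpu_tokens_per_sec * 86_400))
self_host_cost = gpus * gpu_hour_cost * 24

print(f"Cloud API:   ${api_cost:,.0f}/day")                       # $100/day at these assumptions
print(f"Self-hosted: ${self_host_cost:,.0f}/day ({gpus} GPU)")    # $72/day, before ops overhead
```

The crossover point moves with the assumed API price and achievable GPU throughput, which is exactly why interviewers expect the calculation to be shown rather than asserted.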