

Speculative Decoding: 2-4x LLM Inference Speedup Without Quality Loss

How speculative decoding exploits GPU underutilization in the decode phase to achieve 2-4x speedup with mathematically guaranteed output distribution equivalence. Covers draft-verify mechanics, acceptance probability, Medusa heads, Lookahead decoding, and the non-obvious constraints around tokenizer matching.


The Fundamental Bottleneck — Why Autoregressive Decoding Is Slow

Autoregressive LLM decoding generates tokens one at a time. Generating a 200-token response from Llama 3 70B requires 200 separate forward passes — each reading all 140GB of model weights from GPU HBM, computing attention over the growing KV cache, then sampling exactly one token.
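To make the one-token-per-pass structure concrete, here is a minimal sketch of the decode loop. The embedding-plus-linear "model" is a hypothetical stand-in so the snippet runs on its own; a real deployment would call the actual LLM and reuse its KV cache rather than re-encoding the whole sequence each step:

```python
import torch

# Toy stand-in for an LLM: token ids -> per-position next-token logits.
# (Hypothetical; the real model would be e.g. Llama 3 70B behind an inference server.)
vocab_size, d_model = 1000, 64
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

def forward(ids: torch.Tensor) -> torch.Tensor:
    return lm_head(embed(ids))  # shape: [seq_len, vocab_size]

def generate(prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = forward(ids)               # one full forward pass...
        next_id = torch.argmax(logits[-1])  # ...yields exactly one new token
        ids = torch.cat([ids, next_id.unsqueeze(0)])
    return ids

print(generate(torch.tensor([1, 2, 3]), max_new_tokens=10))
```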

The hardware waste: An H100 has 3.35 TB/s of HBM memory bandwidth and ~2,000 TFLOPS of BF16 compute. During decode, the GPU runs at 30–40% of compute utilization because each decode step requires loading weights but only performs a tiny matrix multiply (batch_size × d_model × d_model for one token vs. the full sequence matrix multiply during prefill). The GPU is memory-bandwidth-bound, not compute-bound.

To quantify: streaming Llama 3 70B's weights (140GB in FP16) out of HBM takes 140GB / 3,350 GB/s ≈ 42ms, a hard floor per token regardless of how fast the Tensor Cores can compute. At 42ms/token, a 200-token response takes 8.4 seconds, and the GPU's ~2,000 TFLOPS of compute sits idle more than 60% of the time waiting on memory.
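The same arithmetic, as a few lines of Python using the figures quoted above (back-of-the-envelope numbers, not measurements):

```python
# Memory-bandwidth floor for Llama 3 70B decode on one H100 (figures from the text).
weight_bytes = 140e9           # 70B parameters x 2 bytes (FP16/BF16)
hbm_bandwidth = 3.35e12        # bytes/s of HBM bandwidth
latency_floor = weight_bytes / hbm_bandwidth

print(f"per-token floor: {latency_floor * 1e3:.1f} ms")   # ~41.8 ms
print(f"200 tokens:      {200 * latency_floor:.1f} s")    # ~8.4 s
```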

The speculative decoding insight: if you can verify K draft tokens in a single forward pass of the large model (the same cost as generating one token), you have effectively generated up to K tokens for the price of one. In practice the verifier accepts a prefix of the draft, so each large-model pass yields anywhere from 1 to K+1 tokens depending on how often the draft agrees with the target. This is possible because the large model's forward pass is already parallel over sequence length during verification.
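A minimal greedy-decoding sketch of one draft-then-verify step. The names are hypothetical: target and draft are assumed to be callables (like the toy forward above) returning per-position logits, and a production system such as vLLM would additionally manage KV caches, batching, and sampled (non-greedy) verification:

```python
import torch

def speculative_step(target, draft, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One speculative step under greedy decoding: the cheap draft model proposes
    k tokens, then the large target model verifies all of them in a single pass."""
    # 1) Draft model proposes k tokens autoregressively (cheap: small model).
    draft_ids = ids
    for _ in range(k):
        next_id = torch.argmax(draft(draft_ids)[-1])
        draft_ids = torch.cat([draft_ids, next_id.unsqueeze(0)])

    # 2) Target model scores every drafted position in ONE forward pass,
    #    roughly the same cost as generating a single token.
    target_next = torch.argmax(target(draft_ids), dim=-1)  # target's pick at each position

    # 3) Accept the longest prefix where draft and target agree; at the first
    #    mismatch, emit the target's own token instead and stop.
    n = ids.shape[0]
    accepted = ids
    for i in range(k):
        proposed = draft_ids[n + i]
        if proposed == target_next[n + i - 1]:
            accepted = torch.cat([accepted, proposed.unsqueeze(0)])
        else:
            accepted = torch.cat([accepted, target_next[n + i - 1].unsqueeze(0)])
            break
    else:
        # All k drafts accepted: the verification pass also yields one bonus token.
        accepted = torch.cat([accepted, target_next[-1].unsqueeze(0)])
    return accepted

# Example: with the toy forward() above as both models, every draft is accepted
# and one call yields k + 1 = 5 new tokens per "target" pass.
# out = speculative_step(forward, forward, torch.tensor([1, 2, 3]), k=4)
```

Under greedy decoding, the exact-match check is a special case of the general rejection rule, so the emitted tokens are exactly what the target model alone would have produced.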

IMPORTANT

The Key Guarantee — Output Distribution Equivalence

Speculative decoding is lossless: the token distribution of the output is mathematically identical to sampling directly from the large model without speculative decoding. This is not an approximation — it is a theorem proven via rejection sampling.

The acceptance/rejection procedure ensures: if you run 1M speculative decoding requests vs 1M standard decoding requests with the same temperature and prompts, the output token distributions are identical. This is what separates speculative decoding from approximation methods like quantization or pruning, which accept a quality tradeoff. Speculative decoding is pure latency optimization with zero quality tradeoff.
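For the sampled (non-greedy) case, the guarantee comes from the accept/resample rule used throughout the speculative sampling literature: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the normalized residual max(0, p - q). A small numeric sketch (the toy distributions below are illustrative, not from this guide):

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(p: np.ndarray, q: np.ndarray) -> int:
    """Rejection rule for one drafted position: the returned token is distributed
    exactly according to p, regardless of the draft distribution q."""
    x = rng.choice(len(q), p=q)                  # draft model samples x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):     # accept with prob min(1, p(x)/q(x))
        return x
    residual = np.maximum(p - q, 0.0)            # on rejection, resample from the
    return rng.choice(len(p), p=residual / residual.sum())  # normalized residual

# Empirical check: the output histogram matches p even for a badly mismatched q.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
samples = [accept_or_resample(p, q) for _ in range(100_000)]
print(np.bincount(samples, minlength=3) / len(samples))     # ~ [0.6, 0.3, 0.1]
```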
