LLM Quantization: INT4/INT8, GPTQ, AWQ, and bitsandbytes
How to compress LLMs from 140GB to 35GB without destroying quality. Covers PTQ vs QAT, INT8 absmax/zero-point methods, GPTQ Hessian-based INT4, AWQ salient-weight protection, bitsandbytes mixed-precision, and the calibration dataset trap most engineers miss.
Why Quantization Exists — The Memory Math
Llama 3 70B in FP16 requires 140GB of GPU HBM. A single H100 has 80GB. You need two H100s just to load the weights — before any KV cache or activations. An H100 costs ~$25K/month on-demand. The economics of FP16 at scale are brutal.
Quantization reduces the precision of weights (and optionally activations) to smaller data types:
- FP16 → INT8: 140GB → 70GB (Llama 3 70B fits on one H100 with room for KV cache)
- FP16 → INT4: 140GB → 35GB (fits on one H100 with generous KV cache headroom)
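The arithmetic behind these numbers is worth internalizing: weight memory is simply parameter count × bits per weight. A minimal sketch (the function name is illustrative, and it counts weights only — no KV cache, activations, or quantization-scale overhead):

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

n = 70e9  # Llama 3 70B parameter count
print(model_memory_gb(n, 16))  # FP16 -> 140.0
print(model_memory_gb(n, 8))   # INT8 -> 70.0
print(model_memory_gb(n, 4))   # INT4 -> 35.0
```

The same formula tells you instantly whether a target model fits on a given GPU before you touch any serving framework.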
The win isn't just memory: lower precision also means faster matrix multiplications. INT8 matrix multiply (IMMA) on A100/H100 runs at 2× the throughput of FP16 GEMM. INT4 (via packing two INT4 values into one INT8 byte) achieves comparable throughput with 4× memory reduction.
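The INT4 packing mentioned above is conceptually simple: two signed 4-bit values share one byte, one in the low nibble and one in the high nibble. A minimal numpy sketch of the idea (not any specific kernel's layout — real GPU kernels use interleaved orderings tuned for the hardware):

```python
import numpy as np

def pack_int4(vals: np.ndarray) -> np.ndarray:
    """Pack pairs of signed INT4 values (-8..7) into one uint8 each:
    even index -> low nibble, odd index -> high nibble."""
    assert vals.size % 2 == 0
    u = (vals & 0x0F).astype(np.uint8)  # two's-complement nibble
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the signed INT4 values from the packed bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return np.where(out >= 8, out - 16, out).astype(np.int8)  # sign-extend

w = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(w)), w)  # lossless round trip
```

This is why INT4 halves memory again relative to INT8: the storage dtype is still INT8, but each element carries two weights.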
Why this matters at interview: "Quantize the model" is a sentence every candidate says. What separates a senior engineer is knowing which quantization method to choose, which layers to quantize, which layers to skip, and what calibration data to use — the decisions that determine whether you get a 2% perplexity drop or a 15% quality collapse.
The Three Quantization Decisions That Determine Success
- What to quantize: Weights only (W-only) or weights + activations (W+A). Activations have dynamic range 10-100× larger than weights — W+A quantization is harder but enables faster INT8 GEMM on actual hardware.
- Granularity: Per-tensor (one scale for entire weight matrix), per-channel/row (one scale per output channel), per-group (one scale per 128 weights). Finer granularity = better quality, more overhead.
- Calibration data: The dataset you run through the model to determine quantization scales. Using the wrong domain data is the most common production mistake — a model calibrated on generic corpora like C4 or Wikipedia but deployed for SQL generation will show a 2-3× larger accuracy drop than one calibrated on SQL examples.
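The granularity decision can be made concrete with a toy experiment. Below is a minimal numpy sketch of symmetric absmax quantization (the function name and the single-outlier setup are illustrative): a lone outlier weight inflates the per-tensor scale and degrades every weight in the matrix, while per-row and per-group (group size 128) scales confine the damage:

```python
import numpy as np

def absmax_quant_error(w: np.ndarray, bits: int, axis=None) -> float:
    """Symmetric absmax quantization: scale = absmax / qmax, computed
    over the whole tensor (axis=None) or per-slice along `axis`.
    Returns mean squared reconstruction error."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max(axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(w / s), -qmax, qmax)
    return float(np.mean((q * s - w) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w[0, 0] = 50.0  # a single outlier blows up the per-tensor scale

per_tensor = absmax_quant_error(w, bits=8, axis=None)
per_row = absmax_quant_error(w, bits=8, axis=1)
per_group = absmax_quant_error(w.reshape(256, 2, 128), bits=8, axis=2)

assert per_row < per_tensor    # finer granularity isolates the outlier
assert per_group < per_row     # group-wise scales are finer still
```

The overhead side of the tradeoff: per-tensor stores one FP16 scale, per-row stores one per output channel, and per-group (the GPTQ/AWQ default of 128) stores one per 128 weights — still well under 1 bit per weight of extra storage.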