
LLM Quantization: INT4/INT8, GPTQ, AWQ, and bitsandbytes

How to compress LLMs from 140GB to 35GB without destroying quality. Covers PTQ vs QAT, INT8 absmax/zero-point methods, GPTQ Hessian-based INT4, AWQ salient-weight protection, bitsandbytes mixed-precision, and the calibration dataset trap most engineers miss.

50 min read · 3 sections · 1 interview question
Tags: Quantization, GPTQ, AWQ, bitsandbytes, INT8, INT4, Post-Training Quantization, LLM Compression, Model Serving, Mixed Precision, Calibration, Weight Quantization, BF16, FP8

Why Quantization Exists — The Memory Math

Llama 3 70B in FP16 requires 140GB of GPU HBM. A single H100 has 80GB. You need two H100s just to load the weights — before any KV cache or activations. An H100 costs ~$25K/month on-demand. The economics of FP16 at scale are brutal.

Quantization reduces the precision of weights (and optionally activations) to smaller data types:

  • FP16 → INT8: 140GB → 70GB (Llama 3 70B fits on one H100 with room for KV cache)
  • FP16 → INT4: 140GB → 35GB (fits on one H100 with generous KV cache headroom; the sketch after this list works through the arithmetic)
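
To make the memory arithmetic concrete, here is a back-of-the-envelope estimator in plain Python. The scale-overhead fractions are illustrative assumptions (a 16-bit scale per group of 128 INT4 weights adds roughly 3%), not measured numbers for any particular runtime.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float,
                     scale_overhead: float = 0.0) -> float:
    """Approximate weight-only memory in GB.

    bits_per_weight: 16 for FP16/BF16, 8 for INT8, 4 for INT4.
    scale_overhead:  extra fraction for quantization scales/zero-points,
                     e.g. a 16-bit scale per 128 INT4 weights adds ~3%.
                     (Illustrative assumption, not a measured figure.)
    """
    total_bytes = n_params * bits_per_weight / 8 * (1 + scale_overhead)
    return total_bytes / 1e9


llama3_70b = 70e9  # parameter count, to one significant figure

print(f"FP16:          {weight_memory_gb(llama3_70b, 16):6.1f} GB")        # ~140 GB
print(f"INT8 (w-only): {weight_memory_gb(llama3_70b, 8, 0.01):6.1f} GB")   # ~71 GB
print(f"INT4 (g=128):  {weight_memory_gb(llama3_70b, 4, 0.03):6.1f} GB")   # ~36 GB
```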

The tradeoff isn't just memory: lower precision means faster matrix multiplications. INT8 matrix multiply (IMMA) on A100/H100 runs at 2× the throughput of FP16 GEMM. INT4 (via packing two INT4 values into one INT8 register) achieves comparable throughput with 4× memory reduction.
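
The packing trick is easy to see in isolation. The NumPy sketch below stores pairs of unsigned 4-bit values in single bytes and recovers them again; real INT4 kernels in GPTQ/AWQ runtimes do this on the GPU with their own tile layouts, so treat this purely as a conceptual illustration.

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack an even-length array of unsigned 4-bit values (0..15) into bytes.

    Two 4-bit values share one uint8: the first lands in the low nibble,
    the second in the high nibble.
    """
    assert values.ndim == 1 and values.size % 2 == 0
    assert values.min() >= 0 and values.max() <= 15
    lo = values[0::2].astype(np.uint8)
    hi = values[1::2].astype(np.uint8)
    return ((hi << 4) | lo).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the original 4-bit values."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F   # low nibble
    out[1::2] = packed >> 4     # high nibble
    return out

vals = np.array([3, 12, 0, 15, 7, 9], dtype=np.uint8)
packed = pack_int4(vals)
assert np.array_equal(unpack_int4(packed), vals)
print(f"{vals.size} 4-bit values stored in {packed.nbytes} bytes")
```

Packing halves the byte count; a real weight-only kernel unpacks the nibbles and multiplies by the per-group scales on the fly during the matrix multiply.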

Why this matters at interview: "Quantize the model" is a sentence every candidate says. What separates a senior engineer is knowing which quantization method to choose, which layers to quantize, which layers to skip, and what calibration data to use — the decisions that determine whether you get a 2% perplexity drop or a 15% quality collapse.

IMPORTANT

The Three Quantization Decisions That Determine Success

  1. What to quantize: weights only (W-only) or weights + activations (W+A). Activations have a dynamic range 10-100× larger than weights, so W+A quantization is harder, but it is what enables faster INT8 GEMM on actual hardware.
  2. Granularity: per-tensor (one scale for the entire weight matrix), per-channel/row (one scale per output channel), or per-group (one scale per 128 weights). Finer granularity = better quality, more overhead; the sketch after this callout shows the effect on reconstruction error.
  3. Calibration data: the dataset you run through the model to determine quantization scales. Using the wrong domain is the most common production mistake: a model calibrated on general web text such as C4 but deployed for SQL generation will see a 2-3× larger accuracy drop than one calibrated on SQL examples.
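
The granularity decision is easy to demonstrate numerically. The NumPy sketch below applies symmetric absmax quantization to a synthetic weight matrix at per-tensor, per-row, and per-group-of-128 granularity and compares reconstruction error. The weight distribution, the injected outliers, and the group size are assumptions chosen only to illustrate the effect, not a recipe from any specific library.

```python
import numpy as np

def absmax_quantize(w: np.ndarray, n_bits: int = 8) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric absmax quantization along the last axis.

    Returns (quantized integer values, scales). One scale per slice of the
    last axis, so the caller controls granularity by how it reshapes w.
    """
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard against all-zero slices
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q, scale

def reconstruction_error(w: np.ndarray, shape: tuple) -> float:
    """Quantize w at the granularity implied by `shape`, then report RMSE."""
    q, scale = absmax_quantize(w.reshape(shape))
    w_hat = (q * scale).reshape(w.shape)
    return float(np.sqrt(np.mean((w - w_hat) ** 2)))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096))
w[0, :16] *= 50                                       # a handful of large weights, mimicking outliers

print("per-tensor:   ", reconstruction_error(w, (1, -1)))
print("per-row:      ", reconstruction_error(w, (4096, -1)))
print("per-group-128:", reconstruction_error(w, (-1, 128)))
```

The few outlier weights blow up the single per-tensor scale, which is exactly the failure mode that per-channel and per-group scales (and methods like AWQ's salient-weight protection) are designed to contain. For W+A quantization, the same scale computation runs over activations collected from calibration data rather than over static weights, which is where the calibration-domain trap in point 3 comes in.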
