GPU Infrastructure for ML Serving: Quantization, Batching & Inference Optimization
The engineering decisions that determine whether your model serves at 10ms or 200ms — GPU selection, quantization (INT8/FP16/FP8), dynamic batching, KV cache management, and when to use Triton vs vLLM vs TensorRT-LLM.
GPU Selection — Not All GPUs Are Equal for Inference
The GPU you choose determines your latency floor, throughput ceiling, and cost per inference. For ML system design interviews, knowing the rough specs of major GPU families shows production credibility.
A100 (80GB HBM2e): The standard production GPU for both training and inference of large models. 312 TFLOPS FP16 (Tensor Core), 19.5 TFLOPS FP32, ~2.0 TB/s memory bandwidth. Third-generation NVLink for multi-GPU (600 GB/s intra-node). The workhorse of ML infrastructure at Meta, Google, and Microsoft.
H100 (80GB HBM3): ~2× the A100 in throughput, driven by native FP8 support and faster HBM3 (3.35 TB/s). FlashAttention-2 runs significantly faster. Use for LLM inference where memory bandwidth, not compute, is the bottleneck. Roughly 3× the cost of an A100.
A10G (24GB GDDR6): The GPU behind AWS g5 instances. 125 TFLOPS FP16. Suitable for smaller-model inference (<10B parameters) at lower cost. Common for serving fine-tuned 7B models.
T4 (16GB GDDR6): Legacy inference GPU. INT8 optimized. Cheap. For small models (BERT-base, XGBoost ensemble via GPU). Not suitable for modern LLMs.
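The memory-bandwidth figures above matter because autoregressive LLM decode is typically memory-bound: every generated token streams the full weight set from HBM, so single-stream throughput is roughly bandwidth divided by model size in bytes. A back-of-envelope sketch (illustrative numbers, using the bandwidths quoted above):

```python
def decode_tokens_per_s(params_b: float, bytes_per_param: float,
                        bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode throughput (tokens/sec),
    assuming decode is purely memory-bandwidth-bound."""
    model_bytes = params_b * 1e9 * bytes_per_param   # total weight bytes
    bandwidth_bytes = bandwidth_tb_s * 1e12          # bytes/sec
    return bandwidth_bytes / model_bytes

# A 70B model in FP16 (2 bytes/param):
a100 = decode_tokens_per_s(70, 2.0, 2.0)    # A100 @ ~2.0 TB/s -> ~14 tok/s
h100 = decode_tokens_per_s(70, 2.0, 3.35)   # H100 @ 3.35 TB/s -> ~24 tok/s
```

This is why the H100's HBM3 (and FP8, which halves bytes per parameter) moves the needle for LLM serving even when raw TFLOPS are not the constraint; batching amortizes the weight reads across requests and pushes effective throughput well past this single-stream bound.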
Rule of thumb: For latency-critical inference of models < 7B: A10G or T4. For models 7–70B: A100. For frontier models (70B+) with high-throughput requirements: H100.
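The rule of thumb above can be encoded as a tiny selection helper (a sketch of the heuristic only; real selection would also weigh quantization, batch size, and KV-cache headroom):

```python
def pick_gpu(params_b: float) -> str:
    """Map model parameter count (in billions) to the GPU tier
    suggested by the rule of thumb above."""
    if params_b < 7:
        return "A10G or T4"    # latency-critical small-model inference
    if params_b < 70:
        return "A100"          # mid-size models, 7-70B
    return "H100"              # frontier models with high-throughput needs

print(pick_gpu(3))    # "A10G or T4"
print(pick_gpu(13))   # "A100"
print(pick_gpu(175))  # "H100"
```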