GPU Infrastructure for ML Serving: Quantization, Batching & Inference Optimization
The engineering decisions that determine whether your model serves at 10ms or 200ms — GPU selection, quantization (INT8/FP16/FP8), dynamic batching, KV cache management, and when to use Triton vs vLLM vs TensorRT-LLM.
GPU Selection — Not All GPUs Are Equal for Inference
The GPU you choose determines your latency floor, throughput ceiling, and cost per inference. For ML system design interviews, knowing the rough specs of major GPU families shows production credibility.
A100 (80GB HBM2e): The standard production GPU for both training and inference of large models. 312 TFLOPS FP16 (Tensor Core), 19.5 TFLOPS FP32, ~2.0 TB/s memory bandwidth. Third-generation NVLink for multi-GPU (600 GB/s intra-node). The workhorse of ML infrastructure at Meta, Google, and Microsoft.
H100 (80GB HBM3): ~2× the A100 in throughput, driven by native FP8 support and faster HBM3 (3.35 TB/s). FlashAttention-2 runs significantly faster. Use for LLM inference where memory bandwidth, not compute, is the bottleneck. Roughly 3× the cost of an A100.
A10G (24GB GDDR6): The GPU behind AWS g5 instances. 125 TFLOPS FP16. Suitable for smaller-model inference (<10B parameters) at lower cost. Common for serving fine-tuned 7B models.
T4 (16GB GDDR6): Legacy inference GPU. INT8 optimized. Cheap. For small models (BERT-base, XGBoost ensemble via GPU). Not suitable for modern LLMs.
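The memory-bandwidth figures above matter because autoregressive LLM decode is typically memory-bound: every generated token streams the full weight set from HBM, so single-stream throughput is roughly bandwidth divided by model size in bytes. A back-of-envelope sketch (illustrative numbers, using the bandwidths quoted above):

```python
def decode_tokens_per_s(params_b: float, bytes_per_param: float,
                        bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode throughput (tokens/sec),
    assuming decode is purely memory-bandwidth-bound."""
    model_bytes = params_b * 1e9 * bytes_per_param   # total weight bytes
    bandwidth_bytes = bandwidth_tb_s * 1e12          # bytes/sec
    return bandwidth_bytes / model_bytes

# A 70B model in FP16 (2 bytes/param):
a100 = decode_tokens_per_s(70, 2.0, 2.0)    # A100 @ ~2.0 TB/s -> ~14 tok/s
h100 = decode_tokens_per_s(70, 2.0, 3.35)   # H100 @ 3.35 TB/s -> ~24 tok/s
```

This is why the H100's HBM3 (and FP8, which halves bytes per parameter) moves the needle for LLM serving even when raw TFLOPS are not the constraint; batching amortizes the weight reads across requests and pushes effective throughput well past this single-stream bound.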
Rule of thumb: For latency-critical inference of models < 7B: A10G or T4. For models 7–70B: A100. For frontier models (70B+) with high-throughput requirements: H100.
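The rule of thumb above can be encoded as a tiny selection helper (a sketch of the heuristic only; real selection would also weigh quantization, batch size, and KV-cache headroom):

```python
def pick_gpu(params_b: float) -> str:
    """Map model parameter count (in billions) to the GPU tier
    suggested by the rule of thumb above."""
    if params_b < 7:
        return "A10G or T4"    # latency-critical small-model inference
    if params_b < 70:
        return "A100"          # mid-size models, 7-70B
    return "H100"              # frontier models with high-throughput needs

print(pick_gpu(3))    # "A10G or T4"
print(pick_gpu(13))   # "A100"
print(pick_gpu(175))  # "H100"
```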