

Distributed Training: Data Parallelism, Model Parallelism, and FSDP

How to scale model training from a single GPU to thousands. Covers data parallelism with Ring AllReduce, model/tensor/pipeline parallelism for LLMs, PyTorch DDP vs FSDP2, and how to choose the right strategy based on model size vs data volume.


Why Distributed Training Exists — And When You Actually Need It

Single-GPU training breaks for two distinct reasons:

Reason 1: Data volume. Training on 10B examples would take 6 months on a single GPU. You need to parallelize across many GPUs processing different data subsets simultaneously. Solution: data parallelism.
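The gradient synchronization behind data parallelism is typically a Ring AllReduce. A minimal pure-Python simulation of that algorithm is sketched below (assumptions: one scalar chunk per worker, so each gradient vector must have length equal to the worker count, and `ring_allreduce` is a hypothetical name, not a real NCCL or PyTorch call):

```python
def ring_allreduce(grads):
    """Toy in-memory Ring AllReduce over n workers.

    grads: list of n per-worker gradient vectors, each of length n
    (one chunk per worker). Returns all n vectors, each now holding
    the element-wise sum across workers.
    """
    n = len(grads)
    chunks = [list(g) for g in grads]  # chunks[w][c] = worker w's chunk c

    # Phase 1: reduce-scatter. In n-1 steps, each worker passes one chunk
    # to its right neighbor, which accumulates it. Afterward, worker i
    # owns the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                 # chunk worker i sends this step
            chunks[(i + 1) % n][c] += chunks[i][c]

    # Phase 2: all-gather. Each worker forwards its freshest chunk around
    # the ring; receivers overwrite instead of accumulating.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    return chunks
```

Each of the 2(n-1) steps moves only one chunk per worker, so per-worker traffic stays roughly 2× the gradient size regardless of how many workers join the ring, which is why this pattern scales to large clusters.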

Reason 2: Model size. A 70B-parameter model at float32 requires 280GB of GPU memory for the weights alone — no single GPU has that, and gradients plus optimizer state push the total several times higher. You need to distribute the model itself across multiple GPUs. Solution: model parallelism (tensor, pipeline, or both).
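The 280GB figure is simple arithmetic: 70 × 10⁹ parameters × 4 bytes. A rough back-of-envelope sketch of the full fp32 training footprint (illustrative only: Adam is assumed as the optimizer, and activations, buffers, and fragmentation are ignored):

```python
# Rough fp32 training-memory arithmetic (a sketch, not a profiler).
def training_memory_gb(params_billion: float) -> float:
    n = params_billion * 1e9
    weights = n * 4        # fp32 parameters
    grads = n * 4          # fp32 gradients
    adam_states = n * 8    # Adam's first and second moments (m, v)
    return (weights + grads + adam_states) / 1e9

print(training_memory_gb(70))  # → 1120.0; the weights alone account for 280 GB
```

So even before activations, full fp32 training of a 70B model needs on the order of a terabyte of state — precisely the state that FSDP shards across devices.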

Most candidates conflate these two problems. In an interview, the correct answer starts by identifying which problem you have:

  • Model fits on a single GPU, but the dataset is large? → Data parallelism (DDP or FSDP)
  • Training or fine-tuning an LLM that doesn't fit on one GPU? → Model parallelism (tensor + pipeline) combined with data parallelism
  • Training a recommendation model on petabytes of sparse features? → Data parallelism with parameter servers (special case)
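The checklist above can be condensed into a small decision helper (a hypothetical sketch — the function name, flags, and order of checks are illustrative, not a standard API):

```python
# Hypothetical helper mirroring the decision rules above.
def pick_strategy(fits_on_one_gpu: bool, is_llm: bool,
                  sparse_petabyte_features: bool = False) -> str:
    if sparse_petabyte_features:
        # Huge sparse embedding tables: the parameter-server special case.
        return "data parallelism + parameter servers"
    if fits_on_one_gpu:
        # Most system-design scenarios (ranking, fraud, search) land here.
        return "data parallelism (DDP or FSDP)"
    if is_llm:
        # Model too big for one device: shard the model itself.
        return "tensor + pipeline parallelism, combined with data parallelism"
    return "model parallelism"
```

Note the ordering: check the special case first, then the common case, and only reach for model parallelism when the model genuinely does not fit.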

For the vast majority of ML system design interviews (recommendation, search, fraud, ranking), data parallelism is the right answer. Model parallelism only comes up for LLM/GenAI-specific discussions.
