Mixture of Experts (MoE): Sparse Scaling Behind GPT-4 & Mixtral
The sparse activation architecture that powers GPT-4, Mixtral 8x7B, and DeepSeek-V3. Covers top-k gating math, router training with load-balancing losses, capacity factor, expert-choice vs token-choice routing, expert parallelism with all-to-all communication, and why MoE gives 10x parameters at constant FLOPs per token. Includes 8 hard interview questions.
Why Mixture of Experts Exists — The Parameter/FLOP Decoupling Problem
Dense transformers hit a wall: doubling parameters doubles FLOPs per token, which doubles training cost and halves inference throughput. If you want GPT-4 quality, you pay GPT-4 compute. This is the fundamental scaling constraint MoE breaks.
The core idea: replace each dense feed-forward layer (which every token traverses in full) with N expert FFNs and a small router. For each token, the router picks the top-k experts (typically k=1 or k=2) and only those experts compute; the other N-k experts stay idle for that token. FFN parameter count grows N-fold, while FFN FLOPs per token grow only k-fold (attention is unchanged).
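A minimal sketch of this routing step, assuming a token-choice top-k router in PyTorch; the class name, dimensions, and GELU FFN are illustrative choices, not any particular model's implementation:

```python
# Minimal token-choice top-k MoE layer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # small gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (tokens, d_model)
        logits = self.router(x)                              # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        topk_p, topk_idx = probs.topk(self.top_k, dim=-1)    # pick k experts per token
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize the k gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                           # which tokens chose expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                     # idle experts do no work for this batch
            out[token_ids] += topk_p[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

With N=8 and k=2, the layer stores eight experts' worth of FFN weights but each token only pays for two expert passes.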
Concrete numbers: Mixtral 8x7B has 47B total parameters but activates only ~13B per token. It trains and serves with the FLOPs of a 13B dense model but the capacity of a 47B model. On MMLU, it matches Llama 2 70B while being 4x cheaper to serve.
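A back-of-the-envelope check of those numbers, assuming the published Mixtral 8x7B configuration (32 layers, d_model 4096, FFN hidden size 14336, 8 experts, top-2 routing, grouped-query attention with 8 KV heads, 32k vocabulary); small terms such as norm parameters are ignored:

```python
# Rough parameter count for Mixtral 8x7B from its published config.
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
n_experts, top_k = 8, 2
n_kv_heads, head_dim = 8, 128                               # grouped-query attention

attn = n_layers * (2 * d_model * d_model                    # q and o projections
                   + 2 * d_model * n_kv_heads * head_dim)   # k and v projections
expert_ffn = 3 * d_model * d_ff                             # gate, up, down (SwiGLU-style FFN)
embeddings = 2 * vocab * d_model                            # input embedding + LM head

total = embeddings + attn + n_layers * n_experts * expert_ffn
active = embeddings + attn + n_layers * top_k * expert_ffn

print(f"total  = {total / 1e9:.1f}B")    # ~46.7B
print(f"active = {active / 1e9:.1f}B")   # ~12.9B
```

The attention and embedding parameters are shared by every token; only the expert FFN term is multiplied by N for storage and by k for compute.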
The misconception to kill early: MoE is not an ensemble. Ensembles average multiple independent models' predictions — every model runs on every input. MoE runs a single model where a routing decision selects a sparse subset of parameters per token. Ensembles add FLOPs; MoE decouples parameters from FLOPs. This distinction is the most common MoE interview trap.
What Interviewers Evaluate on MoE
A 6/10 answer says 'MoE uses multiple experts with a router to pick which one runs.' A 9/10 answer:
- Decouples parameters from FLOPs explicitly — total parameter count is N x dense_FFN, but compute per token is k x dense_FFN. Memory scales with total params; compute scales with active params.
- Names the load-balancing problem — without an auxiliary loss, the router collapses onto 1-2 favorite experts and the rest die. Cites the Switch Transformer auxiliary loss alpha * N * Sum(f_i * P_i), Shazeer 2017's importance loss, or DeepSeek-V3's auxiliary-loss-free bias term (see the first sketch after this list).
- Discusses capacity factor — each expert has a fixed token budget per batch; overflow tokens get dropped or bypass the layer through the residual connection. Too low a capacity factor drops tokens; too high wastes compute and memory (see the second sketch after this list).
- Explains expert parallelism — in distributed training, experts are sharded across GPUs, so routing triggers an all-to-all communication. This is the dominant bottleneck at scale and the main reason MoE training is hard.
- Contrasts MoE tradeoffs honestly — MoE wins at pretraining scale but loses at single-GPU deployment and small model sizes (<7B). Mentions DeepSeek-V3, Mixtral 8x7B, Switch Transformer, and GLaM by name.
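For the load-balancing bullet above, here is a compact sketch of the Switch-Transformer-style auxiliary loss alpha * N * Sum(f_i * P_i); the function name and the alpha default are illustrative choices, not a reference implementation:

```python
# Auxiliary load-balancing loss: alpha * N * sum_i(f_i * P_i), where f_i is the
# fraction of tokens routed (top-1) to expert i and P_i is the mean router
# probability assigned to expert i. It is minimized when both are uniform at 1/N.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """router_logits: (tokens, n_experts) pre-softmax scores from the router."""
    n_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)             # (tokens, n_experts)
    top1 = probs.argmax(dim=-1)                           # hard assignment per token
    f = F.one_hot(top1, n_experts).float().mean(dim=0)    # fraction of tokens per expert
    P = probs.mean(dim=0)                                  # mean router probability per expert
    return alpha * n_experts * torch.sum(f * P)
```

Because f_i comes from hard top-1 assignments, only P_i carries gradient; the loss therefore lowers the router probability of currently over-loaded experts, which evens out assignments over training. DeepSeek-V3 instead nudges a per-expert bias on the routing scores up or down after each step depending on load, avoiding the extra loss term.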
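And for the capacity-factor bullet, a sketch of how the factor turns into a per-expert token budget; real systems differ in how they prioritize which overflow tokens to drop (this version simply keeps the first C tokens in sequence order):

```python
# Per-expert capacity: each expert processes at most C tokens per batch, where
# C = capacity_factor * tokens / n_experts. Tokens routed beyond C overflow and are
# typically dropped, passing through the layer via the residual connection only.
import math
import torch

def expert_capacity(n_tokens: int, n_experts: int, capacity_factor: float) -> int:
    return math.ceil(capacity_factor * n_tokens / n_experts)

def drop_overflow(top1_idx: torch.Tensor, n_experts: int, capacity: int) -> torch.Tensor:
    """top1_idx: (tokens,) expert chosen per token. Returns a bool mask of kept tokens."""
    keep = torch.zeros_like(top1_idx, dtype=torch.bool)
    for e in range(n_experts):
        positions = (top1_idx == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True   # first C tokens fit; the rest overflow
    return keep

# Example: 4096 tokens, 8 experts, capacity factor 1.25 -> each expert accepts at most 640 tokens.
```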