Mixture of Experts (MoE): Sparse Scaling Behind GPT-4 & Mixtral
The sparse activation architecture that powers GPT-4, Mixtral 8x7B, and DeepSeek-V3. Covers top-k gating math, router training with load-balancing losses, capacity factor, expert-choice vs token-choice routing, expert parallelism with all-to-all communication, and why MoE gives 10x parameters at constant FLOPs per token. Includes 8 hard interview questions.
Why Mixture of Experts Exists — The Parameter/FLOP Decoupling Problem
Dense transformers hit a wall: doubling parameters doubles FLOPs per token, which doubles training cost and halves inference throughput. If you want GPT-4 quality, you pay GPT-4 compute. This is the fundamental scaling constraint MoE breaks.
The core idea: replace each dense feed-forward layer (which every token traverses in full) with N expert FFNs and a small router. For each token, the router picks the top-k experts (typically k=1 or k=2) and only those experts compute; the other N-k experts stay idle for that token. FFN parameter count grows N-fold, while FFN FLOPs per token grow only k-fold (attention is unchanged).
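A minimal sketch of this routing step, assuming a token-choice top-k router in PyTorch; the class name, dimensions, and GELU FFN are illustrative choices, not any particular model's implementation:

```python
# Minimal token-choice top-k MoE layer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # small gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (tokens, d_model)
        logits = self.router(x)                              # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        topk_p, topk_idx = probs.topk(self.top_k, dim=-1)    # pick k experts per token
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize the k gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                           # which tokens chose expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                     # idle experts do no work for this batch
            out[token_ids] += topk_p[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

With N=8 and k=2, the layer stores eight experts' worth of FFN weights but each token only pays for two expert passes.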
Concrete numbers: Mixtral 8x7B has 47B total parameters but activates only ~13B per token. It trains and serves with the FLOPs of a 13B dense model but the capacity of a 47B model. On MMLU, it matches Llama 2 70B while being 4x cheaper to serve.
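A back-of-the-envelope check of those numbers, assuming the published Mixtral 8x7B configuration (32 layers, d_model 4096, FFN hidden size 14336, 8 experts, top-2 routing, grouped-query attention with 8 KV heads, 32k vocabulary); small terms such as norm parameters are ignored:

```python
# Rough parameter count for Mixtral 8x7B from its published config.
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
n_experts, top_k = 8, 2
n_kv_heads, head_dim = 8, 128                               # grouped-query attention

attn = n_layers * (2 * d_model * d_model                    # q and o projections
                   + 2 * d_model * n_kv_heads * head_dim)   # k and v projections
expert_ffn = 3 * d_model * d_ff                             # gate, up, down (SwiGLU-style FFN)
embeddings = 2 * vocab * d_model                            # input embedding + LM head

total = embeddings + attn + n_layers * n_experts * expert_ffn
active = embeddings + attn + n_layers * top_k * expert_ffn

print(f"total  = {total / 1e9:.1f}B")    # ~46.7B
print(f"active = {active / 1e9:.1f}B")   # ~12.9B
```

The attention and embedding parameters are shared by every token; only the expert FFN term is multiplied by N for storage and by k for compute.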
The misconception to kill early: MoE is not an ensemble. Ensembles average multiple independent models' predictions — every model runs on every input. MoE runs a single model where a routing decision selects a sparse subset of parameters per token. Ensembles add FLOPs; MoE decouples parameters from FLOPs. This distinction is the most common MoE interview trap.
What Interviewers Evaluate on MoE
A 6/10 answer says 'MoE uses multiple experts with a router to pick which one runs.' A 9/10 answer:
- Decouples parameters from FLOPs explicitly — total parameter count is N x dense_FFN, but compute per token is k x dense_FFN. Memory scales with total params; compute scales with active params.
- Names the load-balancing problem — without an auxiliary loss, the router collapses onto 1-2 favorite experts and the rest die. Cites the Switch Transformer auxiliary loss alpha * N * Sum(f_i * P_i), Shazeer 2017's importance loss, or DeepSeek-V3's auxiliary-loss-free bias term (see the first sketch after this list).
- Discusses capacity factor — each expert has a fixed token budget per batch; overflow tokens get dropped or bypass the layer through the residual connection. Too low a capacity factor drops tokens; too high wastes compute and memory (see the second sketch after this list).
- Explains expert parallelism — in distributed training, experts are sharded across GPUs, so routing triggers an all-to-all communication. This is the dominant bottleneck at scale and the main reason MoE training is hard.
- Contrasts MoE tradeoffs honestly — MoE wins at pretraining scale but loses at single-GPU deployment and small model sizes (<7B). Mentions DeepSeek-V3, Mixtral 8x7B, Switch Transformer, and GLaM by name.
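For the load-balancing bullet above, here is a compact sketch of the Switch-Transformer-style auxiliary loss alpha * N * Sum(f_i * P_i); the function name and the alpha default are illustrative choices, not a reference implementation:

```python
# Auxiliary load-balancing loss: alpha * N * sum_i(f_i * P_i), where f_i is the
# fraction of tokens routed (top-1) to expert i and P_i is the mean router
# probability assigned to expert i. It is minimized when both are uniform at 1/N.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """router_logits: (tokens, n_experts) pre-softmax scores from the router."""
    n_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)             # (tokens, n_experts)
    top1 = probs.argmax(dim=-1)                           # hard assignment per token
    f = F.one_hot(top1, n_experts).float().mean(dim=0)    # fraction of tokens per expert
    P = probs.mean(dim=0)                                  # mean router probability per expert
    return alpha * n_experts * torch.sum(f * P)
```

Because f_i comes from hard top-1 assignments, only P_i carries gradient; the loss therefore lowers the router probability of currently over-loaded experts, which evens out assignments over training. DeepSeek-V3 instead nudges a per-expert bias on the routing scores up or down after each step depending on load, avoiding the extra loss term.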
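And for the capacity-factor bullet, a sketch of how the factor turns into a per-expert token budget; real systems differ in how they prioritize which overflow tokens to drop (this version simply keeps the first C tokens in sequence order):

```python
# Per-expert capacity: each expert processes at most C tokens per batch, where
# C = capacity_factor * tokens / n_experts. Tokens routed beyond C overflow and are
# typically dropped, passing through the layer via the residual connection only.
import math
import torch

def expert_capacity(n_tokens: int, n_experts: int, capacity_factor: float) -> int:
    return math.ceil(capacity_factor * n_tokens / n_experts)

def drop_overflow(top1_idx: torch.Tensor, n_experts: int, capacity: int) -> torch.Tensor:
    """top1_idx: (tokens,) expert chosen per token. Returns a bool mask of kept tokens."""
    keep = torch.zeros_like(top1_idx, dtype=torch.bool)
    for e in range(n_experts):
        positions = (top1_idx == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True   # first C tokens fit; the rest overflow
    return keep

# Example: 4096 tokens, 8 experts, capacity factor 1.25 -> each expert accepts at most 640 tokens.
```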