Optimization & Training: SGD to AdamW, Learning Rate Scheduling, and Gradient Flow
The mechanics behind every successful model training run. Covers SGD with momentum, Adam, AdamW, and their mathematical differences; learning rate warmup and cosine decay schedules (with production evidence); gradient clipping; mixed-precision training; and the most common training failure modes with their exact fixes.
Gradient Descent: From SGD to Modern Adaptive Optimizers
All deep learning optimization starts with gradient descent: move parameters in the direction that most reduces the loss. The variants differ in how they determine the size and direction of each step from the gradient.
Stochastic Gradient Descent (SGD):
θₜ₊₁ = θₜ - η·∇L(θₜ)
Where η is the learning rate. 'Stochastic' means computing the gradient on a mini-batch, not the full dataset. SGD with batch size B approximates the full gradient with variance ∝ 1/B. Larger batches → less noisy gradient estimate, but more compute per step and diminishing returns on generalization.
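A minimal sketch of one vanilla mini-batch SGD step in NumPy. The names here (sgd_step, the toy regression loss, B=32) are illustrative choices, not part of any library.

```python
import numpy as np

def sgd_step(params: np.ndarray, grad: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """theta <- theta - lr * grad, where grad was computed on a single mini-batch."""
    return params - lr * grad

# Toy example: mini-batch gradient of the MSE loss L(w) = mean((X @ w - y)^2).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)   # one mini-batch of B = 32
w = np.zeros(3)
grad = 2 * X.T @ (X @ w - y) / len(y)                  # noisy estimate of the full gradient
w = sgd_step(w, grad, lr=0.1)
```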
SGD with Momentum (the production default for CNNs):
vₜ = β·vₜ₋₁ + ∇L(θₜ) (accumulate gradient history)
θₜ₊₁ = θₜ - η·vₜ
With β=0.9 (the standard value), the effective gradient is a weighted sum of roughly the last 10 gradients (an averaging horizon of 1/(1-0.9) = 10 steps). Momentum smooths noisy gradient directions and accumulates velocity along consistent ones, making learning faster on gentle slopes and more stable in ravines.
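A sketch of the momentum update exactly as written above (this is the same form PyTorch's SGD momentum uses, with no dampening). Function and variable names are illustrative.

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.1, beta=0.9):
    """v <- beta*v + grad;  theta <- theta - lr*v."""
    velocity = beta * velocity + grad
    params = params - lr * velocity
    return params, velocity

w = np.zeros(3)
v = np.zeros_like(w)                              # velocity starts at zero
for grad in [np.array([1.0, -0.5, 0.2])] * 20:    # a constant gradient direction:
    w, v = momentum_step(w, grad, v)              # velocity approaches grad/(1-beta) = 10x the gradient
```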
Adam (Kingma & Ba, 2015): Adapts the learning rate per parameter based on historical gradient statistics.
mₜ = β₁·mₜ₋₁ + (1-β₁)·gₜ (1st moment: gradient mean, β₁=0.9)
vₜ = β₂·vₜ₋₁ + (1-β₂)·gₜ² (2nd moment: uncentered gradient variance, β₂=0.999)
m̂ₜ = mₜ/(1-β₁ᵗ) (bias correction)
v̂ₜ = vₜ/(1-β₂ᵗ) (bias correction)
θₜ₊₁ = θₜ - η·m̂ₜ/(√v̂ₜ + ε) (ε=1e-8)
Parameters with large gradient variance (unpredictable direction) get a small effective learning rate. Parameters with consistent gradient direction get a larger effective step. Adam converges fast, requires less learning rate tuning, and is the default for transformers, LLMs, and anything with embedding layers.
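A minimal sketch of one Adam step following the five equations above. The names (adam_step, m, v, t) are illustrative, not a library API; t is the 1-indexed step count used for bias correction.

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # 1st moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2        # 2nd moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)                   # bias correction: counteracts the zero init
    v_hat = v / (1 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

w = np.zeros(3)
m, v = np.zeros_like(w), np.zeros_like(w)
for t, grad in enumerate([np.array([0.5, -2.0, 0.01])] * 5, start=1):
    w, m, v = adam_step(w, grad, m, v, t)        # each parameter gets its own effective step size
```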
When SGD > Adam: For CNNs (ResNet, EfficientNet), SGD with momentum and a carefully tuned learning rate schedule often generalizes better than Adam; Adam's per-parameter adaptivity can settle into sharper minima that generalize worse. SGD is the standard for ImageNet training. Adam is the standard for transformers.
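A usage sketch of the two conventions using PyTorch's built-in optimizers. The models and hyperparameters below are placeholders, and AdamW (Adam with decoupled weight decay, named in this guide's title) stands in for the transformer-side choice; plain torch.optim.Adam would match the text equally well.

```python
import torch
import torch.nn as nn

# CNN convention: SGD with momentum plus weight decay, learning rate tuned via a schedule.
cnn = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
cnn_opt = torch.optim.SGD(cnn.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Transformer convention: adaptive optimizer with the standard betas from the equations above.
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
enc_opt = torch.optim.AdamW(encoder.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.01)
```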