Optimization & Training: SGD to AdamW, Learning Rate Scheduling, and Gradient Flow
The mechanics behind every successful model training run. Covers SGD with momentum, Adam, AdamW, and their mathematical differences; learning rate warmup and cosine decay schedules (with production evidence); gradient clipping; mixed-precision training; and the most common training failure modes with their exact fixes.
Gradient Descent: From SGD to Modern Adaptive Optimizers
All deep learning optimization starts with gradient descent: move parameters in the direction that most reduces the loss. The variants differ in how they determine the size and direction of each step from the gradient.
Stochastic Gradient Descent (SGD):
θₜ₊₁ = θₜ - η·∇L(θₜ)
Where η is the learning rate. 'Stochastic' means computing the gradient on a mini-batch, not the full dataset. SGD with batch size B approximates the full gradient with variance ∝ 1/B. Larger batches → less noisy gradient estimate, but more compute per step and diminishing returns on generalization.
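A minimal sketch of one vanilla mini-batch SGD step in NumPy. The names here (sgd_step, the toy regression loss, B=32) are illustrative choices, not part of any library.

```python
import numpy as np

def sgd_step(params: np.ndarray, grad: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """theta <- theta - lr * grad, where grad was computed on a single mini-batch."""
    return params - lr * grad

# Toy example: mini-batch gradient of the MSE loss L(w) = mean((X @ w - y)^2).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)   # one mini-batch of B = 32
w = np.zeros(3)
grad = 2 * X.T @ (X @ w - y) / len(y)                  # noisy estimate of the full gradient
w = sgd_step(w, grad, lr=0.1)
```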
SGD with Momentum (the production default for CNNs):
vₜ = β·vₜ₋₁ + ∇L(θₜ) (accumulate gradient history)
θₜ₊₁ = θₜ - η·vₜ
With β=0.9 (the standard value), the effective gradient is a weighted sum of roughly the last 10 gradients (an averaging horizon of 1/(1-0.9) = 10 steps). Momentum smooths noisy gradient directions and accumulates velocity along consistent ones, making learning faster on gentle slopes and more stable in ravines.
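A sketch of the momentum update exactly as written above (this is the same form PyTorch's SGD momentum uses, with no dampening). Function and variable names are illustrative.

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.1, beta=0.9):
    """v <- beta*v + grad;  theta <- theta - lr*v."""
    velocity = beta * velocity + grad
    params = params - lr * velocity
    return params, velocity

w = np.zeros(3)
v = np.zeros_like(w)                              # velocity starts at zero
for grad in [np.array([1.0, -0.5, 0.2])] * 20:    # a constant gradient direction:
    w, v = momentum_step(w, grad, v)              # velocity approaches grad/(1-beta) = 10x the gradient
```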
Adam (Kingma & Ba, 2015): Adapts the learning rate per parameter based on historical gradient statistics.
mₜ = β₁·mₜ₋₁ + (1-β₁)·gₜ (1st moment: gradient mean, β₁=0.9)
vₜ = β₂·vₜ₋₁ + (1-β₂)·gₜ² (2nd moment: uncentered gradient variance, β₂=0.999)
m̂ₜ = mₜ/(1-β₁ᵗ) (bias correction)
v̂ₜ = vₜ/(1-β₂ᵗ) (bias correction)
θₜ₊₁ = θₜ - η·m̂ₜ/(√v̂ₜ + ε) (ε=1e-8)
Parameters with large gradient variance (unpredictable direction) get a small effective learning rate. Parameters with consistent gradient direction get a larger effective step. Adam converges fast, requires less learning rate tuning, and is the default for transformers, LLMs, and anything with embedding layers.
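A minimal sketch of one Adam step following the five equations above. The names (adam_step, m, v, t) are illustrative, not a library API; t is the 1-indexed step count used for bias correction.

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # 1st moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2        # 2nd moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)                   # bias correction: counteracts the zero init
    v_hat = v / (1 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

w = np.zeros(3)
m, v = np.zeros_like(w), np.zeros_like(w)
for t, grad in enumerate([np.array([0.5, -2.0, 0.01])] * 5, start=1):
    w, m, v = adam_step(w, grad, m, v, t)        # each parameter gets its own effective step size
```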
When SGD > Adam: For CNNs (ResNet, EfficientNet), SGD with momentum and a carefully tuned learning rate schedule often generalizes better than Adam; Adam's per-parameter adaptivity can settle into sharper minima that generalize worse. SGD is the standard for ImageNet training. Adam is the standard for transformers.
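A usage sketch of the two conventions using PyTorch's built-in optimizers. The models and hyperparameters below are placeholders, and AdamW (Adam with decoupled weight decay, named in this guide's title) stands in for the transformer-side choice; plain torch.optim.Adam would match the text equally well.

```python
import torch
import torch.nn as nn

# CNN convention: SGD with momentum plus weight decay, learning rate tuned via a schedule.
cnn = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
cnn_opt = torch.optim.SGD(cnn.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Transformer convention: adaptive optimizer with the standard betas from the equations above.
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)
enc_opt = torch.optim.AdamW(encoder.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.01)
```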