Machine Learning · Intermediate

Regularization in ML: Controlling Variance Without Killing Signal

A production-first guide to L1, L2, Elastic Net, dropout, and early stopping. Learn the derivation intuition, failure modes, and how to choose regularization under different data regimes and model families.

35 min read · 2 sections · 1 interview question

Tags: Regularization · L1 · L2 · Elastic Net · Dropout · Early Stopping · Overfitting · Bias-Variance · Weight Decay · Model Generalization

Define: What Regularization Actually Does

Regularization constrains model capacity so that learned patterns generalize beyond the training data. The production framing: regularization is not a mathematical ornament but a risk-control mechanism against variance, keeping the model from memorizing training noise instead of learning the underlying signal.

The fundamental problem regularization addresses: modern ML models are overparameterized by design. A neural network with millions of parameters trained on thousands of examples has enough capacity to memorize every training example, noise included. Without constraint, gradient descent finds this memorization solution because it minimizes training loss. Regularization modifies the objective to penalize complex solutions, making the optimizer prefer simpler explanations that generalize.
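
In standard notation (λ is the penalty strength and Ω the complexity measure; the two norms below correspond to the L1 and L2 penalties discussed next), the modified objective reads:

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i; w),\, y_i\bigr) \;+\; \lambda\,\Omega(w),
\qquad \Omega(w) = \lVert w \rVert_1 \ \text{(L1)} \quad\text{or}\quad \lVert w \rVert_2^2 \ \text{(L2)}
```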

L1 (Lasso) adds a penalty proportional to the sum of absolute weight values. The key geometric insight: the L1 ball in weight space has corners on the coordinate axes. When the loss contours touch the L1 constraint region, they most often do so at a corner, meaning at least one weight is exactly zero. This is why L1 induces sparsity: the gradient of the penalty λ|w| is a constant ±λ regardless of w's magnitude (contrast with L2, where the gradient is proportional to w). For a weight near zero, the constant L1 pull toward zero outweighs the data-fit gradient and drives the weight exactly to zero. L1 is implicit feature selection.
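
To make the mechanics concrete, here is a minimal NumPy sketch (function name and toy numbers are hypothetical) of the proximal, soft-thresholding form of the L1 update: any weight whose data-fit pull is smaller than the penalty step lands exactly at zero.

```python
import numpy as np

def l1_prox_step(w, grad, lr, lam):
    """One ISTA-style proximal step for: data_loss + lam * ||w||_1.

    First a plain gradient step on the data-fit term, then
    soft-thresholding: shrink each weight toward zero by lr * lam,
    clamping to exactly zero when the shrinkage overshoots.
    """
    w = w - lr * grad
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([0.80, 0.05, -0.03])
grad = np.array([0.10, 0.01, -0.01])        # toy data-fit gradients
print(l1_prox_step(w, grad, lr=0.1, lam=0.5))
# [ 0.74  0.   -0.  ]  -- the two small weights are driven exactly to zero
```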

L2 (Ridge) penalizes squared weight values. The gradient ∂(λw²)/∂w = 2λw is proportional to w, so large weights are penalized heavily but small weights receive small penalties. L2 never drives weights to exactly zero — it shrinks all weights by a multiplicative factor each step, distributing influence across many features rather than zeroing unimportant ones. This produces smooth, stable solutions ideal when all features contribute some signal.
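
The matching L2 sketch (same hypothetical setup) shows the update factoring into a multiplicative shrink: every weight moves toward zero proportionally, and none lands exactly on it.

```python
import numpy as np

def l2_step(w, grad, lr, lam):
    """One gradient step for: data_loss + lam * ||w||_2^2.

    The penalty gradient 2 * lam * w folds into a multiplicative
    shrink factor (1 - 2 * lr * lam) applied to every weight.
    """
    return w * (1.0 - 2.0 * lr * lam) - lr * grad

w = np.array([0.80, 0.05, -0.03])
grad = np.array([0.10, 0.01, -0.01])
print(l2_step(w, grad, lr=0.1, lam=0.5))
# [ 0.71   0.044 -0.026]  -- everything shrinks, nothing hits zero
```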

Elastic Net combines both: L1 for sparsity on irrelevant features, L2 for stability when relevant features are correlated. Pure L1 behaves erratically with highly correlated features, arbitrarily zeroing all but one of a correlated group; L2's grouping effect on correlated features, combined with L1's sparsity, produces more stable feature selection in this regime.
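
A short scikit-learn sketch of that regime (synthetic data; the alpha and l1_ratio values are illustrative, not recommendations): two nearly identical features, where the elastic-net mix tends to share weight across the pair instead of arbitrarily zeroing one.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # highly correlated pair
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# l1_ratio=1.0 is pure Lasso, 0.0 is pure Ridge; 0.5 blends both.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_[:2])   # weight tends to be split across the correlated pair
```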

Weight decay vs. L2 regularization: for plain SGD these are mathematically equivalent. Weight decay multiplies each weight by (1 - ηλ) per step (η is the learning rate), which is exactly the update produced by L2's gradient ∂(λw²)/∂w = 2λw once the factor of 2 is absorbed into λ. For Adam, however, they differ. Adam's adaptive learning rates scale the effective update: when L2 is applied as a gradient penalty, Adam's adaptive scaling also rescales the regularization, weakening it for rarely-updated parameters. AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the gradient update, applying the decay directly to the weights after the Adam step. This is why AdamW consistently outperforms Adam+L2 on transformers: the regularization strength stays consistent regardless of gradient history.
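
In PyTorch the difference is a constructor choice (the model and hyperparameters here are placeholders): Adam's weight_decay argument implements the coupled L2-as-gradient-penalty form, while AdamW applies the decay directly to the weights after the adaptive step.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # placeholder model

# Coupled: decay is added to the gradient, so Adam's adaptive scaling
# also rescales the regularization (weaker for rarely-updated params).
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Decoupled: decay is applied straight to the weights after the Adam
# step, so its strength is independent of gradient history.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```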
