
Online Learning and Continual Training: Beyond Scheduled Batch Retraining

Production ML systems can't wait for weekly retraining. This guide covers the full continual training spectrum: warm-start fine-tuning, FTRL online learning, and Thompson Sampling bandits — with tiered architecture patterns from Meta's ads system and Google's CTR prediction.

Tags: Online Learning, Continual Training, Warm-Start, Thompson Sampling, Contextual Bandits, Online SGD, Catastrophic Forgetting, CTR Prediction, Exploration-Exploitation, Production ML, Streaming ML, FTRL (Follow the Regularized Leader)

The Spectrum of Continual Training — Not Just Batch Retraining

Most ML system design (MLSD) resources describe model deployment as a one-shot event followed by "retrain periodically." In production, ML is a continuous process, and the retraining strategy is itself a design decision with major performance and cost implications.

The full spectrum of continual training approaches, from least to most adaptive:

| Approach | Latency of adaptation | Cost | Best for |
| --- | --- | --- | --- |
| Full retraining from scratch | Hours to days | High | Models with stable distributions, no concept drift |
| Warm-start fine-tuning | Minutes to hours | Medium | Regular production retraining on updated data |
| Mini-batch online SGD | Seconds to minutes | Low | Fast-changing distributions; streaming updates |
| Per-example online SGD (FTRL) | Milliseconds | Very low | Extreme freshness; feature weights update per event |
| Contextual bandits | Milliseconds | Varies | Exploration-exploitation when labels arrive fast |
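The per-example row deserves a concrete illustration. Below is a minimal sketch of per-coordinate FTRL-Proximal for logistic regression, the update behind Google's CTR prediction work mentioned in the intro; the class name and the hyperparameter defaults (`alpha`, `beta`, `l1`, `l2`) are illustrative choices, not values from the guide:

```python
import math

class FTRLProximal:
    """Sketch of per-coordinate FTRL-Proximal for logistic regression.

    Each feature carries two accumulators (z, n); weights are derived
    lazily, so unseen or L1-pruned features cost nothing to store.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature adjusted gradient sums
        self.n = {}  # per-feature squared gradient sums

    def _weight(self, f):
        z = self.z.get(f, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps rarely useful features at exactly zero
        n = self.n.get(f, 0.0)
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / denom

    def predict(self, x):
        """x is a sparse dict {feature: value}; returns P(y = 1)."""
        logit = sum(self._weight(f) * v for f, v in x.items())
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, x, y):
        """One per-example update; y is 0 or 1. Runs in O(nnz(x))."""
        p = self.predict(x)
        for f, v in x.items():
            g = (p - y) * v  # logistic loss gradient for this coordinate
            n_old = self.n.get(f, 0.0)
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            self.z[f] = self.z.get(f, 0.0) + g - sigma * self._weight(f)
            self.n[f] = n_old + g * g
```

Because each update touches only the features present in the event, this is what makes millisecond-latency weight freshness feasible at ad-serving scale.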

The appropriate strategy depends on three factors:

- Label arrival latency: if labels take 30 days to arrive, online SGD on fresh labels is impossible.
- Distribution change velocity: fraud patterns change daily; user taste changes monthly.
- Tolerance for catastrophic forgetting: online models trained only on recent data can forget important historical patterns.
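When labels do arrive fast (an ad click lands within seconds of serving), the bandit row of the table becomes viable. A minimal sketch of Thompson Sampling with Beta-Bernoulli posteriors, the non-contextual core of the contextual-bandit approach named in the intro; the class and simulation setup are illustrative, not from the guide:

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling: each arm keeps a Beta posterior
    over its success rate (e.g. CTR) and we play the arm whose sampled
    rate is highest, balancing exploration and exploitation."""

    def __init__(self, n_arms):
        self.successes = [1.0] * n_arms  # Beta(1, 1) uniform prior
        self.failures = [1.0] * n_arms

    def select_arm(self):
        # Sample one plausible rate per arm from its posterior; play the argmax.
        draws = [random.betavariate(s, f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        # Binary reward observed seconds after serving (click / no click).
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

# Illustrative simulation: two arms with hypothetical true rates.
random.seed(0)
ts = ThompsonSampler(2)
true_rate = [0.7, 0.3]
pulls = [0, 0]
for _ in range(1000):
    arm = ts.select_arm()
    pulls[arm] += 1
    ts.update(arm, random.random() < true_rate[arm])
```

As the posteriors sharpen, the sampler concentrates traffic on the better arm while still occasionally probing the worse one, which is exactly the exploration-exploitation trade the table's last row refers to.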

Most production systems at scale use a tiered strategy: a base model retrained weekly in batch to capture stable long-term patterns, plus a shallow online layer updated continuously to capture recent drift. This hybrid combines the stability of batch training with the freshness of online updates.
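The tiered pattern above can be sketched as a frozen base model whose logit is corrected by a small, continuously updated online layer. This is a hypothetical minimal implementation, not the architecture of any specific system named in the guide; `base_model` and the learning rate are assumed placeholders:

```python
import math

class TieredScorer:
    """Sketch of a two-tier scorer: a frozen base model (retrained weekly
    in batch) plus a shallow linear correction layer updated per event."""

    def __init__(self, base_model, lr=0.05):
        self.base_model = base_model  # callable: features -> logit; frozen between batch retrains
        self.bias = 0.0               # online-updated global correction
        self.w = {}                   # sparse per-feature corrections for recent drift
        self.lr = lr

    def predict(self, features):
        # Combined logit: stable long-term signal + fresh correction.
        logit = self.base_model(features) + self.bias
        for f, v in features.items():
            logit += self.w.get(f, 0.0) * v
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, features, label):
        # Per-event SGD on the combined logit; only the shallow layer moves,
        # so the base model's long-term patterns cannot be forgotten.
        g = self.predict(features) - label
        self.bias -= self.lr * g
        for f, v in features.items():
            self.w[f] = self.w.get(f, 0.0) - self.lr * g * v
```

Because the online layer is reset (or re-initialized) each time the base model is retrained, drift corrections never accumulate unboundedly, and catastrophic forgetting is confined to the shallow tier.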
