Online Learning and Continual Training: Beyond Scheduled Batch Retraining
Production ML systems can't wait for weekly retraining. This guide covers the full continual training spectrum: warm-start fine-tuning, FTRL online learning, and Thompson Sampling bandits — with tiered architecture patterns from Meta's ads system and Google's CTR prediction.
The Spectrum of Continual Training — Not Just Batch Retraining
Most MLSD resources describe model deployment as a one-shot event followed by "retrain periodically." In production, training is a continuous process, and the retraining strategy is itself a design decision with major performance and cost implications.
The full spectrum of continual training approaches, from least to most adaptive:
| Approach | Latency of adaptation | Cost | Best for |
|---|---|---|---|
| Full retraining from scratch | Hours to days | High | Models with stable distributions, no concept drift |
| Warm-start fine-tuning | Minutes to hours | Medium | Regular production retraining on updated data |
| Mini-batch online SGD | Seconds to minutes | Low | Fast-changing distributions; streaming updates |
| Per-example online SGD (FTRL) | Milliseconds | Very low | Extreme freshness; feature weights update per event |
| Contextual bandits | Milliseconds | Varies | Exploration-exploitation when labels arrive fast |
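To make the per-example end of the spectrum concrete, here is a minimal sketch of per-coordinate FTRL-Proximal for logistic regression, in the style of McMahan et al.'s CTR-prediction work. The class name and hyperparameter defaults are illustrative, not from any particular library; the point is that a single event updates only the weights of the features it touches, in microseconds.

```python
import math

class FTRLProximal:
    """Sketch of per-coordinate FTRL-Proximal logistic regression.
    Hyperparameter names (alpha, beta, l1, l2) follow common convention;
    the defaults here are illustrative, not tuned."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated (adjusted) gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps rarely useful weights at exactly zero
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2
        )

    def predict(self, x):
        """x is a sparse example: {feature_index: value}. Returns P(y=1)."""
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        """One online step on a single (x, y) event, y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of log loss w.r.t. weight i
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            # Adjust z so the closed-form weight stays near the SGD path
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
```

In a serving loop, each impression-with-label would call `update` as soon as its label arrives, which is why the table lists millisecond-scale adaptation latency for this approach.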
The appropriate strategy depends on three factors:

- **Label arrival latency** — if labels take 30 days to arrive, online SGD on fresh labels is impossible.
- **Distribution change velocity** — fraud patterns change daily; user tastes change monthly.
- **Tolerance for catastrophic forgetting** — online models trained only on recent data can forget important historical patterns.
Most production systems at scale use a tiered strategy: a base model retrained weekly in batch captures stable long-term patterns, while a shallow online layer updated continuously captures recent drift. The hybrid combines long-term stability with fast adaptation.
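One minimal way to realize the tiered pattern is to freeze the weekly batch model and stack a tiny online correction on top of its score. The sketch below assumes a `base_model` object with a `predict` method returning a real-valued score; the online layer is just a logistic scale-and-bias recalibration updated per event with SGD. All names here are illustrative.

```python
import math

class TieredScorer:
    """Sketch of a tiered (batch + online) scorer.

    The base model is retrained weekly in batch and treated as frozen here;
    the online layer is a two-parameter logistic recalibration of the base
    score, updated per labeled event to track recent drift."""

    def __init__(self, base_model, lr=0.05):
        self.base = base_model      # assumed: .predict(x) -> real-valued score
        self.a, self.b = 1.0, 0.0   # online scale and bias over the base score
        self.lr = lr

    def score(self, x):
        s = self.base.predict(x)    # stable long-term signal
        return 1.0 / (1.0 + math.exp(-(self.a * s + self.b)))

    def online_update(self, x, y):
        """One SGD step on log loss, applied as soon as the label arrives."""
        s = self.base.predict(x)
        p = 1.0 / (1.0 + math.exp(-(self.a * s + self.b)))
        g = p - y                   # log-loss gradient w.r.t. the logit
        self.a -= self.lr * g * s
        self.b -= self.lr * g
```

The design choice worth noting: because only `a` and `b` change online, the layer can drift quickly without any risk of catastrophically forgetting what the base model learned, and the weekly batch retrain simply resets the pair to a fresh base.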