Reinforcement Learning for ML Systems: Bandits, RLHF, PPO, and DPO
RL concepts that directly appear in production ML interviews: multi-armed bandits for exploration in recommenders, the RLHF pipeline powering ChatGPT and Claude (SFT → reward model → PPO), the PPO objective with KL divergence penalty, DPO as a simpler RLHF alternative, and contextual bandits for content ranking. Focused on practical RL for ML engineers, not robotics.
RL for ML Engineers: The Practical Subset That Actually Matters
Full reinforcement learning (robotics, game playing, continuous control) rarely comes up in ML engineering interviews outside of specialized roles. What does come up, increasingly at every senior ML interview:
- Multi-armed bandits: Online exploration/exploitation in recommendation systems, ad serving, and A/B testing; a minimal sketch follows this list.
- RLHF (Reinforcement Learning from Human Feedback): The training pipeline behind ChatGPT, Claude, and Gemini (SFT → reward model → PPO); the PPO objective is written out below.
- DPO (Direct Preference Optimization): A simpler alternative to RLHF that skips the explicit RL step; its loss is written out below.
- Contextual bandits: Personalized recommendation with exploration.
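To make the bandit items concrete, here is a minimal epsilon-greedy sketch. The click-through rates, epsilon, and horizon are hypothetical values chosen purely for illustration:

```python
import random

# Minimal epsilon-greedy bandit over three arms (e.g., three candidate
# recommendations). TRUE_CTR is unknown to the learner; all numbers here
# are hypothetical.
TRUE_CTR = [0.02, 0.05, 0.03]
EPSILON = 0.1      # fraction of rounds spent exploring
T = 100_000        # number of rounds

counts = [0] * len(TRUE_CTR)    # pulls per arm
values = [0.0] * len(TRUE_CTR)  # running mean reward per arm
total_reward = 0.0

for _ in range(T):
    if random.random() < EPSILON:
        arm = random.randrange(len(TRUE_CTR))                     # explore
    else:
        arm = max(range(len(TRUE_CTR)), key=lambda a: values[a])  # exploit
    reward = 1.0 if random.random() < TRUE_CTR[arm] else 0.0      # click/skip
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]           # incremental mean
    total_reward += reward

# Regret relative to always pulling the best arm in hindsight
regret = T * max(TRUE_CTR) - total_reward
print(f"estimated CTRs: {[round(v, 4) for v in values]}")
print(f"empirical regret over {T} rounds: {regret:.1f}")
```

Epsilon-greedy is the simplest exploration policy; UCB and Thompson sampling typically achieve lower regret by concentrating exploration on arms whose estimates are still uncertain.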
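For RLHF and DPO, the objectives referenced above can be written out as follows. This is a common textbook formulation, not any vendor's exact recipe: π_θ is the policy being trained, π_ref the frozen SFT reference, r_φ the learned reward model, β the KL coefficient, and (x, y_w, y_l) a prompt with preferred and rejected responses.

```latex
% RLHF policy-optimization objective (the PPO step) with KL penalty:
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r_\phi(x, y)\big]
\;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[
D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big]

% DPO loss, which fits the same preference data without an explicit RL step:
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[
\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
```

The KL term keeps the policy close to the reference model so reward hacking does not degrade fluency; DPO bakes that constraint into a single supervised-style loss.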
Core RL terminology you must know:
- Policy π(a|s): A distribution over actions given state s. In LLM alignment, the policy is the language model, mapping the prompt (state) to the next token (action).
- Reward r(s, a): Feedback signal after taking action a in state s. In recommendation: click = +1, skip = 0. In RLHF: the reward model's scalar score.
- Value function V(s): Expected cumulative reward from state s under policy π. Used in actor-critic methods (PPO) to reduce gradient variance.
- Exploration vs exploitation: Exploration means trying new actions to discover potentially better outcomes; exploitation means taking the action that maximizes current expected reward. This is the fundamental tradeoff in all online learning systems.
- Regret: Cumulative reward lost to suboptimal actions relative to the optimal policy. Bandit algorithms are evaluated on total regret over T rounds; the definition is formalized below.
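In symbols, the value function and regret above look like this (standard notation; γ ∈ [0, 1) is the discount factor, μ_a is arm a's mean reward, and μ* = max_a μ_a):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
\mathrm{Regret}(T) = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]
```

Good bandit algorithms achieve regret that grows sublinearly in T, meaning the average per-round loss from exploration vanishes over time.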