
Reinforcement Learning for ML Systems: Bandits, RLHF, PPO, and DPO

RL concepts that directly appear in production ML interviews: multi-armed bandits for exploration in recommenders, the RLHF pipeline powering ChatGPT and Claude (SFT → reward model → PPO), the PPO objective with KL divergence penalty, DPO as a simpler RLHF alternative, and contextual bandits for content ranking. Focused on practical RL for ML engineers, not robotics.

Tags: Reinforcement Learning, RLHF, PPO, DPO, Multi-Armed Bandit, Contextual Bandit, Reward Model, Exploration, Thompson Sampling, UCB, Policy Gradient, KL Divergence, Alignment, Fine-Tuning

RL for ML Engineers: The Practical Subset That Actually Matters

Full reinforcement learning — robotics, game playing, continuous control — is rarely asked about in ML engineering interviews outside of specialized roles. What does come up, increasingly in every senior ML interview, falls into four areas:

  1. Multi-armed bandits: Online exploration/exploitation in recommendation systems, ad serving, and A/B testing (see the Thompson sampling sketch after this list).
  2. RLHF (Reinforcement Learning from Human Feedback): The training pipeline behind ChatGPT, Claude, Gemini — SFT → reward model → PPO.
  3. DPO (Direct Preference Optimization): A simpler alternative to RLHF that skips the explicit RL step (the PPO and DPO losses are both sketched after the terminology list below).
  4. Contextual bandits: Personalized recommendation with exploration.
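
To make the bandit idea concrete, here is a minimal Thompson sampling sketch for a Bernoulli bandit, where each pull yields a click (reward 1) or a skip (reward 0). The class and method names are illustrative, not taken from any specific library:

```python
import random

class ThompsonSamplingBandit:
    """Bernoulli Thompson sampling with a Beta(1, 1) prior per arm."""

    def __init__(self, n_arms: int):
        self.successes = [1] * n_arms  # Beta alpha parameter per arm
        self.failures = [1] * n_arms   # Beta beta parameter per arm

    def select_arm(self) -> int:
        # Sample a plausible click-rate from each arm's posterior, then
        # act greedily on the samples. Uncertain arms occasionally draw
        # high samples, which is where exploration comes from.
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, reward: int) -> None:
        # Bayesian posterior update: reward is 1 for a click, 0 for a skip.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

In a recommender, each arm would be a candidate item or ranking policy, and the same select/observe/update loop runs online. A contextual bandit extends this by conditioning the arm-selection model on user and item features.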

Core RL terminology you must know:

  • Policy π(a|s): A function mapping states to actions. In LLM alignment: the policy is the language model, mapping the prompt (state) to the next token (action).
  • Reward r(s, a): Feedback signal after taking action a in state s. In recommendation: click = +1, skip = 0. In RLHF: the reward model's scalar score.
  • Value function V(s): Expected cumulative reward from state s under policy π. Used in actor-critic methods (PPO) to reduce gradient variance.
  • Exploration vs. exploitation: Exploration means trying new actions to discover potentially better outcomes; exploitation means taking the action that maximizes the current expected reward. This is the fundamental tradeoff in all online learning systems.
  • Regret: Cumulative loss from suboptimal actions compared to the optimal policy. Bandit algorithms are evaluated on total regret over T rounds.
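
The PPO objective with its KL divergence penalty, which keeps the fine-tuned policy close to the SFT reference model, can be written as a loss function. The sketch below is a simplified illustration, assuming per-sequence log-probabilities and advantages are already computed; the function and argument names are hypothetical, and in production RLHF the KL term is often folded into the per-token reward rather than added to the loss:

```python
import torch

def ppo_rlhf_loss(logp_new, logp_old, logp_ref, advantages,
                  clip_eps=0.2, kl_coef=0.1):
    # Probability ratio between the current policy and the policy
    # that generated the rollouts.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate objective (maximized, so negated for a loss).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # KL penalty (a simple estimator) keeping the policy close to the
    # frozen SFT reference, so the LM does not drift into reward hacking.
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```

DPO collapses the reward model and PPO steps into a single supervised loss over preference pairs. A minimal sketch under the same assumptions (summed log-probabilities per response; names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how much more the policy prefers each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry preference loss: maximize the margin between the
    # chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```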