

GenAI & Agents · Advanced

RLHF and DPO: Aligning LLMs with Human Preferences

How RLHF transforms a base LLM into a helpful, harmless assistant — and why DPO has largely replaced PPO for this task. Covers reward model training, PPO instability and reward hacking, Constitutional AI, and when PPO still wins.

55 min read · 3 sections · 1 interview question
RLHF · DPO · PPO · Reward Model · Alignment · SFT · Constitutional AI · InstructGPT · Bradley-Terry · Preference Data · LLM Fine-Tuning · Reward Hacking

Why Alignment Is an Engineering Problem, Not a Safety Platitude

A base LLM is trained to predict the next token from a massive corpus of internet text. That corpus includes malware documentation, conspiracy theories, sycophantic social media posts, and harmful content. The model's objective — minimize next-token prediction loss — is perfectly satisfied by modeling ALL of that text faithfully. Give a base model the prompt 'How do I synthesize methamphetamine?' and it will complete it, because that text appears in the training distribution.

Alignment is the engineering discipline of steering the model toward helpful, harmless, and honest (HHH) behavior — the framing Anthropic uses across its alignment work, including the Constitutional AI paper (Bai et al., 2022). Without alignment, a capable base model is roughly as dangerous as unfiltered internet access, because it has no preference for helpfulness over harm.

The naive fix — filtering the training data — doesn't work at scale. Harmful content is interleaved with useful content, and overly aggressive filtering degrades capabilities. The actual solution is a post-training pipeline that reshapes the model's behavior distribution toward preferred outputs.

Alignment training must produce two key properties in a well-aligned model: (1) instruction-following — the model should attempt the user's actual request rather than drift into related-but-different completions — and (2) preference alignment — among completions that all attempt the request, the model should prefer ones that are accurate, safe, and helpful. Instruction tuning (supervised fine-tuning, SFT) targets property 1; RLHF and DPO target property 2. Production systems (GPT-4, Claude, Llama 3) apply both in sequence.
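To make the division of labor concrete, here is a minimal sketch of what the preference-alignment stage consumes and optimizes: a preference pair (prompt, chosen completion, rejected completion) and a DPO-style loss computed from per-sequence log-probabilities under the policy being trained and a frozen reference (SFT) model. The names PreferencePair and dpo_loss and the beta value are illustrative assumptions, not the article's code.

```python
# A minimal sketch, assuming a preference dataset of (prompt, chosen, rejected)
# triples; names and values here are illustrative, not the article's code.
import math
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str     # completion the labeler preferred
    rejected: str   # completion the labeler rejected

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed token log-probabilities under the
    policy being trained and a frozen reference (SFT) model."""
    # Implicit reward of a completion = beta * (policy log-prob - reference log-prob)
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): push the chosen completion's implicit reward above
    # the rejected one's, while staying anchored to the reference model.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: the policy already assigns slightly higher relative likelihood to
# the chosen completion, so the loss is below log(2) ~ 0.693.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-14.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-13.5))
```

The beta coefficient controls how far the policy's implicit reward may drift from the reference model; the same (chosen, rejected) pair format is also what a reward model is trained on under the Bradley-Terry objective in the RLHF pipeline.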

IMPORTANT

What Interviewers Are Testing in Alignment Questions

FAANG ML interviews distinguish candidates on whether they understand why RLHF exists, not just the pipeline steps. The key insight: the reward model is a proxy for human preference, and the policy is optimized to maximize that proxy — which creates a Goodhart's Law problem. Reward hacking (the policy finding completions the RM scores highly but humans would rate poorly) is not a bug; it is the fundamental tension in all RLHF systems. Senior engineers who have run RLHF in production lead with this immediately.
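To make the Goodhart dynamic concrete, a toy sketch (invented for illustration, not the article's example): a proxy reward that loosely tracks helpfulness on ordinary text, and a selection step that maximizes it, landing on a completion a human rater would reject. The scoring rule and candidate strings are assumptions made up for this sketch.

```python
# Purely illustrative toy: a proxy reward that correlates with helpfulness on
# ordinary text is maximized by a degenerate output a human would reject.
def proxy_reward(response: str) -> float:
    """Stand-in 'reward model': favors length and politeness tokens, which
    correlate with good answers on typical data but are easy to game."""
    polite_hits = sum(response.lower().count(w)
                      for w in ("thank", "certainly", "great question"))
    return 0.01 * len(response) + 1.0 * polite_hits

candidates = [
    "Use a context manager: with open(path) as f: data = f.read()",  # genuinely helpful
    "Great question! Thank you! Certainly! " * 20,                   # gamed: pure politeness filler
]

# The "policy step" that maximizes the proxy picks the degenerate completion,
# even though a human would prefer the first one.
scores = [proxy_reward(c) for c in candidates]
best = max(candidates, key=proxy_reward)
print(scores)      # roughly [0.6, 67.8] -> the proxy strongly prefers the filler
print(best[:40])
```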
