RLHF & DPO: Alignment Techniques.
Teaching models not just what to say, but what humans prefer.
After this lesson you'll know:
- How RLHF works: reward models, PPO, and the full training pipeline
- DPO: Direct Preference Optimization as a simpler alternative to RLHF
- How to collect and format preference data
- When to use alignment techniques vs standard supervised fine-tuning
The Alignment Problem
Supervised fine-tuning (SFT) teaches a model what correct outputs look like. But many tasks do not have a single correct answer -- they have better and worse answers along subjective dimensions: helpfulness, safety, tone, completeness. Alignment techniques teach models to prefer better outputs over worse ones, based on human judgment.

```
SFT training data:
Input: "Explain quantum computing"
Output: [one correct answer]

Preference training data:
Input: "Explain quantum computing"
Chosen: [clear, accurate, well-structured explanation]
Rejected: [technically correct but confusing, overly verbose]
```

The model learns not just to generate plausible outputs, but to generate outputs that humans would prefer. This is the difference between a model that can answer and a model that answers well.

Two dominant approaches exist: RLHF (the original technique used by OpenAI for ChatGPT) and DPO (a simpler alternative that achieves similar results without reinforcement learning).
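Before looking at either technique, it helps to see preference data in a concrete on-disk form. The sketch below stores preference pairs as JSON Lines; the field names `prompt`, `chosen`, and `rejected`, the file name, and the example texts are illustrative assumptions rather than a required schema.

```python
# Minimal sketch of a preference dataset on disk (assumed JSONL layout).
# The keys "prompt", "chosen", "rejected" follow a common convention,
# not a requirement of any specific trainer.

import json

preference_pairs = [
    {
        "prompt": "Explain quantum computing",
        "chosen": "Quantum computers use qubits, which can represent 0 and 1 "
                  "at the same time, letting certain algorithms explore many "
                  "states in parallel...",
        "rejected": "Quantum computing is computing with quantum. It uses "
                    "quantum mechanics and is very complex and hard to "
                    "explain simply...",
    },
]

# One JSON object per line is the usual format for preference data.
with open("preferences.jsonl", "w") as f:
    for pair in preference_pairs:
        f.write(json.dumps(pair) + "\n")
```

Each record pairs one prompt with a preferred and a dispreferred response; both the reward model stage of RLHF and DPO consume data in essentially this shape.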
Alignment training happens AFTER supervised fine-tuning. The typical pipeline is: Pre-training -> SFT (teach the task) -> Alignment (refine preferences). Skipping SFT and going straight to alignment produces poor results because the model needs basic task competence before it can learn preferences.
RLHF: The Full Pipeline
Reinforcement Learning from Human Feedback is a three-stage process:

**Stage 1 - Supervised Fine-Tuning (SFT):** Standard fine-tuning on high-quality examples (covered in Lessons 2-5). This produces a model that can perform the task.

**Stage 2 - Reward Model Training:** Train a separate model to predict human preferences. This model takes a prompt + response and outputs a scalar score.

```
Training data format:
Prompt: "Write a product description for wireless earbuds"
Response A: [detailed, engaging, accurate] → Score: 4.5
Response B: [brief, bland, has errors] → Score: 2.1

The reward model learns to assign higher scores to responses
that humans would prefer.
```

**Stage 3 - RL Optimization (PPO):** Use the reward model as the objective function. Generate responses from the SFT model, score them with the reward model, and update the SFT model to maximize the reward score using Proximal Policy Optimization.

```
PPO training loop:
1. Sample prompt from dataset
2. Generate response from current policy (SFT model)
3. Score response with reward model
4. Compute KL divergence penalty (prevent drift from SFT model)
5. Update policy using PPO to maximize: reward - beta * KL
6. Repeat
```

**RLHF complexity breakdown:**

```
Components needed:
- SFT model (your fine-tuned model)
- Reward model (separate model, often same size as SFT)
- Reference model (frozen copy of SFT, for KL penalty)
- PPO optimizer

VRAM required: 3-4x a single model (SFT + reward + reference)
Training stability: Fragile. Hyperparameters are sensitive.
Implementation difficulty: High. Many moving parts.
```
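To make Stages 2 and 3 concrete, here is a minimal PyTorch-style sketch of the two objectives involved: the standard pairwise ranking loss commonly used to train reward models, and the KL-penalized reward from step 5 of the PPO loop. The function names, the per-token log-ratio estimate of the KL term, and the `beta` default are illustrative assumptions, not this lesson's reference implementation.

```python
# Sketch of the two objectives behind Stages 2 and 3 (assumed PyTorch code,
# not a full RLHF training loop).

import torch
import torch.nn.functional as F


def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for the reward model (Stage 2).

    score_chosen / score_rejected are the scalar scores the reward model
    assigns to the preferred and dispreferred responses for the same prompt.
    Minimizing -log sigmoid(s_chosen - s_rejected) pushes chosen above rejected.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()


def kl_penalized_reward(reward: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_reference: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Objective PPO maximizes in Stage 3: reward - beta * KL.

    logprobs_* are per-token log-probabilities of the generated response under
    the current policy and the frozen reference (SFT) model; summing the
    log-ratio over tokens gives a simple per-sequence KL estimate.
    """
    kl = (logprobs_policy - logprobs_reference).sum(dim=-1)
    return reward - beta * kl
```

In practice you rarely hand-roll this loop; libraries such as Hugging Face TRL bundle the policy, reference model, reward model, and PPO optimizer, so the sketch is only meant to show what the optimizer is actually maximizing.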
The practical reality: Full RLHF is complex, expensive, and unstable. This is why DPO was invented. Unless you are building a general-purpose assistant at scale, DPO is almost always the better choice.