{"title":"Understanding RLHF","description":"","content":"# Understanding RLHF\n\nA base language model trained on internet text is not helpful. Ask it a question, and it might give you the answer, start a quiz, or continue rambling. It has no concept of \"being an assistant.\"\n\nYet ChatGPT, Claude, and Gemini feel genuinely helpful. They follow instructions, refuse harmful requests, and admit when they don't know something. How did we get from text predictor to AI assistant?\n\nThe answer is RLHF: Reinforcement Learning from Human Feedback.\n\nRLHF is the technique that aligns language models with human preferences. Instead of just predicting the next word, the model learns what makes a response good, helpful, and safe. It's the difference between a model that can write anything and one that wants to help you.\n\nIn this article, we'll break down exactly how RLHF works, why it matters, and what alternatives have emerged.\n\n---\n\n## Why Do We Need RLHF?\n\nBefore diving into how RLHF works, let's understand why we need it.\n\n### The Limitations of Supervised Learning\n\nTraining an LLM happens in stages. First, pre-training teaches the model language by predicting the next token on trillions of words. This creates a \"base model\" that understands language but has no idea how to be helpful.\n\nNext comes Supervised Fine-Tuning (SFT). 
You train the model on examples of ideal assistant behavior:\n\n\n```shell\nUser: What is the capital of France?\nAssistant: The capital of France is Paris.\n```\n\n\nSFT works well for teaching format and basic behavior. But it has fundamental limitations:\n\n**Problem 1: Only one example per prompt**\n\nFor any instruction, there are many valid responses. SFT picks one. But what if a better response exists that no human thought to write?\n\n\n```shell\nPrompt: \"Explain quantum computing\"\n\nSFT sees one response. But there might be:\n- A shorter, clearer explanation\n- A better analogy\n- A more engaging opening\n```\n\n\nThe model learns to copy, not to improve.\n\n**Problem 2: Preferences are hard to articulate**\n\n\"Be helpful\" is easy to say. Writing perfect training examples that embody helpfulness is much harder. Humans know a good response when they see it, but putting that into words for every possible situation is impractical.\n\n**Problem 3: No gradation**\n\nSFT treats every training example as equally correct. But responses exist on a spectrum. Some are excellent, some are good, and some are acceptable. SFT cannot capture these nuances.\n\n**Problem 4: The model becomes overconfident**\n\nSFT trains the model to always produce an answer. It learns that responses should sound confident and complete. This leads to confident-sounding nonsense when the model should admit uncertainty.\n\n---\n\n## What is RLHF?\n\nRLHF (Reinforcement Learning from Human Feedback) teaches a model what makes responses good or bad by learning from human preferences.\n\n**The Simple Definition:**\n\n\n> **INFO**\n>\n> Train the model to generate responses that humans prefer over alternatives.\n\n\nInstead of mimicking examples, the model learns to optimize for human preference. It can discover better responses than any in the training data.\n\n**The Core Insight:**\n\nComparing two responses is easier than writing a perfect one. 
Humans can quickly say \"Response A is better than Response B\" even when they couldn't write either from scratch.\n\nRLHF exploits this. Collect comparisons, train a model to predict preferences, then optimize the LLM to generate responses the preference model likes.\n\n---\n\n## The Three Steps of RLHF\n\nRLHF consists of three distinct steps:\n\n\n```shell\n+---------------------+ +---------------------+ +---------------------+\n| Step 1 | | Step 2 | | Step 3 |\n| Collect Human | --> | Train Reward | --> | Optimize LLM |\n| Comparisons | | Model | | with RL |\n+---------------------+ +---------------------+ +---------------------+\n | | |\n v v v\n Preference Data Reward Model Aligned LLM\n (A > B rankings) (scores responses) (optimized for reward)\n```\n\n\nLet's examine each step in detail.\n\n---\n\n## Step 1: Collect Human Comparisons\n\nThe first step is gathering human preference data.\n\n### How It Works\n\n1. Take a prompt from your dataset\n2. Generate multiple responses using the SFT model\n3. Show the responses to human labelers\n4. Have labelers rank the responses from best to worst\n\n\n```shell\nPrompt: \"Why is the sky blue?\"\n\nResponse A:\n\"The sky is blue because of a phenomenon called Rayleigh scattering.\nWhen sunlight enters Earth's atmosphere, it collides with gas molecules.\nBlue light has a shorter wavelength, so it scatters more than other colors,\nmaking the sky appear blue.\"\n\nResponse B:\n\"Blue light scatters more than other colors in the atmosphere.\nThat's why we see a blue sky.\"\n\nResponse C:\n\"It's just how it is. The sky has always been blue. 
Scientists\nhave many theories but nobody really knows for sure.\"\n\nHuman ranking: A > B > C\n```\n\n\n### What Labelers Evaluate\n\nLabelers typically assess responses on multiple criteria:\n\n**Helpfulness:** Does the response actually answer the question?\n\n**Accuracy:** Is the information correct?\n\n**Harmlessness:** Does it avoid dangerous or unethical content?\n\n**Honesty:** Does it acknowledge uncertainty appropriately?\n\n**Clarity:** Is it well-written and easy to understand?\n\n### The Labeling Challenge\n\nGetting good preference data is expensive and difficult:\n\n- Labelers need training on evaluation criteria\n- Inter-annotator agreement varies (people disagree)\n- Some comparisons are subjective\n- Scale is limited by human bandwidth\n\nCompanies like OpenAI and Anthropic employ large teams of labelers and develop detailed guidelines. The quality of RLHF depends heavily on the quality of this human feedback.\n\n---\n\n## Step 2: Train the Reward Model\n\nHuman labelers cannot evaluate every response at runtime. So we train a neural network to predict what humans would prefer.\n\n### What is a Reward Model?\n\nA reward model takes a (prompt, response) pair and outputs a scalar score. Higher scores mean better responses.\n\n\n```shell\n+----------+ +--------------+ +-------+\n| Prompt | --> | Reward | --> | Score |\n| Response | | Model | | 0.87 |\n+----------+ +--------------+ +-------+\n```\n\n\nThe reward model is typically initialized from the SFT model. The same architecture, but with a different head that outputs a single number instead of token probabilities.\n\n### Training the Reward Model\n\nThe reward model learns from pairwise comparisons. 
Given a prompt and two responses where humans preferred response A over response B, the model learns to assign a higher score to A.\n\nThe loss function:\n\n\n```shell\nLoss = -log(sigmoid(r(A) - r(B)))\n\nWhere:\n r(A) = reward model score for preferred response\n r(B) = reward model score for rejected response\n```\n\n\nThis is the negative log-likelihood under the Bradley-Terry model, which treats sigmoid(r(A) - r(B)) as the probability that A is preferred over B. Minimizing it pushes the model to score preferred responses higher than rejected ones.\n\n\n```shell\nTraining Example:\n Prompt: \"Explain photosynthesis\"\n Response A (preferred): \"Photosynthesis is how plants convert...\"\n Response B (rejected): \"Plants make food from sunlight somehow...\"\n\nBefore training:\n r(A) = 0.52\n r(B) = 0.48\n\nAfter training:\n r(A) = 0.81\n r(B) = 0.34\n```\n\n\n### The Reward Model Architecture\n\n\n```shell\n+-----------+ +----------------+ +-------------+ +-------+\n| Prompt | --> | Transformer | --> | Final Token | --> | Score |\n| Response | | (from SFT) | | Embedding | | |\n+-----------+ +----------------+ +-------------+ +-------+\n |\n +------+------+\n | Linear Head |\n | (hidden->1) |\n +-------------+\n```\n\n\nThe reward model processes the full prompt and response through a transformer. It takes the embedding of the final token and passes it through a linear layer to produce a single score.\n\n---\n\n## Step 3: Optimize the LLM with Reinforcement Learning\n\nNow we use the reward model to improve the LLM itself.\n\n### The Optimization Loop\n\n\n```shell\n+--------+ Generate +----------+ Score +---------+\n| LLM | --------------> | Response | ------------> | Reward |\n| (SFT) | | | | Model |\n+--------+ +----------+ +---------+\n ^ |\n | |\n | Update LLM to increase reward |\n +------------------------------------------------------+\n```\n\n\nFor each prompt:\n\n1. The LLM generates a response\n2. The reward model scores the response\n3. The LLM is updated to generate higher-scoring responses\n\nThis is where reinforcement learning comes in. 
The LLM is the \"policy,\" the reward model provides the \"reward signal,\" and we optimize using RL algorithms.\n\n### PPO: The Standard Algorithm\n\nProximal Policy Optimization (PPO) is the most common algorithm for RLHF. It updates the model to increase expected reward while preventing drastic changes that could destabilize training.\n\nThe PPO objective:\n\n\n```shell\nMaximize: E[min(r * A, clip(r, 1-e, 1+e) * A)]\n\nWhere:\n r = probability ratio (new policy / old policy)\n A = advantage (how much better than average)\n e = clipping parameter (typically 0.2)\n```\n\n\nThe clipping prevents the model from changing too much in a single update. Large changes can cause training instability.\n\n### The KL Penalty: Preventing Reward Hacking\n\nA critical component of RLHF is the KL divergence penalty. This keeps the model close to its SFT starting point.\n\n\n```shell\nTotal Reward = Reward Model Score - beta * KL(new_model || sft_model)\n```\n\n\nWhy is this necessary? Without it, the model finds \"reward hacks,\" responses that score high with the reward model but are actually bad:\n\n\n```shell\nReward hacking examples:\n\n1. Excessive flattery\n \"What an incredibly brilliant question! You must be so intelligent...\"\n\n2. Verbose padding\n Adding unnecessary length because longer responses scored slightly higher\n\n3. Keyword stuffing\n Repeating terms the reward model associated with good responses\n\n4. Sycophancy\n Always agreeing with the user, even when they're wrong\n```\n\n\nThe KL penalty says: \"You can improve on the SFT model, but don't stray too far.\" This balances optimization with safety.\n\n### The Full RLHF Training Loop\n\n\n```shell\nfor each batch of prompts:\n 1. Generate responses using current LLM policy\n 2. Score responses with reward model\n 3. Compute KL divergence from SFT model\n 4. Calculate total reward = RM score - beta * KL\n 5. Compute PPO gradients\n 6. Update LLM weights\n 7. 
Repeat\n```\n\n\nTraining typically runs for a few hundred to a few thousand steps, with careful monitoring for reward hacking and quality degradation.\n\n---\n\n## RLHF in Practice: The Full Pipeline\n\nHere's how the complete RLHF pipeline fits together:\n\n\n```shell\n+===========================================================+\n| RLHF PIPELINE |\n+===========================================================+\n| |\n| +------------------------------------------------------+ |\n| | 1. SUPERVISED FINE-TUNING | |\n| +------------------------------------------------------+ |\n| | - Start with pre-trained base model | |\n| | - Train on (instruction, response) pairs | |\n| | - Output: SFT Model (follows instructions) | |\n| +------------------------------------------------------+ |\n| | |\n| v |\n| +------------------------------------------------------+ |\n| | 2. PREFERENCE DATA COLLECTION | |\n| +------------------------------------------------------+ |\n| | - Generate multiple responses per prompt | |\n| | - Human labelers rank responses | |\n| | - Output: (prompt, chosen, rejected) triplets | |\n| +------------------------------------------------------+ |\n| | |\n| v |\n| +------------------------------------------------------+ |\n| | 3. REWARD MODEL TRAINING | |\n| +------------------------------------------------------+ |\n| | - Initialize from SFT model | |\n| | - Train to predict human preferences | |\n| | - Output: Reward Model (scores responses) | |\n| +------------------------------------------------------+ |\n| | |\n| v |\n| +------------------------------------------------------+ |\n| | 4. 
RL OPTIMIZATION (PPO) | |\n| +------------------------------------------------------+ |\n| | - Generate responses with current policy | |\n| | - Score with reward model | |\n| | - Update model to maximize reward minus KL penalty | |\n| | - Output: RLHF-aligned Model | |\n| +------------------------------------------------------+ |\n| |\n+============================================================+\n```\n\n\n---\n\n## The Challenges of RLHF\n\nRLHF works, but it comes with significant challenges.\n\n### 1. Reward Model Errors\n\nThe reward model is an imperfect approximation of human preferences. When the LLM optimizes against it, errors compound.\n\n\n```shell\nReward Model might incorrectly learn:\n - Longer responses are better (sometimes true, often not)\n - Confident-sounding responses are better (encourages hallucination)\n - Certain phrases correlate with good scores (exploitable patterns)\n```\n\n\nThe LLM will exploit any systematic errors in the reward model.\n\n### 2. Human Labeler Disagreement\n\nDifferent humans have different preferences. A response one labeler ranks first, another might rank third.\n\n\n```shell\nPrompt: \"Is it okay to lie sometimes?\"\n\nLabeler 1: Prefers the nuanced philosophical response\nLabeler 2: Prefers the clear \"honesty is best\" stance\nLabeler 3: Prefers the response that considers context\n\nWho is right?\n```\n\n\nAggregating inconsistent preferences into a single reward model loses nuance.\n\n### 3. Distribution Shift\n\nThe reward model was trained on responses from the SFT model. During RL optimization, the LLM generates different responses. The reward model may not generalize well to these new responses.\n\n\n```shell\nTraining: Reward model sees SFT-style responses\nOptimization: LLM starts generating novel responses\nProblem: Reward model scores are unreliable for novel responses\n```\n\n\n### 4. 
Computational Cost\n\nRLHF requires:\n\n- Running the LLM to generate responses\n- Running the reward model to score them\n- Running the SFT model to compute the KL penalty\n- PPO optimization (multiple forward and backward passes)\n\nAs a rough guide, this makes RLHF about 3-4x more expensive than standard fine-tuning.\n\n### 5. Mode Collapse\n\nThe LLM may converge to a narrow set of \"safe\" responses that score well, losing diversity.\n\n\n```shell\nBefore RLHF: Many creative ways to answer\nAfter RLHF: Every response sounds the same, hedging and over-qualifying\n```\n\n\n---\n\n## Alternatives to RLHF\n\nGiven these challenges, researchers have developed alternatives.\n\n### DPO: Direct Preference Optimization\n\nDPO skips the reward model entirely. It reformulates the RLHF objective to optimize directly on preference data.\n\n\n```shell\nTraditional RLHF:\n 1. Preference data -> Train reward model\n 2. Reward model -> RL optimization\n\nDPO:\n 1. Preference data -> Direct LLM optimization\n```\n\n\nThe DPO loss:\n\n\n```shell\nLoss = -log(sigmoid(beta * (log(pi(y_w|x)/ref(y_w|x)) - log(pi(y_l|x)/ref(y_l|x)))))\n\nWhere:\n x = prompt\n y_w = preferred response\n y_l = rejected response\n pi = current model\n ref = reference (SFT) model\n```\n\n\n**Advantages of DPO:**\n\n- No reward model to train\n- No RL optimization (just supervised learning)\n- More stable training\n- Lower computational cost\n\n**Disadvantages:**\n\n- Requires preference pairs (not rankings)\n- Less flexible than a learned reward model\n- May not explore as effectively\n\nDPO has become popular for its simplicity. 
Many recent models use DPO or variants instead of PPO-based RLHF.\n\n### RLAIF: RL from AI Feedback\n\nInstead of human labelers, use a strong AI model to provide preferences.\n\n\n```shell\nTraditional RLHF:\n Generate responses -> Humans rank them -> Train reward model\n\nRLAIF:\n Generate responses -> AI model ranks them -> Train reward model\n```\n\n\nA large model like GPT-4 or Claude evaluates responses based on criteria like helpfulness and harmlessness. This scales much better than human labeling.\n\n**The Constitutional AI approach:**\n\n1. Define principles (a \"constitution\")\n2. AI model evaluates responses against principles\n3. Use these evaluations for RLHF\n\n\n```shell\nConstitution example:\n\"Prefer responses that are honest, helpful, and harmless.\nAvoid responses that could help with illegal activities.\nPrefer responses that acknowledge uncertainty.\"\n```\n\n\n### Other Approaches\n\n**RAFT (Reward rAnked Fine-Tuning):** Rank responses by reward, fine-tune on the best ones. Simpler than RL.\n\n**IPO (Identity Preference Optimization):** A variant of DPO that avoids some of its failure modes.\n\n**KTO (Kahneman-Tversky Optimization):** Uses individual binary feedback (good/bad) instead of pairwise comparisons.\n\n---\n\n## When Does RLHF Matter Most?\n\nRLHF has the biggest impact on:\n\n**1. Safety and Refusals**\n\nThe base model will happily generate harmful content. RLHF teaches it to refuse.\n\n\n```shell\nBefore RLHF: \"Here's how to make explosives...\"\nAfter RLHF: \"I can't help with that request...\"\n```\n\n\n**2. Instruction Following**\n\nRLHF improves the model's ability to follow complex, nuanced instructions.\n\n**3. Tone and Style**\n\nRLHF shapes how the model \"sounds.\" Helpful, conversational, appropriately confident.\n\n**4. Handling Edge Cases**\n\nWhen should the model say \"I don't know\"? When should it ask for clarification? RLHF encodes these nuanced behaviors.\n\n---\n\n## Key Takeaways\n\n1. 
**RLHF aligns LLMs with human preferences** by learning what makes responses good, not just copying examples.\n\n1. **Three steps:** Collect human comparisons, train a reward model, optimize the LLM with RL.\n\n1. **The reward model** learns to predict human preferences from pairwise comparisons.\n\n1. **PPO optimization** improves the LLM while the KL penalty prevents reward hacking.\n\n1. **Challenges include:** reward model errors, labeler disagreement, computational cost, and mode collapse.\n\n1. **Alternatives like DPO** skip the reward model entirely, optimizing directly on preference data.\n\n1. **RLHF is why modern chatbots feel helpful** rather than like random text generators.\n\n---\n\n## The Evolution of Alignment\n\n\n```shell\nPre-training\n |\n v\nBase Model (predicts text, not helpful)\n |\n v\nSupervised Fine-Tuning\n |\n v\nSFT Model (follows instructions, but limited)\n |\n v\nRLHF / DPO\n |\n v\nAligned Model (helpful, harmless, honest)\n```\n\n\nRLHF represents a fundamental shift in how we train AI systems. Instead of telling the model exactly what to do (supervised learning), we tell it what we prefer and let it figure out how to satisfy those preferences.\n\nThis approach has limitations. It depends on the quality of human feedback, the accuracy of the reward model, and careful tuning to avoid reward hacking. 
But for now, it's the most effective technique we have for turning language models into genuinely helpful assistants.\n\n---\n\n## Further Reading\n\n- Training Language Models to Follow Instructions with Human Feedback (InstructGPT) - The paper that applied RLHF to instruction-following LLMs\n- Direct Preference Optimization (DPO) - The simpler alternative to PPO\n- Constitutional AI - Anthropic's approach using AI feedback\n- Proximal Policy Optimization Algorithms - The RL algorithm behind RLHF\n- RLHF: Reinforcement Learning from Human Feedback (Hugging Face) - Practical implementation guide","pageType":"ai-engineering"}


Ashish Pratap Singh
