Machine Learning

RLHF (Reinforcement Learning from Human Feedback)

A technique for aligning AI models with human preferences by training a reward model on human preference judgments and then using reinforcement learning to optimize the model against that reward signal. Widely used to make language models more helpful, harmless, and honest after initial pre-training.

Why It Matters

RLHF is how modern language models go from 'predicts the next word' to 'follows instructions helpfully and safely.' Understanding RLHF helps governance professionals evaluate claims about AI safety, understand model behavior, and assess alignment approaches.

Example

During Claude's training, human evaluators compare pairs of model responses and indicate which is better. A reward model learns from these preferences, and reinforcement learning optimizes Claude to produce responses that score higher on helpfulness and safety — shaping the model's behavior beyond what pre-training alone could achieve.
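The reward-model step described above is commonly trained with a pairwise (Bradley-Terry) objective: the model is penalized whenever it scores the rejected response higher than the one the human preferred. A minimal sketch in plain Python, with hypothetical scalar reward scores standing in for a real model's outputs (this illustrates the general technique, not Anthropic's specific implementation):

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for one human comparison.

    The loss is -log(sigmoid(margin)), where the margin is the gap
    between the score of the human-preferred response and the score
    of the rejected one. A large positive margin (reward model agrees
    with the human) gives a loss near 0; a negative margin gives a
    large loss, pushing training to flip the ranking.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores: the reward model already ranks the preferred
# response higher, so the loss is small.
print(round(pairwise_preference_loss(2.0, -1.0), 4))

# When the model can't tell the two responses apart (margin 0), the
# loss is ln(2) ~ 0.6931.
print(round(pairwise_preference_loss(0.0, 0.0), 4))
```

In practice each reward score comes from a neural network reading the full prompt and response, and the loss is averaged over many thousands of human comparisons; the resulting reward model then supplies the training signal for the reinforcement-learning stage.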

Think of it like...

RLHF is like a chef adjusting recipes based on food critic reviews — the base cooking skills (pre-training) are there, but human feedback fine-tunes the dishes to match what people actually want to eat.

Related Terms