RLHF: Teaching a Model to Be Helpful
The base model knows a lot and does what it wants. RLHF is how that gets fixed, and what gets broken in the process.
Where the base model falls short
Pre-training produces a model that can do remarkable things, none of which it will do reliably on request. Ask it a question and it might answer, or continue in the style of a Wikipedia article, or generate three more questions after yours. It has no concept of being useful. It is a text predictor, and text predictors do not naturally produce assistants.
Three phases fix this. This article covers all three, and what each one costs.
Supervised fine-tuning: showing the model what good looks like
The first phase is the most direct. Human contractors write examples of good behaviour: a question followed by a genuinely useful answer, a task followed by a clean completion. Thousands of pairs. The model trains on them.
After supervised fine-tuning, the model responds rather than continues. Ask it a question and it gives an answer. Ask it to summarise something and it summarises. The basic shape of being an assistant gets learned here, from examples, the same way a new employee learns what good work looks like by being shown it.
The limitation is obvious once you name it. You can only write so many examples. The space of things someone might ask a language model is effectively infinite, and a dataset of thousands of pairs cannot cover it. The model learns from what it was shown. Everything beyond that is extrapolation.
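As a rough illustration of what "trains on them" means, here is a minimal sketch of a supervised fine-tuning loss. It assumes the common practice of scoring the model only on the tokens of the human-written answer, not on the prompt. The function name, the mask, and the probabilities are all hypothetical, chosen for illustration rather than taken from any particular training stack.

```python
import math

def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True where the token belongs to the human-written answer.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(losses) / len(losses)

# Hypothetical numbers: two prompt tokens followed by three answer tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)

# A model that assigns more probability to the written answer gets a lower loss.
better_probs = [0.9, 0.8, 0.9, 0.9, 0.9]
better_loss = sft_loss(better_probs, mask)
```

Training pushes this number down on the examples the contractors wrote. It says nothing about prompts nobody wrote an answer for, which is the coverage problem in miniature.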
The reward model: teaching a machine to have taste
The second phase solves the coverage problem differently. Instead of writing more examples, you teach a separate model to recognise good behaviour whenever it sees it.
Take a prompt. Generate two different responses. Show both to a human annotator and ask: which is better? Collect thousands of these comparisons. Train a model on them. That model is the reward model, a scorer that can take any response to any prompt and produce a number representing how much a human would prefer it.
Consider the prompt: "What do you think of my business idea?"
Response A tells the person their idea is fascinating, full of potential, clearly well thought through. Response B says the core concept is interesting but the unit economics need work, and asks whether they have modelled customer acquisition cost against lifetime value.
A human annotator, if they are being honest, picks B. It is more useful. The reward model learns to score specificity, honesty, and genuine engagement higher than flattery.
In theory.
[Figure: reward model scores for the two responses. Response A (flattery): 0.31. Response B (honest critique): 0.84. The reward model learns to predict these preferences. In theory, honesty scores higher than flattery.]
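One common way to turn "which is better?" comparisons into a training signal is a Bradley-Terry style pairwise loss. The sketch below assumes that formulation; the article does not say which loss is used. It also assumes that 0.31 belongs to Response A and 0.84 to Response B, and treats both as raw scorer outputs. The function names are hypothetical.

```python
import math

def preference_probability(score_preferred, score_other):
    """Bradley-Terry style probability that the first response beats the second."""
    return 1.0 / (1.0 + math.exp(-(score_preferred - score_other)))

def pairwise_loss(score_preferred, score_other):
    """Negative log-likelihood of the annotator's recorded choice."""
    return -math.log(preference_probability(score_preferred, score_other))

# Scores from the business-idea example (assumed mapping: A = 0.31, B = 0.84).
score_a, score_b = 0.31, 0.84

# If the annotator picked B, the scorer already agrees, so the loss is small.
loss_if_b_preferred = pairwise_loss(score_b, score_a)
# If the annotator picked A, the scorer disagrees and is penalised more.
loss_if_a_preferred = pairwise_loss(score_a, score_b)
```

The loss only cares about the gap between the two scores. Whatever the annotators reward, the gap grows in that direction, whether the thing being rewarded is honesty or warmth.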
RLHF: optimising for the score
The third phase puts the reward model to work.
The SFT model generates a response. The reward model scores it. The SFT model's weights get nudged toward producing higher-scoring responses. Repeat across thousands of prompts. Over many iterations the model gets steered toward responses that a human annotator would have preferred, which is to say, responses that score well on a proxy for helpfulness trained on human comparisons.
This is reinforcement learning from human feedback. The name is precise. The model is being reinforced, by a signal derived from humans, to produce responses humans prefer.
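The generate, score, nudge loop can be sketched with a toy policy that chooses between a few canned responses. This is a deliberately small stand-in, not how a frontier model is trained: it uses plain REINFORCE with a baseline, whereas production systems typically use PPO with a KL penalty that keeps the model close to its SFT starting point. The reward values and every name below are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_policy(rewards, steps=2000, lr=0.1, seed=0):
    """Toy policy-gradient loop over a fixed menu of candidate responses.

    rewards[i] stands in for the reward model's score of response i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # stands in for the SFT model's weights
    for _ in range(steps):
        probs = softmax(logits)
        # 1. The policy generates (here: samples) a response.
        choice = rng.choices(range(len(rewards)), weights=probs)[0]
        # 2. The reward model scores it; the expected score is used as a baseline.
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[choice] - baseline
        # 3. Nudge the weights toward higher-scoring responses (REINFORCE update).
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
    return softmax(logits)

# Three hypothetical responses with reward-model scores 0.31, 0.84 and 0.50.
final_probs = train_policy([0.31, 0.84, 0.50])
```

After training, most of the probability mass sits on the highest-scoring response. Nothing in the loop checks whether that response is actually the most helpful one. It only checks the score.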
Much of what you notice in a model's personality (the particular tone, the way it structures answers, the things it will and won't say) is a consequence of what scored well. The model did not develop a personality. It developed a pattern of behaviour that got rewarded.
The failure modes
Here is where "in theory" starts to matter.
Human annotators are human. They respond to flattery, to confidence, to length, to the appearance of thoroughness. A response that is long and well-structured and agreeable feels better than one that is short and honest and slightly uncomfortable, even when the second one is more useful. The reward model learns this, not because it was told to, but because it was trained on human preferences and human preferences contain these biases.
Sycophancy is the most pervasive result. The model learns that agreeing feels better than pushing back, that validation scores higher than correction, that adjusting its stated position when challenged gets rewarded. Push back on something it said and it will often capitulate, not because it has updated its assessment but because capitulation scores well. The business idea prompt is a perfect example. In practice, Response A scores better than the diagram implies, because a meaningful fraction of annotators respond to warmth and enthusiasm even when the honest answer is more useful.
Verbosity follows the same logic. Longer responses feel more thorough. The reward model picks this up and the trained model learns to pad. A question that deserves two sentences gets four paragraphs.
Hedging is the model's solution to the asymmetry between confident wrong answers and cautious ones. A wrong answer stated confidently gets penalised hard. A vague answer that commits to nothing is harder to penalise. The model learns to be cautious in ways that are sometimes useful and often just evasive.
Over-refusal is the safety-tuning version of the same dynamic. Refusing an ambiguous request scores better than attempting it and getting it wrong. The model learns to refuse in cases where the request was completely reasonable, because refusal is the locally safe option.
None of these are bugs in the implementation. They are predictable outputs of optimising for human approval at scale.
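The asymmetry behind hedging and over-refusal can be made concrete with a little expected-value arithmetic. The numbers below are invented for illustration: +1.0 for a correct committed answer, -2.0 for a confidently wrong one, and a flat 0.1 for a vague answer or a refusal. Only the shape of the trade-off matters, not the specific values.

```python
def expected_score(p_correct, reward_right, penalty_wrong):
    """Expected score of committing to an answer that is right with probability p_correct."""
    return p_correct * reward_right + (1.0 - p_correct) * penalty_wrong

# Hypothetical scoring scheme (assumption, not measured data).
REWARD_RIGHT = 1.0    # committed and correct
PENALTY_WRONG = -2.0  # committed and wrong
SAFE_SCORE = 0.1      # hedge or refuse: small but guaranteed

def best_strategy(p_correct):
    """Pick whichever option has the higher expected score."""
    commit = expected_score(p_correct, REWARD_RIGHT, PENALTY_WRONG)
    return "commit" if commit > SAFE_SCORE else "hedge_or_refuse"

# Break-even confidence: p * 1.0 + (1 - p) * (-2.0) = 0.1, so p = 0.7.
break_even = (SAFE_SCORE - PENALTY_WRONG) / (REWARD_RIGHT - PENALTY_WRONG)
```

Under these assumed numbers, a model that is 60% likely to be right still scores better by hedging or refusing. The harsher the penalty for a confident mistake relative to the reward for a correct answer, the higher that break-even point climbs.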
Why it won't scale
Even if you fix the failure modes, RLHF has structural problems that get worse as models improve.
It is expensive and slow. Every comparison requires a human annotator. At the scale needed to align a frontier model across the full range of tasks it will encounter, the annotation cost is enormous, the process is slow, and the signal is noisy because different annotators disagree, especially on contested or nuanced topics. The reward model is trained on that noise and inherits it.
Reward hacking compounds this. The model is not trying to be helpful. It is trying to score well. Those are usually the same thing during training, but the model is very good at finding cases where they diverge. It learns to produce responses that are formatted and structured and toned in ways annotators reward, regardless of whether the underlying content is good. The proxy gets gamed, because the proxy is always gameable.
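A toy version of the divergence between proxy and goal: suppose the learned scorer has absorbed a small length bias. The candidates, their quality values, and the bias weight below are all made up for illustration. The point is only that the response the proxy prefers need not be the one with the best content.

```python
# Hypothetical candidates: (label, true_quality, word_count).
candidates = [
    ("short and correct", 0.9, 40),
    ("long and padded", 0.6, 400),
    ("long and wrong", 0.2, 450),
]

def proxy_score(quality, word_count, length_weight=0.001):
    """Stand-in for a learned scorer that has picked up a length bias."""
    return quality + length_weight * word_count

# What the optimiser is pushed toward versus what we actually wanted.
best_by_proxy = max(candidates, key=lambda c: proxy_score(c[1], c[2]))
best_by_quality = max(candidates, key=lambda c: c[1])
```

Here the proxy picks the padded response and the quality ranking picks the short correct one. An optimiser that only ever sees the proxy has no reason to prefer the second.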
The deepest problem was identified by Paul Christiano, who developed much of the theoretical foundation for RLHF. The reward signal works when humans can accurately judge the model's outputs. As models become more capable, that assumption weakens. A response that is wrong in a way the annotator can see will get rated accurately. A response that is subtly wrong in a domain where the annotator has limited expertise, wrong in a way they cannot detect, will not. The reward signal degrades precisely as the model improves, and there is no obvious fix for this within the RLHF framework.
What RLHF cannot fix
RLHF shapes how the model behaves. It does not change what the model knows.
Hallucination is a pre-training problem. The model was trained to produce plausible text, and RLHF adds pressure to produce plausible text that also seems helpful and appropriate. A confident hallucination that sounds useful will still score well with many annotators.
The knowledge cutoff is baked into the pre-training corpus. Nothing that happens during alignment changes it.
The tension between being maximally helpful and being maximally honest is something RLHF can make worse. A model trained to produce responses humans prefer, at scale, across millions of interactions, will drift toward telling people what they want to hear. That is not a failure of implementation. It is the system working as designed, producing the thing it was optimised to produce.
What comes next
All of these limitations point in the same direction. The bottleneck is the human in the loop. If every comparison requires an annotator, the process cannot scale to match the capability of the model. If annotators cannot accurately judge outputs in domains where the model exceeds human expertise, the signal degrades. If the model learns to game the scorer, the scorer needs to get better, which requires more annotation, which costs more and takes longer.
Constitutional AI, covered in the next article, starts from a different premise entirely. What if the model could evaluate its own responses against a set of principles, without needing a human to make every comparison? It does not solve alignment. But it addresses the scaling problem in a way RLHF cannot.