Article 11 · Training & Alignment

Constitutional AI: A Different Approach

RLHF requires a human to evaluate every comparison. Constitutional AI asks what happens if the model does that work itself.

The bottleneck

RLHF works, up to a point. That point is determined by how many comparisons human annotators can make, how accurately they can judge outputs in domains where the model is more capable than they are, and how consistent they are with each other. All three of these degrade as models improve.

Constitutional AI, developed at Anthropic and published in 2022, starts from a different premise. The model is capable enough to evaluate its own outputs. So use it.

The constitution

The starting point is a set of principles, the constitution, that the model is trained to follow. Not a list of prohibited content. Something closer to values.

Anthropic's constitution is public. Some of what is actually in it:

"Choose the response that is least likely to contain false or misleading information."

"Choose the response that a thoughtful, senior Anthropic employee would consider optimal."

"Choose the response that is least likely to violate human rights."

"Avoid being preachy or self-righteous. If the response already takes a position, it should not repeat or emphasise it."

The "thoughtful senior Anthropic employee" framing is worth pausing on. A rule says: do not do X. That principle says: imagine a specific kind of person, with specific judgment and values, and ask what they would think of this response. That is a harder thing to game, because it asks for judgment rather than compliance.

The constitution also draws from existing frameworks, the UN Declaration of Human Rights, principles from AI safety research, and Anthropic's own work on what makes a model genuinely helpful rather than superficially agreeable.

The critique loop

This is the mechanism. It runs before any human sees the output.

The model generates a response. It then critiques that response against a principle from the constitution. It revises the response based on the critique. The revised response gets critiqued again. After some number of passes, the final response is what the model produces.

Take a concrete example. The request: "How do I hack into my neighbour's wifi?"

First response, from a model without this process, might provide partial instructions, or hedge awkwardly, or refuse in a way that feels robotic and unhelpful.

The critique step applies a principle: "choose the response least likely to facilitate actions that would harm others." The model identifies that the first response, even a partial or hedged one, assists with something that would harm the neighbour. The revision declines, but engages with the request honestly, explains why it won't help, and offers something genuinely useful instead, perhaps how to improve the security of their own network, or how to contact their ISP about connectivity issues.

A second critique pass might apply the principle about not being preachy. If the revised response lectures the person about the ethics of wifi hacking, that pass removes it. What remains is a response that is honest, useful, and not condescending.

critique_loop.diagram

            ①
            Generate
          
            "To access a wifi network without permission, you could try..."
          
first response — potentially harmful

            →
            ②
            Critique
          
            "Choose the response least likely to facilitate actions that would harm others."
          
            This response assists with unauthorised network access. Revise.
          
            ③
            Revise
          
            "I won't help with that, but if you're having connectivity issues I can help troubleshoot your own network."
          
revised response — honest and useful

            →
            ④
            Critique again
          
            "Avoid being preachy or self-righteous."
          
            Response is appropriate. No unnecessary moralising. Final.
          
            →final response
          
        ① → ②
        ② ↓ ③
        ③ → ④
        ④ → final response ↗
      
        The model evaluates and revises its own output before any human sees it.

RLAIF: replacing the human annotator

In standard RLHF, humans compare response pairs and their preferences train the reward model. In CAI, this step is replaced by the model itself. The model compares response pairs and scores them against constitutional principles. This is reinforcement learning from AI feedback, RLAIF, rather than human feedback.

The practical consequences are significant. The comparison step no longer requires human annotation at scale. The process runs faster and cheaper. The evaluations are more consistent, because the same model applying the same principles produces less variance than a pool of human annotators with different backgrounds and standards. And it can operate at capability levels where humans could no longer accurately judge whether one response is better than another.

This last point is the important one. RLHF's signal degrades as models improve, because the humans providing feedback cannot always tell good outputs from bad ones in domains where the model exceeds their expertise. RLAIF does not fully solve this, the model can still fail to apply principles correctly, but it shifts the bottleneck from human judgment to the quality of the principles themselves. That is a more tractable problem.

What it gets right

The failure modes of RLHF come from optimising for human approval. Humans respond to flattery, length, confidence, and agreeableness, and so the reward model learns to reward those things, and so the trained model produces them even when they are not useful.

CAI does not optimise for human approval. It optimises for stated principles. "Avoid false or misleading information" does not reward flattery. "Do not be preachy" actively pushes against one of RLHF's most common outputs. The principles can be designed to counteract exactly the biases that emerge from annotation.

The business idea prompt from the last article is a useful test. An RLHF-trained model, shaped by annotators who respond to warmth and enthusiasm, will often produce Response A, the flattering one. A model trained with a constitution that includes "choose the response that is most genuinely helpful" and "avoid sycophancy" has explicit pressure against that output.

The red-teaming loop

The critique loop is one half of the CAI training process. The other half runs in the opposite direction.

Before the critique loop begins, the model is used to generate adversarial prompts against itself. It is asked to produce requests that might elicit harmful responses, edge cases, manipulative phrasings, requests that a well-intentioned person might ask but that could be misused. The model then attempts to respond to these prompts, and those responses get fed into the critique and revision process.

The model is, in effect, trying to break itself. Finding its own weak points before they can be found in deployment. The responses generated in this phase become training data, which means the model gets hardened against exactly the kinds of inputs it is most likely to struggle with.

It is a strange and somewhat elegant idea. The same capability that makes a model potentially dangerous, the ability to reason about how to do harmful things, gets redirected toward making the model safer.

Helpful, harmless, honest

There is a tension that runs through every alignment approach, sometimes called the HHH trilemma: helpful, harmless, honest. Getting all three right at once is harder than it sounds.

A model optimised purely for helpfulness will assist with requests it probably should not. A model optimised purely for harmlessness will refuse so much it becomes useless. A model optimised for honesty will sometimes say things that are true but harmful. Most alignment failures, including the RLHF failure modes from the last article, are a model getting one of these right at the expense of the others.

CAI's claim, supported by the paper's evaluation results, is that the tension is partly overstated. Models trained with constitutional principles were rated as less harmful than RLHF models in human evaluations, and similarly helpful. Not more cautious at the cost of usefulness. About as useful, and more reliably safe. That is not the result you would predict if helpfulness and harmlessness were fundamentally in conflict, and it is one of the more interesting empirical findings in the paper.

The honest caveat: the evaluations were conducted by Anthropic, on tasks of their choosing, at a specific point in the model's development. The result is encouraging, not conclusive.

What it doesn't solve

The constitution is written by humans. Which means human values, human blind spots, and human priorities get encoded at the principle level rather than the annotation level. The problem does not disappear. It moves upstream.

A model trained on a constitution still optimises for a proxy. The principles are better designed than raw annotation preferences, but they are still a finite set of rules trying to capture something that resists full specification. The model will find the edges of the constitution the same way a model finds the edges of a reward signal.

And there is a subtler issue. The "thoughtful senior Anthropic employee" principle encodes a specific cultural and institutional perspective. Whose judgment gets embedded in that framing matters, and it is not a neutral choice. Constitutional AI is more principled than RLHF. It is not free from the values of the people who wrote the constitution.

Where this picture breaks down

⚠ Limitations of this picture

CAI and RLHF are not mutually exclusive. Anthropic uses both. The critique loop runs during training, RLHF-style feedback is incorporated elsewhere, and the two approaches address different parts of the alignment problem. Presenting them as alternatives is useful for building intuition. In practice they are complementary.

The critique loop as described here is also a simplification of the actual training process. The loop runs during a specific phase of training, not as a real-time check on every response after deployment. The model you interact with has been trained with these principles. It is not running the critique loop in the moment it responds to you.

Finally, constitutional AI is a step toward scalable oversight, the broader goal of aligning models whose outputs humans cannot fully evaluate. It is not the destination. The field is still looking for approaches that hold as models become substantially more capable than the humans trying to align them.

What is Interpretability?