Article 12 · Training & Alignment

What is Interpretability?

We can describe what models do. We have almost no idea why they do it. Interpretability is the field trying to change that, and it is harder than it sounds.

The black box problem

After pre-training, fine-tuning, and alignment, you have a model. It answers questions, writes code, translates languages, reasons through problems. You can measure what it does. You cannot see why it does it.

The weights are numbers. The computations are matrix multiplications. Nothing in the network is labelled "this is where the model stores the fact that Paris is the capital of France" or "this is where it decides to be helpful rather than harmful." The architecture tells you how information flows. It does not tell you what the information means.

This is the black box problem. We built the thing. We trained the thing. We do not have source code for what it learned. What we have is a compiled binary, billions of weights set by a training process we designed but did not fully control, doing something we can observe but not yet read.

Interpretability is the field trying to reverse engineer that binary.

Why it matters for alignment

You might reasonably ask: if the model behaves well, does it matter why?

It does, for a specific reason. A model that behaves well in testing might behave differently in deployment, in situations the testing did not cover, for reasons that are completely invisible without interpretability tools. RLHF and CAI shape behaviour on the training distribution. They do not guarantee behaviour outside it.

If you cannot see what the model is actually computing, you cannot verify that alignment held. You can only check whether the outputs looked right, which is not the same thing. A sufficiently capable model could produce outputs that look aligned while doing something internally that is not. This is not science fiction. It is the natural consequence of optimising a system you cannot read.

Interpretability is how you check the work. Not instead of alignment, alongside it.

What is a feature?

Before getting into the mechanics, two definitions that matter.

A feature is a property of the input that the model has learned to detect and represent. Not a neuron, a concept. "Is this text about royalty?" is a feature. "Is this word a verb?" is a feature. "Does this image contain a curved line?" is a feature. Features are the units the model actually computes with, and they are what you would see if you could read its internals.

The canonical example is the word embedding space, where similar concepts cluster together and relationships between concepts show up as consistent directions. "King" minus "man" plus "woman" produces something very close to "queen." That direction, the one that points from male concepts toward female equivalents, is a feature. The model did not learn this because someone told it to. It learned it because that geometric relationship is present in the statistical patterns of language.
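The arithmetic is easier to see with numbers. What follows is a minimal sketch, not output from a real model: the three-dimensional vectors are hand-picked, hypothetical embeddings chosen so the geometry is visible, and the helper names `cosine` and `nearest` are illustrative.

```python
import numpy as np

# Hypothetical toy embeddings (assumption: hand-picked, not learned).
# By construction the axes are roughly (royalty, maleness, femaleness).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
    "apple": np.array([0.0, 0.1, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec, exclude=()):
    """Return the vocabulary word whose embedding is most similar to vec."""
    scores = {w: cosine(vec, v) for w, v in emb.items() if w not in exclude}
    return max(scores, key=scores.get)

# "King" minus "man" plus "woman"
target = emb["king"] - emb["man"] + emb["woman"]
result = nearest(target, exclude=("king", "man", "woman"))

# The same direction separates both male/female pairs.
gender_direction = emb["woman"] - emb["man"]
royal_gender_direction = emb["queen"] - emb["king"]

print(result)
print(gender_direction, royal_gender_direction)
```

In this toy setup the two difference vectors are identical, which is the "consistent direction" idea in its cleanest form. In learned embeddings the match is approximate rather than exact.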

Figure (feature_space.diagram): Features are directions. Similar concepts cluster. Relationships are consistent across the space.

What is a circuit?

A circuit is a small, specific set of neurons and connections that work together to implement one behaviour. Think of it as a subroutine, a few components wired together to do one job.

Researchers have found circuits for specific tasks. One detects when a pattern is being repeated and completes it. Another identifies the indirect object of a sentence, so the model knows that "When Mary and John went to the store, John gave a drink to" should continue with Mary rather than John. Finding these circuits is slow, painstaking work, closer to tracing wires on an old circuit board than to running a search. But it produces real understanding, a causal account of what specific parts of the network are doing, rather than a statistical approximation of behaviour.

The circuit is the unit of analysis that many mechanistic interpretability researchers believe in. Not the individual neuron. Not the full network. The small, readable subgraph.

One of the main techniques for finding circuits is activation patching. You run the model on two slightly different inputs, save the activations from one forward pass, swap some of them into the other, then check whether the output changes. If swapping a specific set of activations changes the output in a predictable way, you have evidence that those activations are causally responsible for the behaviour. It is slow, one hypothesis at a time, but it tests causation rather than correlation.
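The patching procedure can be sketched on a network small enough to check by hand. Everything here is an assumption made for illustration: the weights `W1` and `W2` are hand-set rather than trained, the `forward` function and its `patch` argument are hypothetical, and real work patches attention heads or residual stream positions in a transformer, not three hidden units.

```python
import numpy as np

# Hypothetical hand-set weights for a tiny two-layer network (not a trained model).
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.5, 0.5]])      # hidden = relu(W1 @ x)
W2 = np.array([2.0, 0.0, 1.0])   # output = W2 @ hidden

def forward(x, patch=None):
    """Run the network. `patch` maps a hidden index to a value to overwrite."""
    hidden = np.maximum(W1 @ x, 0.0)
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    return float(W2 @ hidden), hidden

clean_x = np.array([1.0, 0.0])     # input that produces the behaviour
corrupt_x = np.array([0.0, 1.0])   # slightly different input that does not

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted run
# and record how far the output moves.
effects = {}
for i in range(len(clean_hidden)):
    patched_out, _ = forward(corrupt_x, patch={i: clean_hidden[i]})
    effects[i] = patched_out - corrupt_out

print(clean_out, corrupt_out, effects)
```

Only hidden unit 0 moves the output when patched. Unit 2 feeds the output too, but it takes the same value on both inputs, so patching it changes nothing. That is the point of the method: it isolates what is responsible for the difference between the two runs.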

Why individual neurons are not the answer

The obvious first guess is that neurons are the meaningful unit. One neuron, one concept. Find the "Paris" neuron and you have found where the model stores that fact.

It does not work that way.

Neurons are polysemantic, meaning a single neuron responds to multiple unrelated concepts. Researchers have found neurons that activate for cats in images, the word "cat" in text, and cat-adjacent concepts like fur and whiskers, but also for entirely unrelated things. Not a cat neuron. The neuron is one axis of the activation space, and that axis happens to lie close to several different feature directions at once.

This is not a quirk or a failure. It is a consequence of how models store information efficiently, and understanding why requires understanding superposition.

Superposition: the messy bedroom

A model has a fixed number of dimensions to work with. In a typical large model, thousands. Each clean dimension can represent one feature reliably. The problem is the model has learned to detect tens or hundreds of thousands of features, far more than it has clean dimensions for.

So it cheats.

It stores multiple features in the same space by representing them as directions at angles to each other, like hanging extra coats on a coat rack by tilting them so they do not fully overlap. You can still retrieve each coat. They just interfere with each other slightly.

The model's vector space is, in a very real sense, a messy kid's bedroom. Everything is technically in there — the royalty feature, the verb-detection feature, the "this sentence is about Paris" feature — but nothing has a dedicated place. Features are piled on top of each other, overlapping, shoved into whatever corner of the space was available. You can find anything if you know roughly where to look. You cannot point to one spot and say "this is where the royalty feature lives," because the royalty feature is mixed in with several others at a slight angle to all of them.

The model does this because most features are only relevant some of the time. If "mentions medieval history" only fires on a tiny fraction of inputs, the model can afford to store it imprecisely, overlapping with other rarely-used features, without causing much interference in practice. It is a compression strategy. Extraordinarily efficient. A nightmare to reverse engineer.

The interpretability researcher is the parent standing in the doorway trying to figure out why there is a lunchbox in the drawer with all the legos and an old sock. There might be a reason. It is not obvious from the outside.
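The compression argument can be checked numerically. This is a minimal sketch under stated assumptions: the feature directions are random unit vectors rather than anything a model learned, the sizes (1000 features, 256 dimensions) are arbitrary, and the two active feature indices are picked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 1000, 256   # far more features than dimensions

# Assumption: random unit vectors stand in for learned feature directions.
# In high dimensions they are nearly, but not exactly, orthogonal.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only two of the 1000 features are active.
active = [3, 117]
feature_values = np.zeros(n_features)
feature_values[active] = 1.0

# Superpose: the activation vector is the sum of the active directions.
activation = feature_values @ directions          # shape (n_dims,)

# Read each feature back with a dot product against its direction.
readout = directions @ activation                 # shape (n_features,)

top2 = sorted(np.argsort(readout)[-2:].tolist())
interference = float(np.abs(np.delete(readout, active)).max())

print(top2)
print(readout[active], interference)
```

The two active features come back with readouts near 1, and every inactive feature picks up a small nonzero value. That leftover value is the interference: tolerable while few features are active, worse as more of them fire together.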

Two kinds of interpretability

Before getting into approaches, a distinction worth making. Mechanistic interpretability, which is what most of this article covers, tries to understand what is happening inside the weights. Circuits, features, superposition. The goal is to read the internals directly.

Behavioural interpretability takes a different approach entirely. Instead of opening the model up, it probes outputs systematically, asking the model questions designed to reveal its internal structure, looking for patterns in what it gets right and wrong, mapping the shape of its knowledge by testing its edges. Less like surgery, more like an extensive interview. Both approaches are active. They answer different questions.

Three approaches to making it legible

The field has developed three broad strategies for dealing with superposition.

The first is the toy model approach. Build very small, transparent models where you can see everything, then use what you learn to look for the same patterns in larger ones. The insight about superposition came from this method. A model small enough to fully inspect, trained on a simple task, produced clear evidence of features being packed into overlapping directions. That finding then became a lens for looking at larger models. You understand the glass-walled machine completely, then use that understanding as a guide for the opaque one.

The second is closer to being a hardware engineer than an archaeologist. Instead of reverse engineering an existing model, design the architecture from the start to be more interpretable. This is the intuition behind SoLU (the Softmax Linear Unit), a modified activation function developed at Anthropic. Standard activation functions let many neurons in a layer fire at once, which makes any single neuron hard to read. SoLU pushes neurons toward being monosemantic, closer to one neuron, one feature, by amplifying the strongest activations in a layer and suppressing the rest. The tradeoff is a possible performance cost. The bet is that the interpretability gain is worth more than the capability loss as models scale.
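The core of SoLU fits in one line: multiply each pre-activation by its softmax weight across the layer. The sketch below shows only that core. The published version also applies a normalisation step afterwards, which is left out here, and the input vector `pre` is a made-up example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def solu(x):
    """Softmax Linear Unit core: x * softmax(x), without the follow-up norm."""
    return x * softmax(x)

# Hypothetical pre-activations for four neurons in one layer.
pre = np.array([4.0, 1.0, 1.0, 0.5])

relu_out = np.maximum(pre, 0.0)   # ReLU leaves all four neurons firing
solu_out = solu(pre)              # SoLU keeps the strongest, shrinks the rest

print(relu_out)
print(solu_out)
```

Under ReLU all four neurons stay active at their original values. Under SoLU the largest one keeps most of its value and the others are pushed close to zero, which is the suppression of simultaneous firing described above.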

The third approach, and currently the most exciting, is sparse autoencoders. The idea: train a separate, much simpler network on the model's activations that learns to decompose the messy, superposed activations into cleaner, more monosemantic features. The autoencoder takes the bedroom and tries to sort it. It does not move the lego into a dedicated box or throw out the old sock. It learns to recognise what is what, producing a cleaner map of what features are active for any given input, even though the underlying weights remain unchanged. Anthropic published results in 2023 showing this approach could identify tens of thousands of interpretable features in a real model, far more than previous methods had managed. It is the most concrete progress so far toward untangling superposition rather than just describing it.
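Structurally, a sparse autoencoder is small. Here is a minimal sketch of the forward pass and loss, assuming a single ReLU hidden layer wider than the input and an L1 penalty for sparsity. The class name, the sizes, and the `l1_coeff` value are illustrative choices, the weights are untrained, and the random `acts` array stands in for activations that would really be recorded from a model.

```python
import numpy as np

class SparseAutoencoder:
    """Illustrative sparse autoencoder: wide ReLU encoder, linear decoder."""

    def __init__(self, d_model, d_features, l1_coeff=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
        self.b_enc = np.zeros(d_features)
        self.W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
        self.b_dec = np.zeros(d_model)
        self.l1_coeff = l1_coeff

    def encode(self, acts):
        # Feature activations are non-negative; training pushes most to zero.
        return np.maximum(acts @ self.W_enc + self.b_enc, 0.0)

    def decode(self, feats):
        # Rebuild the original activations from the feature activations.
        return feats @ self.W_dec + self.b_dec

    def loss(self, acts):
        feats = self.encode(acts)
        recon = self.decode(feats)
        mse = np.mean(np.sum((recon - acts) ** 2, axis=-1))     # fidelity
        sparsity = np.mean(np.sum(np.abs(feats), axis=-1))      # L1 penalty
        return float(mse + self.l1_coeff * sparsity)

# Stand-in data: 8 activation vectors of width 16 (assumption, not real data).
acts = np.random.default_rng(1).normal(size=(8, 16))
sae = SparseAutoencoder(d_model=16, d_features=64)

feats = sae.encode(acts)
recon = sae.decode(feats)
total_loss = sae.loss(acts)

print(feats.shape, recon.shape, total_loss)
```

Training would minimise this loss by gradient descent over many recorded activations. The two terms pull against each other: reconstruct the activations faithfully, but use as few features as possible to do it. The model being studied is never modified.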

None of these approaches has cracked interpretability. Together they represent the most promising directions the field has found.

Where this picture breaks down

Interpretability research has produced real results on small models and specific circuits. Scaling those results to frontier models is an open problem. The circuits found in smaller models do not always have clean equivalents in larger ones, and the superposition problem gets worse as models grow, not better. The tools that work at small scale may not work at the scale that matters most.

The features-as-directions framing is also an approximation. The king/queen/man/woman example is clean and memorable and real. Most features in a trained model are not that clean. They are distributed, context-dependent, and often impossible to label in human terms. The geometry is real. The readability is not guaranteed.

Finally, most current interpretability research focuses on transformer-based language models. As AI systems become more diverse, including multimodal models, agents that act over long time horizons, and architectures that differ significantly from the current generation, the tools developed for today's models may need to be rebuilt from scratch. The field is young enough that its foundational assumptions are still being tested.