Article 07 · Module 02 — Transformer Architecture

Why Transformers Beat Everything Before Them

The transformer wasn't the cleverest architecture ever proposed. It was the one that scaled. Here is why that turned out to matter more than anything else.

What came before and why it wasn't enough

Recurrent neural networks were not bad. For most of the 2010s they were the best tool available for language tasks such as translation, summarisation, and question answering. They worked. Researchers built real products with them.

The problem was the ceiling. The fading memory problem from Article 04 meant that long sequences were always a struggle. The sequential processing requirement meant training was slow: you could not parallelise across a sequence because each step depended on the one before it. And slow training limited how much data you could use, because more data means more training time. The architecture and the available compute were mismatched in a way that had no clean solution within the RNN framework.

LSTMs (long short-term memory networks), a more sophisticated variant of the RNN, helped. They introduced gating mechanisms: controlled switches that decide what information to keep and what to discard as the model reads along a sequence. Gating extended the effective memory window considerably. But LSTMs did not solve the parallelisation problem, and they added enough complexity that they were harder to scale and tune. The ceiling rose slightly. It did not disappear.
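
A minimal sketch of what a gate does, assuming a single scalar memory cell. The function name `cell_update` and the gate values are hypothetical; in a real LSTM the gates are computed from the current input and the previous hidden state, not fixed by hand. The update itself (forget gate times old memory, plus input gate times new candidate) is the standard LSTM cell update.

```python
def cell_update(cell, forget_gate, input_gate, candidate):
    """One LSTM-style memory update: keep a fraction of the old value, add a fraction of the new one."""
    return forget_gate * cell + input_gate * candidate

def memory_after(steps, forget_gate):
    """Store 1.0 in the cell, then read `steps` tokens that add nothing new (input gate closed)."""
    cell = 1.0
    for _ in range(steps):
        cell = cell_update(cell, forget_gate, input_gate=0.0, candidate=0.0)
    return cell

# Hand-picked gate values, for illustration only.
retained_open = memory_after(steps=50, forget_gate=0.99)   # gate mostly open: memory persists
retained_leaky = memory_after(steps=50, forget_gate=0.5)   # gate half closed: memory fades fast

print(retained_open, retained_leaky)
```

The loop is still strictly step by step, which is the part gating could not fix.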

The three things transformers got right simultaneously

The 2017 paper Attention Is All You Need proposed a different architecture built entirely around the attention mechanism from Article 05. It got three things right at once, and the combination was what mattered.

First, attention solved the long-range dependency problem cleanly. Every token can attend directly to every other token regardless of distance. The trophy and the pronoun referring to it five words later are no further apart than adjacent words.

Second, because the attention computation for one position does not depend on the result for any other position, the entire sequence can be processed in parallel during training. This meant transformers could use GPU hardware, which is designed for doing many simple computations simultaneously, far more efficiently than RNNs ever could. Training that would have taken weeks could now take days.
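
A minimal sketch of the dependency structure, in plain Python with made-up numbers. The names `rnn_states` and `attend_position` are hypothetical, and both functions are toy stand-ins rather than real model code. The recurrence has to run in order; the attention outputs can be computed in any order (and so, on suitable hardware, all at once).

```python
import math

def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Toy recurrence: each state needs the previous state, so the loop is inherently serial."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def attend_position(i, inputs):
    """Toy attention output for position i: a softmax-weighted average of all inputs.
    It reads only the raw inputs, never another position's output."""
    scores = [inputs[i] * x for x in inputs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))

inputs = [0.2, -1.0, 0.7, 1.5]

# Compute attention outputs front to back, then back to front.
in_order = [attend_position(i, inputs) for i in range(len(inputs))]
reversed_order = {i: attend_position(i, inputs) for i in reversed(range(len(inputs)))}
reassembled = [reversed_order[i] for i in range(len(inputs))]

print(in_order == reassembled)  # order of computation makes no difference
print(rnn_states(inputs))       # these can only be produced one after another
```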

Third, and least obviously, the architecture turned out to be general. Attention is a mechanism for finding relationships in sequences. Almost any data can be framed as a sequence: words, image patches, audio frames, protein residues, lines of code. The same architecture that translated text in 2017 now underpins much of modern AI.

No single one of these was a decisive advantage. Together they were.

The components you already know

Tokens are the input. A sentence gets broken into pieces, sometimes words, sometimes parts of words, and each piece becomes a token. "The cat sat on the mat" might become six tokens, one per word.
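
A minimal sketch of both cases. Whitespace splitting gives the one-token-per-word result for the example sentence. The subword splitter is a toy: `toy_vocab` and `greedy_subword_split` are hypothetical, and production tokenisers learn their vocabularies from data rather than using a hand-written list.

```python
sentence = "The cat sat on the mat"
word_tokens = sentence.split()
print(word_tokens)       # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(len(word_tokens))  # 6

# Hypothetical subword vocabulary, for illustration only.
toy_vocab = ["un", "believ", "able", "cat", "s"]

def greedy_subword_split(word, vocab):
    """Repeatedly take the longest vocabulary piece that matches at the current position;
    fall back to a single character when nothing matches."""
    pieces = []
    start = 0
    while start < len(word):
        match = None
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                match = word[start:end]
                break
        if match is None:
            match = word[start]
        pieces.append(match)
        start += len(match)
    return pieces

print(greedy_subword_split("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
print(greedy_subword_split("cats", toy_vocab))          # ['cat', 's']
```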

Positional encoding is a small addition to each token's vector that tells the model where in the sequence the token sits. Attention on its own has no sense of order: "cat sat" and "sat cat" would look identical without it. Positional encoding solves this by injecting position information directly into the token's representation before any attention is computed.
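
A minimal sketch using the sinusoidal scheme from the 2017 paper (sine on even dimensions, cosine on odd ones). The 4-dimensional `embeddings` are made up, and `as_unordered` is a hypothetical helper that mimics an order-blind view of the sequence. Many current models use other positional schemes, so treat this as one example rather than the standard.

```python
import math

def sinusoidal_position(pos, d_model):
    """Position vector for index `pos`: sin/cos at wavelengths that grow with the dimension index."""
    vec = []
    for j in range(d_model):
        angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
        vec.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return vec

# Hypothetical token embeddings, for illustration only.
embeddings = {"cat": [0.1, 0.3, -0.2, 0.5], "sat": [0.4, -0.1, 0.2, 0.0]}

def embed(tokens, with_position):
    rows = []
    for pos, tok in enumerate(tokens):
        row = list(embeddings[tok])
        if with_position:
            pe = sinusoidal_position(pos, len(row))
            row = [a + b for a, b in zip(row, pe)]
        rows.append(row)
    return rows

def as_unordered(rows):
    """Order-blind view of a sequence: the same rows, sorted."""
    return sorted(tuple(round(v, 9) for v in row) for row in rows)

same_without = as_unordered(embed(["cat", "sat"], False)) == as_unordered(embed(["sat", "cat"], False))
same_with = as_unordered(embed(["cat", "sat"], True)) == as_unordered(embed(["sat", "cat"], True))

print(same_without)  # True: without positions the two orderings are indistinguishable
print(same_with)     # False: with positions added they differ
```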

The attention mechanism, covered in Articles 05 and 06, is the core operation. Each token generates a query, a key, and a value. Queries match against keys to produce scores. Scores weight the blending of values. The result is a representation of each token shaped by its full context.
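
The query, key, and value steps can be sketched in a few lines of plain Python. This assumes scaled dot-product attention as in the 2017 paper (scores divided by the square root of the key dimension, then a softmax). The vectors below are hand-written for illustration; in a real model the queries, keys, and values come from learned projections of each token's vector.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """For each query: score it against every key, softmax the scores, blend the values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each row of `weights` says how much one token draws from every other token; each row of `outputs` is the resulting context-shaped representation.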

Feed-forward layers sit after the attention step in each block. Attention does relational work, figuring out how tokens relate to each other. The feed-forward layer does additional computation on each token individually. Attention gathers context. The feed-forward layer processes it.
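
A minimal sketch of the "each token individually" point, assuming the two-layer form with a ReLU in between that the original paper used. The weights `W1`, `B1`, `W2`, `B2` are made-up numbers; real ones are learned. Changing one token's vector changes only that token's output.

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def linear(x, weight, bias):
    """One output per weight row: dot(row, x) + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

# Hypothetical weights: expand 2 -> 3 dimensions, then project 3 -> 2.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
B1 = [0.0, 0.1, -0.2]
W2 = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.5]]
B2 = [0.0, 0.0]

def feed_forward(token_vec):
    """Applied to one token at a time; it never sees the other tokens."""
    return linear(relu(linear(token_vec, W1, B1)), W2, B2)

sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
out = [feed_forward(t) for t in sequence]

# Replace only the middle token and run again.
changed = [sequence[0], [9.0, 9.0], sequence[2]]
out_changed = [feed_forward(t) for t in changed]

print(out[0] == out_changed[0], out[2] == out_changed[2])  # neighbours unaffected
print(out[1] == out_changed[1])                            # only the edited token changes
```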

The whole block gets repeated many times. Each repetition is a layer. More layers means more passes of refinement before the output is produced. A small model might have 12 layers. A large one might have 96.

transformer_architecture.diagram

Encoder
    Input tokens (e.g. "The cat sat")
    ↓
    Embedding + Positional Encoding (tokens become vectors; position added)
    ↓
    Encoder Block × N
        Multi-Head Self-Attention (→ Articles 05 & 06)
        Feed-Forward Layer (→ Article 01)
    ↓
    Encoded representation (rich understanding of input) → feeds the decoder's Cross-Attention

Decoder
    Output tokens so far (e.g. "Le chat")
    ↓
    Embedding + Positional Encoding
    ↓
    Decoder Block × N
        Masked Self-Attention (can only see past tokens)
        Cross-Attention (← reads from encoder)
        Feed-Forward Layer
    ↓
    Next token (probability) → "s'est"

The original transformer. The encoder reads and understands; the decoder generates one token at a time, consulting both what it has generated so far and what the encoder understood. Most modern LLMs use a decoder-only design.

The translation example: encoder and decoder

The original transformer was built for machine translation; the 2017 paper evaluated it on English-to-German and English-to-French. It had two components: an encoder and a decoder, each built from the blocks described above.

Take the sentence "The cat sat on the mat." The encoder reads the full English sentence and builds a rich representation of what it means, using attention to connect every word to every other word, resolving relationships, producing a dense understanding of the input. The encoder does not generate anything. Its job is comprehension.

The decoder then generates the French translation one token at a time: "Le," then "chat," then "s'est," and so on. At each step it does two things. First, it attends to the French tokens it has already generated. This is self-attention, masked so the model cannot cheat by looking at future tokens. Second, it attends to the encoder's output. This is cross-attention, and it is how the decoder keeps consulting the original English meaning as it generates each new French word.
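
A minimal sketch of the masking step, assuming the common approach of setting scores for future positions to negative infinity before the softmax. The score matrix is made up, and `causal_attention_weights` is a hypothetical name.

```python
import math

def causal_attention_weights(scores):
    """Row i may only put weight on columns 0..i; later columns are masked out before the softmax."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores for the tokens "Le", "chat", "s'est".
scores = [
    [2.0, 1.0, 0.5],
    [0.3, 1.2, 0.9],
    [0.1, 0.4, 1.5],
]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first row puts all its weight on "Le" because nothing else is visible yet; only the last row can spread weight across all three tokens.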

The encoder/decoder split exists because translation is a two-act process: fully read the source language, then generate the target. Comprehension first, generation second. Two separate jobs, two separate components.

Why modern LLMs dropped the encoder

Translation, as framed above, splits into reading the source and then writing the target, which is why the original design gave each job its own component.

Language modelling is not. There is only one stream of text and one job: predict what comes next. The decoder can attend back to everything it has already seen in the sequence. By the time it is predicting token 50, it has attended to tokens 1 through 49, which is all the context it needs. Nothing needs to be encoded into a separate representation first. Comprehension and generation are the same operation, happening in the same component.
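
The single-stream loop can be sketched as follows. `toy_next_token` is a hypothetical stand-in for a trained model: a lookup table keyed on the whole prefix, where a real model would compute a probability distribution over its vocabulary. The shape of the loop is the point: predict, append, repeat, with each prediction conditioned on everything so far.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for a trained model: maps the full prefix to the next token."""
    table = {
        ("The",): "cat",
        ("The", "cat"): "sat",
        ("The", "cat", "sat"): "on",
        ("The", "cat", "sat", "on"): "the",
        ("The", "cat", "sat", "on", "the"): "mat",
    }
    return table.get(tuple(prefix), "<end>")

def generate(prompt, max_new_tokens=10):
    """Autoregressive decoding: the output of each step becomes part of the next step's input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

result = generate(["The"])
print(result)  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```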

Decoder-only is not a cut-down compromise. It is a natural fit for the task.

You might be wondering: but ChatGPT and Claude can translate too, so what gives? They can, but they do it differently: not by explicitly encoding a source language and decoding a target, but because exposure to vast amounts of multilingual text during pre-training made translation an emergent capability. The model has seen enough English and French side by side that it can continue a prompt that says "translate this" without needing a dedicated encoder. Same result, different mechanism.

Scale as an unexpected force

Few people predicted this part. As transformers got bigger (more layers, more parameters, more training data), they did not just get incrementally better at existing tasks. They developed qualitatively new capabilities that smaller models simply did not have.

As a rough illustration rather than a set of precise thresholds: a model with around 1 billion parameters can complete sentences. A model with around 10 billion parameters can answer questions. A model with around 100 billion parameters can write code, reason through multi-step problems, and translate between languages without ever being trained specifically for translation. Not improvements in degree. Differences in kind. The same architecture, scaled up, starts doing things that feel categorically different.

This phenomenon, sometimes called emergent capability, is still not fully understood. What is clear is that the transformer architecture did not just benefit from scale. It seemed to unlock something new as scale increased, in a way that earlier architectures did not.

What transformers are still bad at

The transformer won. It has not solved intelligence.

Attention scales quadratically with sequence length: double the tokens and the computation roughly quadruples. Context windows are not unlimited. Hallucination, the model confidently producing false information, follows from how the model works: it generates a statistically plausible next token, not a verified fact. It has no built-in mechanism for checking whether what it is saying is true.
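
The quadratic cost of attention can be checked with simple arithmetic. This sketch counts only query-key score pairs for a single attention head in a single layer; `attention_score_count` is a hypothetical helper, and real cost also depends on model width, head count, and implementation details.

```python
def attention_score_count(n_tokens):
    """Every token scores against every token, so the score matrix has n * n entries."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))

ratio = attention_score_count(2_000) / attention_score_count(1_000)
print(ratio)  # 4.0: doubling the tokens quadruples the score pairs
```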

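Transformers have no persistent memory across conversations. Every conversation starts fresh. Every context window is a clean slate. A human expert who has spent forty years studying a subject brings all of that to every sentence they read, and keeps learning as they go. A transformer brings what was fixed in its weights during training plus whatever fits in the current context window; nothing it reads in one conversation carries over to the next.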

Transformers are also extraordinarily data-hungry compared to humans. A child learns the concept of a chair from a handful of examples. A transformer requires exposure to millions of sentences containing the word before its representation stabilises. Powerful, but not efficient in the way biological intelligence is efficient.

These are not arguments against transformers. They are the shape of the problem that the next decade of research is organised around.

Where this picture breaks down

⚠ Limitations of this picture

The "three things at once" framing makes the transformer's success sound inevitable in retrospect. It was not. Attention mechanisms existed before 2017 and were used as additions to RNNs rather than replacements for them. The insight that you could discard recurrence entirely and build purely on attention was a genuine leap, not an obvious next step.

The encoder/decoder diagram is the original 2017 architecture. Modern transformers have diverged considerably: different positional encoding schemes, different normalisation approaches, different attention variants. The diagram is a map of the concept, not a blueprint for any specific model in production today.

The emergent capability story is real but contested. Some researchers argue that apparent emergence is partly an artefact of how capability is measured: smooth, gradual improvement can look like a sharp jump under an all-or-nothing metric, and look gradual again under a metric that gives partial credit. The debate is unresolved. What is not contested is that large transformers do things small transformers cannot.
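
The measurement argument can be illustrated with a toy calculation. The numbers are invented, not results from any study, and the sketch assumes each token of an answer is right or wrong independently, which is a simplification. Per-token accuracy rises steadily, while the chance of getting a 10-token answer exactly right stays near zero and then climbs steeply.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance of getting every token right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for a series of increasingly large models.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
exact = [exact_match_rate(p, answer_length=10) for p in per_token]

for p, e in zip(per_token, exact):
    print(f"per-token {p:.2f} -> exact match {e:.4f}")
```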
