Article 02  ·  Foundations

How a Network Learns: Backprop & Gradient Descent

Every neural network starts out completely wrong. Here's the surprisingly simple idea that fixes it.

In the last article we established what a neural network is: a function with billions of weights that maps inputs to outputs. What we didn't answer is how those weights get their values. The answer is training — and it starts with the network being completely, embarrassingly wrong.

The untrained network

When a neural network is first created, its weights are initialised randomly. There's no knowledge in them — just noise. Feed it any input and it will produce a meaningless output. The network isn't broken. It simply hasn't learned anything yet.

We'll use a single running example throughout this article: a neural network that predicts house prices. It takes four inputs — size, bedrooms, distance to city centre, building age — and produces a single number: a predicted sale price.

Running example — the untrained model
Input: 120m²  ·  3 bedrooms  ·  2km from centre  ·  15 years old
Predicted price: €47,000    Actual price: €380,000
Error: −€333,000  ·  random weights, meaningless output

Labelled data and the training signal

To train the network we need examples — lots of them. Each example is a pair: a set of inputs (the four features of a house) and a label (its actual sale price). We might have 100,000 such pairs from real transactions.

Training works by feeding the network one example at a time, letting it make a prediction, then comparing that prediction to the real answer. The gap between prediction and reality is the training signal. Everything that follows is about using that signal to make the weights better.

The loss function: measuring wrongness

We need a single number that captures how wrong the prediction was. That number is called the loss. The formula varies, but the intuition is always the same: the bigger the gap between prediction and reality, the higher the loss. A perfect prediction produces a loss of zero.

The goal of all training is to minimise the loss across the entire dataset. Every other mechanism — gradient descent, backpropagation, learning rate — exists in service of this one objective.

Running example — the loss
Predicted €47,000 vs actual €380,000: very high loss.
Predicted €374,000 vs actual €380,000: very low loss.
We want the second. Training is the process of getting there.
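
In code, the simplest common choice of loss is squared error. This is a minimal sketch, not the only possible loss function, but it captures the intuition: the loss grows with the gap, and a perfect prediction scores zero.

```python
def squared_error(predicted: float, actual: float) -> float:
    """Loss grows with the square of the gap; a perfect prediction scores zero."""
    return (predicted - actual) ** 2

# The two predictions from the running example:
print(squared_error(47_000, 380_000))   # enormous loss
print(squared_error(374_000, 380_000))  # small loss
print(squared_error(380_000, 380_000))  # perfect prediction: zero loss
```

Squaring has a useful side effect: large errors are punished disproportionately, so training focuses first on the worst mistakes.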

Backpropagation: assigning blame

We know the prediction was €333,000 wrong. But the network has dozens of weights — the one connecting "bedrooms" to the hidden layer, the one for "location", the one for "size", and so on. Which of them actually caused the error? And by how much?

Backpropagation is the algorithm that answers this. It works backwards through the network from the loss — using the chain rule from calculus — calculating how much each individual weight contributed to the mistake. One backwards pass, all weights assessed simultaneously.

backprop_blame.diagram
| Input | Current weight | Error signal | Verdict |
|---|---|---|---|
| Size (m²) | −0.31 | −0.74 | way off |
| Bedrooms | +0.18 | −0.21 | slightly off |
| Distance to centre | +0.02 | −0.68 | way off |
| Building age | +0.41 | −0.08 | close |

Backprop assigns an error signal to each weight. Larger magnitude = more responsible for the €333k mistake.

Size and distance to centre are the main culprits — their weights are contributing the most to the error. Building age is barely at fault. Now that we know what's wrong, we can fix it.
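
For a single linear neuron, the chain rule backprop applies is short enough to do by hand: with loss = (prediction − actual)², the error signal for each weight is 2 × (prediction − actual) × its input. The sketch below assumes exactly that one-neuron model, with illustrative scaled features and prices in millions of euros (not the article's exact numbers).

```python
# A one-neuron "network": price = w1*size + w2*beds + w3*dist + w4*age.
# Backprop here is the chain rule applied by hand:
#   loss       = (pred - actual)^2
#   dloss/dw_i = 2 * (pred - actual) * x_i
# Features and weights are illustrative, pre-scaled values.

features = [1.20, 0.3, 0.2, 0.15]    # size, bedrooms, distance, age (scaled)
weights  = [-0.31, 0.18, 0.02, 0.41]

pred = sum(w * x for w, x in zip(weights, features))
actual = 0.38                        # actual price, in millions of euros
error = pred - actual

# One backward pass: an error signal (gradient) for every weight at once.
gradients = [2 * error * x for x in features]
for name, g in zip(["size", "bedrooms", "distance", "age"], gradients):
    print(f"{name:10s} error signal = {g:+.3f}")
```

Notice that every gradient comes out of a single backward computation from the shared error term. In deep networks the same chain rule is applied layer by layer, which is why one backwards pass is enough to assess every weight.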

Gradient descent: the update rule

Gradient descent is an optimisation algorithm. Its job is to use those error signals from backprop to update every weight in the direction that reduces the loss — and to do this by a controlled amount. It takes the blame assignment from backpropagation and turns it into concrete changes.

In our house model: gradient descent looks at each weight's error signal and asks — should this weight increase or decrease, and by how much? It then nudges every weight simultaneously. One small update across all weights, repeated across the entire dataset, many times over. That's training.

gradient_descent_update.diagram
| Input | Old weight | Nudge | New weight | Effect |
|---|---|---|---|---|
| Size (m²) | −0.31 | +0.15 | −0.16 | larger contribution |
| Bedrooms | +0.18 | +0.04 | +0.22 | slight increase |
| Distance to centre | +0.02 | +0.13 | +0.15 | larger contribution |
| Building age | +0.41 | +0.02 | +0.43 | barely touched |

One gradient descent step. Weights most responsible for the error receive the largest nudge. Building age, barely at fault, is barely touched.

This is one update — one step. The model is now slightly less wrong. Run this process across 100,000 house examples, many times over, and the weights gradually converge toward values that produce accurate predictions.
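
The update rule itself is one line: each weight moves against its error signal, scaled by a learning rate. This sketch uses the error signals from the blame table; the learning rate of 0.2 is an assumption chosen because it roughly reproduces the nudges in the table above (which were rounded to two decimals).

```python
# One gradient descent step: move each weight against its error signal.
#   w_new = w_old - learning_rate * error_signal
# Learning rate of 0.2 is an illustrative assumption.

learning_rate = 0.2
weights = {"size": -0.31, "bedrooms": 0.18, "distance": 0.02, "age": 0.41}
signals = {"size": -0.74, "bedrooms": -0.21, "distance": -0.68, "age": -0.08}

for name in weights:
    nudge = -learning_rate * signals[name]
    weights[name] += nudge
    print(f"{name:10s} nudge {nudge:+.3f} -> new weight {weights[name]:+.3f}")
```

The minus sign is the whole trick: a negative error signal means "increase this weight to reduce the loss", so the nudge points the opposite way to the gradient.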

Learning rate: how much to nudge

Each update, gradient descent nudges every weight by a small amount. The learning rate controls how small.

Running example — learning rate
Backprop tells us the size weight needs to increase. By how much?

Learning rate too high: we nudge it by +0.80 — overshooting, the model now predicts €720,000 and swings past the actual €380,000 in the other direction. It never settles.

Learning rate too low: we nudge it by +0.001 — the direction is right but each step is so tiny that reaching a good solution takes an impractical number of iterations.

Learning rate well chosen: we nudge it by +0.15 — meaningful progress each step, stable convergence.
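
All three behaviours are easy to reproduce on a toy one-dimensional loss (this is not the house model — just the simplest function with a single minimum, loss(w) = (w − 3)², whose minimum sits at w = 3):

```python
# Minimise loss(w) = (w - 3)^2 from w = 0 with three learning rates.
# The minimum is at w = 3; the gradient is 2 * (w - 3).

def step(w: float, lr: float) -> float:
    grad = 2 * (w - 3)
    return w - lr * grad

for lr in (1.1, 0.001, 0.3):
    w = 0.0
    for _ in range(20):
        w = step(w, lr)
    print(f"lr={lr:<6} after 20 steps: w = {w:.3f}")
```

With lr = 1.1 each step overshoots the minimum and the swings grow; with lr = 0.001 the direction is right but w barely moves from 0; with lr = 0.3 it settles on 3 almost exactly.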

What training actually looks like at scale

One full pass through the entire training dataset is called an epoch. You rarely compute an update from the whole dataset at once — instead you use batches of examples (say, 32 or 256 houses at a time), running a gradient descent step after each batch. This is called (minibatch) stochastic gradient descent: it is far faster than waiting for the full dataset before each update, and in practice the noise from small batches does little harm and can even help training.
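
Putting the whole loop together — epochs, shuffling, batches, one update per batch — looks roughly like this. The sketch trains a linear model on synthetic house data; the "true" weights, dataset size, and hyperparameters are all illustrative assumptions.

```python
import random

# Minibatch SGD sketch for a linear price model on synthetic data.
# One epoch = one full pass over the dataset; one update per batch.
random.seed(0)
true_w = [2.0, 0.5, -0.8, -0.3]          # hidden "true" weights to recover
data = []
for _ in range(1000):                    # 1,000 synthetic houses
    x = [random.random() for _ in range(4)]
    y = sum(w * xi for w, xi in zip(true_w, x))
    data.append((x, y))

w = [0.0] * 4                            # start completely wrong
lr, batch_size, epochs = 0.1, 32, 20

for epoch in range(epochs):
    random.shuffle(data)                 # the "stochastic" part
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grads = [0.0] * 4
        for x, y in batch:               # accumulate error signals
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(4):
                grads[i] += 2 * err * x[i]
        for i in range(4):               # one nudge per batch
            w[i] -= lr * grads[i] / len(batch)

print("learned weights:", [round(wi, 2) for wi in w])
```

After 20 epochs the learned weights have converged close to the hidden true weights — predict, measure, assign blame, nudge, repeat.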

Our house price model might train on 100,000 examples over a few epochs, taking minutes on a laptop. Training GPT-4 involved trillions of tokens, thousands of GPUs, and ran for months. The algorithm is identical. Only the scale differs.

The key insight: There is no magic in training. It is a loop: predict, measure wrongness, run backprop to assign blame, use gradient descent to nudge every weight, repeat. What produces intelligence is running this loop billions of times with enough data and enough parameters.

Where the analogy breaks down

⚠ Limitations of this picture

Real networks have billions of weights, not four. The tables above show four input weights for clarity. In practice a single layer might have millions of connections, and backprop computes gradients for all of them simultaneously.

The loss landscape is not smooth. Real training involves cliffs, plateaus, and local minima. Modern training uses tricks — momentum, adaptive learning rates, weight decay — to navigate this. Gradient descent in its pure form is rarely used alone.

Training doesn't find the global minimum. Models converge to "good enough" solutions, not optimal ones. For very large models, good enough turns out to be remarkably capable.

The numbers here are illustrative. Actual weight values, error signals, and nudges in real networks are not this legible. They're high-dimensional vectors — meaningful to the maths, not to human inspection.