How a Network Learns: Backprop & Gradient Descent
Every neural network starts out completely wrong. Here's the surprisingly simple idea that fixes it.
In the last article we established what a neural network is: a function with billions of weights that maps inputs to outputs. What we didn't answer is how those weights get their values. The answer is training — and it starts with the network being completely, embarrassingly wrong.
The untrained network
When a neural network is first created, its weights are initialised randomly. There's no knowledge in them — just noise. Feed it any input and it will produce a meaningless output. The network isn't broken. It simply hasn't learned anything yet.
We'll use a single running example throughout this article: a neural network that predicts house prices. It takes four inputs — size, bedrooms, distance to city centre, building age — and produces a single number: a predicted sale price.
Predicted price: €47,000 · Actual price: €380,000
Error: −€333,000 · random weights, meaningless output
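A minimal sketch of what "random weights, meaningless output" means in code. The model here is a single linear layer (the simplest possible stand-in for a network), and the feature values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Four input features: size (m²), bedrooms, distance to centre (km), age (years)
x = np.array([120.0, 3.0, 5.0, 30.0])

# Randomly initialised weights and bias -- pure noise, no knowledge yet
w = rng.normal(size=4)
b = rng.normal()

predicted_price = w @ x + b
print(predicted_price)  # a meaningless number, nowhere near a real price
```

Nothing is broken here; the weights simply haven't been shaped by any data yet.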
Labelled data and the training signal
To train the network we need examples — lots of them. Each example is a pair: a set of inputs (the four features of a house) and a label (its actual sale price). We might have 100,000 such pairs from real transactions.
Training works by feeding the network one example at a time, letting it make a prediction, then comparing that prediction to the real answer. The gap between prediction and reality is the training signal. Everything that follows is about using that signal to make the weights better.
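Concretely, a labelled dataset is just features paired with answers. A tiny illustrative version (the numbers are made up):

```python
import numpy as np

# Each row of X is one house: size (m²), bedrooms, distance to centre (km), age (years)
X = np.array([
    [120.0, 3.0,  5.0, 30.0],
    [ 65.0, 2.0,  1.5, 80.0],
    [200.0, 5.0, 12.0, 10.0],
])

# y holds the labels: the actual sale price of each house
y = np.array([380_000.0, 295_000.0, 540_000.0])
```

A real dataset would have 100,000 such rows rather than three, but the shape is the same.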
The loss function: measuring wrongness
We need a single number that captures how wrong the prediction was. That number is called the loss. The formula varies, but the intuition is always the same: the bigger the gap between prediction and reality, the higher the loss. A perfect prediction produces a loss of zero.
The goal of all training is to minimise the loss across the entire dataset. Every other mechanism — gradient descent, backpropagation, learning rate — exists in service of this one objective.
Predicted €47,000 vs actual €380,000: very high loss. Predicted €374,000 vs actual €380,000: very low loss.
We want the second. Training is the process of getting there.
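One common choice for regression problems is squared error. A minimal sketch:

```python
def squared_error_loss(predicted, actual):
    """Squared error: zero for a perfect prediction, grows with the gap."""
    return (predicted - actual) ** 2

# The untrained network's guess vs a nearly correct one
print(squared_error_loss(47_000, 380_000))   # enormous
print(squared_error_loss(374_000, 380_000))  # tiny by comparison
```

Squaring has a useful side effect: it punishes large mistakes disproportionately, so the training signal is loudest exactly where the model is most wrong.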
Backpropagation: assigning blame
We know the prediction was €333,000 wrong. But the network has dozens of weights — the one connecting "bedrooms" to the hidden layer, the one for "location", the one for "size", and so on. Which of them actually caused the error? And by how much?
Backpropagation is the algorithm that answers this. It works backwards through the network from the loss — using the chain rule from calculus — calculating how much each individual weight contributed to the mistake. One backwards pass, all weights assessed simultaneously.
| Input | Current weight | Error signal | Verdict |
|---|---|---|---|
| Size (m²) | −0.31 | −0.74 | way off |
| Bedrooms | +0.18 | −0.21 | slightly off |
| Distance to centre | +0.02 | −0.68 | way off |
| Building age | +0.41 | −0.08 | close |
Backprop assigns an error signal to each weight. Larger magnitude = more responsible for the €333k mistake.
Size and distance to centre are the main culprits — their weights are contributing the most to the error. Building age is barely at fault. Now that we know what's wrong, we can fix it.
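For a single linear layer with squared-error loss, the chain rule can be worked by hand, which makes the "blame assignment" concrete. A sketch using the illustrative weights from the table (not a trained model):

```python
import numpy as np

# One training example: features and the true price
x = np.array([120.0, 3.0, 5.0, 30.0])     # size, bedrooms, distance, age
y = 380_000.0

w = np.array([-0.31, 0.18, 0.02, 0.41])   # illustrative weights from the table
b = 0.0

pred = w @ x + b
loss = (pred - y) ** 2

# Chain rule for this model:
#   dL/dpred = 2 * (pred - y), and dpred/dw_i = x_i,
# so each weight's error signal is the product of the two.
dloss_dpred = 2 * (pred - y)
grad_w = dloss_dpred * x    # one value per weight: its share of the blame
grad_b = dloss_dpred

print(grad_w)
```

One backwards pass produces `grad_w` for every weight at once. In a deep network the same chain rule is applied layer by layer, but the principle is identical.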
Gradient descent: the update rule
Gradient descent is an optimisation algorithm. Its job is to use those error signals from backprop to update every weight in the direction that reduces the loss — and to do this by a controlled amount. It takes the blame assignment from backpropagation and turns it into concrete changes.
In our house model: gradient descent looks at each weight's error signal and asks — should this weight increase or decrease, and by how much? It then nudges every weight simultaneously. One small update across all weights, repeated across the entire dataset, many times over. That's training.
| Input | Old weight | Nudge | New weight | Effect |
|---|---|---|---|---|
| Size (m²) | −0.31 | +0.15 | −0.16 | large correction |
| Bedrooms | +0.18 | +0.04 | +0.22 | small correction |
| Distance to centre | +0.02 | +0.13 | +0.15 | large correction |
| Building age | +0.41 | +0.02 | +0.43 | barely touched |
One gradient descent step. Weights most responsible for the error receive the largest nudge. Building age, barely at fault, is barely touched.
This is one update — one step. The model is now slightly less wrong. Run this process across 100,000 house examples, many times over, and the weights gradually converge toward values that produce accurate predictions.
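The update rule itself is one line: move each weight against its gradient, scaled by the learning rate. A sketch reproducing the table above (the gradient values are contrived so that, at a learning rate of 0.15, the nudges match the table exactly):

```python
import numpy as np

def gradient_descent_step(w, grad_w, learning_rate=0.15):
    """Nudge every weight against its gradient, scaled by the learning rate."""
    return w - learning_rate * grad_w

w = np.array([-0.31, 0.18, 0.02, 0.41])     # "Old weight" column
nudges = np.array([0.15, 0.04, 0.13, 0.02])  # "Nudge" column
grad_w = -nudges / 0.15                      # gradients that produce those nudges

new_w = gradient_descent_step(w, grad_w)
print(new_w)  # matches the "New weight" column: -0.16, 0.22, 0.15, 0.43
```

The minus sign is the whole trick: the gradient points uphill on the loss surface, so stepping against it moves the loss down.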
Learning rate: how much to nudge
Each update, gradient descent nudges every weight by a small amount. The learning rate controls how small.
Learning rate too high: a nudge of +0.80 overshoots — the model swings past the actual €380,000 to predict €720,000, then overcorrects back the other way. It never settles.
Learning rate too low: a nudge of +0.001 points in the right direction, but each step is so tiny that reaching a good solution takes an impractical number of iterations.
Learning rate well chosen: a nudge of +0.15 makes meaningful progress each step and converges stably.
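The three regimes are easy to see on a toy one-dimensional problem. This sketch minimises f(w) = (w − 5)², whose minimum is at w = 5 — the function and learning-rate values are chosen only for illustration:

```python
def minimise(start, learning_rate, steps=50):
    """Gradient descent on f(w) = (w - 5)**2; the gradient is 2*(w - 5)."""
    w = start
    for _ in range(steps):
        grad = 2 * (w - 5)
        w -= learning_rate * grad
    return w

print(minimise(0.0, 1.1))    # too high: each step overshoots further, diverges
print(minimise(0.0, 0.001))  # too low: barely moved toward 5 after 50 steps
print(minimise(0.0, 0.15))   # well chosen: essentially at the minimum
```

For this quadratic, each step multiplies the distance to the minimum by (1 − 2 × learning rate); at 1.1 that factor is −1.2, so the error oscillates and grows, while at 0.15 it shrinks by 30% per step.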
What training actually looks like at scale
One full pass through the entire training dataset is called an epoch. You rarely process the whole dataset in a single update — instead you use mini-batches of examples (say, 32 or 256 houses at a time), running a gradient descent step after each batch. This is called mini-batch stochastic gradient descent. It is much faster than waiting for the full dataset before each update, and the noise introduced by small batches can even help the model escape poor regions of the loss landscape.
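Putting epochs, batches, and updates together, a training loop for a linear model looks roughly like this. A minimal sketch, assuming features have already been scaled to a sensible range (raw house features would need normalising first, and the default learning rate is an assumption, not a recommendation):

```python
import numpy as np

def train(X, y, epochs=10, batch_size=32, learning_rate=0.05):
    """Mini-batch gradient descent on a linear model with squared-error loss."""
    rng = np.random.default_rng(0)
    n_samples, n_features = X.shape
    w = rng.normal(size=n_features)          # random start: the untrained network
    b = 0.0
    for _ in range(epochs):                  # one epoch = one full pass over the data
        order = rng.permutation(n_samples)   # shuffle examples each epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            pred = X[batch] @ w + b
            err = pred - y[batch]            # training signal per example
            grad_w = 2 * X[batch].T @ err / len(batch)   # backprop, worked by hand
            grad_b = 2 * err.mean()
            w -= learning_rate * grad_w      # one gradient descent step per batch
            b -= learning_rate * grad_b
    return w, b
```

Every concept from this article appears here: the loss drives the error signal, the gradients assign blame, the learning rate scales the nudge, and the loop repeats across batches and epochs until the weights converge.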
Our house price model might train on 100,000 examples over a few epochs, taking minutes on a laptop. Training GPT-4 involved trillions of tokens, thousands of GPUs, and ran for months. The algorithm is identical. Only the scale differs.
Where the analogy breaks down
Real networks have billions of weights, not four. The tables above show four input weights for clarity. In practice a single layer might have millions of connections, and backprop computes gradients for all of them simultaneously.
The loss landscape is not smooth. Real training involves cliffs, plateaus, and local minima. Modern training uses tricks — momentum, adaptive learning rates, weight decay — to navigate this. Gradient descent in its pure form is rarely used alone.
Training doesn't find the global minimum. Models converge to "good enough" solutions, not optimal ones. For very large models, good enough turns out to be remarkably capable.
The numbers here are illustrative. Actual weight values, error signals, and nudges in real networks are not this legible. They're high-dimensional vectors — meaningful to the maths, not to human inspection.