Article 02  ·  Foundations

How a Network Learns: Backprop & Gradient Descent

Every neural network starts out completely wrong. Here's the surprisingly simple idea that fixes it.

In the last article we established what a neural network is: a function with billions of weights that maps inputs to outputs. What we didn't answer is how those weights get their values. The answer is training — and it starts with the network being completely, embarrassingly wrong.

The untrained network

When a neural network is first created, its weights are initialised randomly. There's no knowledge in them — just noise. Feed it any input and it will produce a meaningless output. The network isn't broken. It simply hasn't learned anything yet.

We'll use a single running example throughout this article: a neural network that predicts house prices. It takes four inputs — size, bedrooms, distance to city centre, building age — and produces a single number: a predicted sale price.

Running example — the untrained model
Input: 120m²  ·  3 bedrooms  ·  2km from centre  ·  15 years old
Predicted price: €47,000    Actual price: €380,000
Error: −€333,000  ·  random weights, meaningless output

Labelled data and the training signal

To train the network we need examples — lots of them. Each example is a pair: a set of inputs (the four features of a house) and a label (its actual sale price). We might have 100,000 such pairs from real transactions.

Training works by feeding the network one example at a time, letting it make a prediction, then comparing that prediction to the real answer. The gap between prediction and reality is the training signal. Everything that follows is about using that signal to make the weights better.

The loss function: measuring wrongness

We need a single number that captures how wrong the prediction was. That number is called the loss. The formula varies, but the intuition is always the same: the bigger the gap between prediction and reality, the higher the loss. A perfect prediction produces a loss of zero.

The goal of all training is to minimise the loss across the entire dataset. Every other mechanism — gradient descent, backpropagation, learning rate — exists in service of this one objective.

Running example — the loss
Predicted €47,000 vs actual €380,000: very high loss.
Predicted €374,000 vs actual €380,000: very low loss.
We want the second. Training is the process of getting there.
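
In code, the simplest common choice of loss is squared error. This is a minimal sketch, not the only possible loss function, but it captures the intuition: the loss grows with the gap, and a perfect prediction scores zero.

```python
def squared_error(predicted: float, actual: float) -> float:
    """Loss grows with the square of the gap; a perfect prediction scores zero."""
    return (predicted - actual) ** 2

# The two predictions from the running example:
print(squared_error(47_000, 380_000))   # enormous loss
print(squared_error(374_000, 380_000))  # small loss
print(squared_error(380_000, 380_000))  # perfect prediction: zero loss
```

Squaring has a useful side effect: large errors are punished disproportionately, so training focuses first on the worst mistakes.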

Backpropagation: assigning blame

We know the prediction was €333,000 wrong. But the network has dozens of weights — the one connecting "bedrooms" to the hidden layer, the one for "location", the one for "size", and so on. Which of them actually caused the error? And by how much?

Backpropagation is the algorithm that answers this. It works backwards through the network from the loss — using the chain rule from calculus — calculating how much each individual weight contributed to the mistake. One backwards pass, all weights assessed simultaneously.

backprop_blame.diagram
| Input | Current weight | Error signal | Verdict |
|---|---|---|---|
| Size (m²) | −0.31 | −0.74 | way off |
| Bedrooms | +0.18 | −0.21 | slightly off |
| Distance to centre | +0.02 | −0.68 | way off |
| Building age | +0.41 | −0.08 | close |

Backprop assigns an error signal to each weight. Larger magnitude = more responsible for the €333k mistake.

Size and distance to centre are the main culprits — their weights are contributing the most to the error. Building age is barely at fault. Now that we know what's wrong, we can fix it.
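
For a single linear neuron, the chain rule backprop applies is short enough to do by hand: with loss = (prediction − actual)², the error signal for each weight is 2 × (prediction − actual) × its input. The sketch below assumes exactly that one-neuron model, with illustrative scaled features and prices in millions of euros (not the article's exact numbers).

```python
# A one-neuron "network": price = w1*size + w2*beds + w3*dist + w4*age.
# Backprop here is the chain rule applied by hand:
#   loss       = (pred - actual)^2
#   dloss/dw_i = 2 * (pred - actual) * x_i
# Features and weights are illustrative, pre-scaled values.

features = [1.20, 0.3, 0.2, 0.15]    # size, bedrooms, distance, age (scaled)
weights  = [-0.31, 0.18, 0.02, 0.41]

pred = sum(w * x for w, x in zip(weights, features))
actual = 0.38                        # actual price, in millions of euros
error = pred - actual

# One backward pass: an error signal (gradient) for every weight at once.
gradients = [2 * error * x for x in features]
for name, g in zip(["size", "bedrooms", "distance", "age"], gradients):
    print(f"{name:10s} error signal = {g:+.3f}")
```

Notice that every gradient comes out of a single backward computation from the shared error term. In deep networks the same chain rule is applied layer by layer, which is why one backwards pass is enough to assess every weight.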

Gradient descent: the update rule

Gradient descent is an optimisation algorithm. Its job is to use those error signals from backprop to update every weight in the direction that reduces the loss — and to do this by a controlled amount. It takes the blame assignment from backpropagation and turns it into concrete changes.

In our house model: gradient descent looks at each weight's error signal and asks — should this weight increase or decrease, and by how much? It then nudges every weight simultaneously. One small update across all weights, repeated across the entire dataset, many times over. That's training.

gradient_descent_update.diagram
| Input | Old weight | Nudge | New weight | Effect |
|---|---|---|---|---|
| Size (m²) | −0.31 | +0.15 | −0.16 | larger contribution |
| Bedrooms | +0.18 | +0.04 | +0.22 | slight increase |
| Distance to centre | +0.02 | +0.13 | +0.15 | larger contribution |
| Building age | +0.41 | +0.02 | +0.43 | barely touched |

One gradient descent step. Weights most responsible for the error receive the largest nudge. Building age, barely at fault, is barely touched.

This is one update — one step. The model is now slightly less wrong. Run this process across 100,000 house examples, many times over, and the weights gradually converge toward values that produce accurate predictions.
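
The update rule itself is one line: each weight moves against its error signal, scaled by a learning rate. This sketch uses the error signals from the blame table; the learning rate of 0.2 is an assumption chosen because it roughly reproduces the nudges in the table above (which were rounded to two decimals).

```python
# One gradient descent step: move each weight against its error signal.
#   w_new = w_old - learning_rate * error_signal
# Learning rate of 0.2 is an illustrative assumption.

learning_rate = 0.2
weights = {"size": -0.31, "bedrooms": 0.18, "distance": 0.02, "age": 0.41}
signals = {"size": -0.74, "bedrooms": -0.21, "distance": -0.68, "age": -0.08}

for name in weights:
    nudge = -learning_rate * signals[name]
    weights[name] += nudge
    print(f"{name:10s} nudge {nudge:+.3f} -> new weight {weights[name]:+.3f}")
```

The minus sign is the whole trick: a negative error signal means "increase this weight to reduce the loss", so the nudge points the opposite way to the gradient.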

Learning rate: how much to nudge

Each update, gradient descent nudges every weight by a small amount. The learning rate controls how small.

Running example — learning rate
Backprop tells us the size weight needs to increase. By how much?

Learning rate too high: we nudge it by +0.80 — overshooting, the model now predicts €720,000 and swings past the actual €380,000 in the other direction. It never settles.

Learning rate too low: we nudge it by +0.001 — the direction is right but each step is so tiny that reaching a good solution takes an impractical number of iterations.

Learning rate well chosen: we nudge it by +0.15 — meaningful progress each step, stable convergence.
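
All three behaviours are easy to reproduce on a toy one-dimensional loss (this is not the house model — just the simplest function with a single minimum, loss(w) = (w − 3)², whose minimum sits at w = 3):

```python
# Minimise loss(w) = (w - 3)^2 from w = 0 with three learning rates.
# The minimum is at w = 3; the gradient is 2 * (w - 3).

def step(w: float, lr: float) -> float:
    grad = 2 * (w - 3)
    return w - lr * grad

for lr in (1.1, 0.001, 0.3):
    w = 0.0
    for _ in range(20):
        w = step(w, lr)
    print(f"lr={lr:<6} after 20 steps: w = {w:.3f}")
```

With lr = 1.1 each step overshoots the minimum and the swings grow; with lr = 0.001 the direction is right but w barely moves from 0; with lr = 0.3 it settles on 3 almost exactly.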

What training actually looks like at scale

One full pass through the entire training dataset is called an epoch. You rarely compute an update from the whole dataset at once — instead you use batches of examples (say, 32 or 256 houses at a time), running a gradient descent step after each batch. This is called (minibatch) stochastic gradient descent: it is far faster than waiting for the full dataset before each update, and in practice the noise from small batches does little harm and can even help training.
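
Putting the whole loop together — epochs, shuffling, batches, one update per batch — looks roughly like this. The sketch trains a linear model on synthetic house data; the "true" weights, dataset size, and hyperparameters are all illustrative assumptions.

```python
import random

# Minibatch SGD sketch for a linear price model on synthetic data.
# One epoch = one full pass over the dataset; one update per batch.
random.seed(0)
true_w = [2.0, 0.5, -0.8, -0.3]          # hidden "true" weights to recover
data = []
for _ in range(1000):                    # 1,000 synthetic houses
    x = [random.random() for _ in range(4)]
    y = sum(w * xi for w, xi in zip(true_w, x))
    data.append((x, y))

w = [0.0] * 4                            # start completely wrong
lr, batch_size, epochs = 0.1, 32, 20

for epoch in range(epochs):
    random.shuffle(data)                 # the "stochastic" part
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grads = [0.0] * 4
        for x, y in batch:               # accumulate error signals
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(4):
                grads[i] += 2 * err * x[i]
        for i in range(4):               # one nudge per batch
            w[i] -= lr * grads[i] / len(batch)

print("learned weights:", [round(wi, 2) for wi in w])
```

After 20 epochs the learned weights have converged close to the hidden true weights — predict, measure, assign blame, nudge, repeat.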

Our house price model might train on 100,000 examples over a few epochs, taking minutes on a laptop. Training GPT-4 involved trillions of tokens, thousands of GPUs, and ran for months. The algorithm is identical. Only the scale differs.

The key insight: There is no magic in training. It is a loop: predict, measure wrongness, run backprop to assign blame, use gradient descent to nudge every weight, repeat. What produces intelligence is running this loop billions of times with enough data and enough parameters.

Where the analogy breaks down

⚠ Limitations of this picture

Real networks have billions of weights, not four. The tables above show four input weights for clarity. In practice a single layer might have millions of connections, and backprop computes gradients for all of them simultaneously.

The loss landscape is not smooth. Real training involves cliffs, plateaus, and local minima. Modern training uses tricks — momentum, adaptive learning rates, weight decay — to navigate this. Gradient descent in its pure form is rarely used alone.

Training doesn't find the global minimum. Models converge to "good enough" solutions, not optimal ones. For very large models, good enough turns out to be remarkably capable.

The numbers here are illustrative. Actual weight values, error signals, and nudges in real networks are not this legible. They're high-dimensional vectors — meaningful to the maths, not to human inspection.