Backpropagation: The Engine of Learning
Deep learning models learn by minimizing a loss function. But how do we know which weights to adjust, and by how much? This is where Backpropagation (backward propagation of errors) comes in. It is the mathematical engine that powers modern AI, allowing us to efficiently calculate gradients for millions of parameters.
1. The Intuition: Credit Assignment
Imagine you are managing a large corporation. The CEO (Loss Function) realizes the company lost money this quarter. To fix this, they need to know which department is responsible.
- Forward Pass: Information flows from employees to managers to the CEO.
- Backward Pass: The CEO sends a signal back: “We lost $1M.”
- The VP of Sales realizes they missed targets by 10%, so they accept some blame.
- The VP of Engineering realizes they delayed the product, so they accept some blame.
- This “blame” (gradient) trickles down to individual teams and employees.
In a neural network, backpropagation assigns “blame” for the error to each weight in the network. A large gradient means a weight significantly contributed to the error and should be adjusted.
2. The Math: The Chain Rule
Backpropagation is simply the recursive application of the Chain Rule from calculus.
If we have a function y = f(g(x)) and we want the derivative of y with respect to x (∂y/∂x), we multiply the derivatives of the intermediate steps:

∂y/∂x = (∂y/∂g) · (∂g/∂x)
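A quick numerical check makes this concrete. The snippet below (an illustrative example, not part of the network) compares the chain-rule derivative of y = sin(x²) against a finite-difference estimate:

```python
import numpy as np

# Chain rule: dy/dx = f'(g(x)) * g'(x), with f = sin and g(x) = x**2.
x = 1.3
analytic = np.cos(x**2) * 2 * x  # cos(g(x)) * g'(x)

# Central finite-difference estimate of the same derivative
eps = 1e-6
numeric = (np.sin((x + eps)**2) - np.sin((x - eps)**2)) / (2 * eps)

print(abs(analytic - numeric) < 1e-5)  # the two estimates agree
```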
In a Neural Network
Consider a simple path: Input x → Weight w → Linear Unit z = w · x → Activation a = σ(z) → Loss L.
We want to find ∂L/∂w (how much the loss changes when we change the weight).
Using the Chain Rule:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)
Let’s break it down:
- ∂L/∂a: How the loss changes with the activation (depends on the loss function, e.g. MSE).
- ∂a/∂z: The derivative of the activation function (e.g. the sigmoid derivative σ(z)(1 − σ(z))).
- ∂z/∂w: The derivative of the linear unit with respect to the weight. Since z = w · x, this is simply x.
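As a sketch, the three factors can be multiplied out directly. Here we assume a squared-error loss L = (a − y)² and a sigmoid activation (illustrative choices; the values of x, w, and y are arbitrary), and verify the product against a finite-difference estimate:

```python
import numpy as np

# Illustrative setup: z = w * x, a = sigmoid(z), L = (a - y)**2
x, w, y = 2.0, 0.5, 1.0
z = w * x
a = 1 / (1 + np.exp(-z))

dL_da = 2 * (a - y)   # dL/da for L = (a - y)^2
da_dz = a * (1 - a)   # sigmoid derivative
dz_dw = x             # z = w * x
dL_dw = dL_da * da_dz * dz_dw

# Numerical check via central finite differences
def loss(w_):
    a_ = 1 / (1 + np.exp(-(w_ * x)))
    return (a_ - y) ** 2

eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(abs(dL_dw - numeric) < 1e-5)  # chain rule matches the numeric gradient
```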
3. Interactive: Computational Graph Playground
Visualize how gradients flow backward through a simple computational graph: L = (w · x + b − y_target)².
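Even without the playground, the same graph can be traced by hand. The sketch below runs one forward pass on L = (w · x + b − y_target)², storing each intermediate node, then multiplies local derivatives backward (the input values are arbitrary examples):

```python
# Manual forward/backward pass on the graph L = (w*x + b - y_target)**2
w, x, b, y_target = 0.8, 1.5, 0.2, 1.0

# Forward pass: store each intermediate node
z = w * x + b       # linear node
d = z - y_target    # difference node
L = d ** 2          # loss node

# Backward pass: multiply local derivatives along each edge
dL_dd = 2 * d            # dL/dd for L = d^2
dL_dz = dL_dd * 1.0      # dd/dz = 1
dL_dw = dL_dz * x        # dz/dw = x
dL_db = dL_dz * 1.0      # dz/db = 1
print(dL_dw, dL_db)
```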
4. Implementation from Scratch
Let’s implement a simple 2-layer neural network in Python using only NumPy.
The Algorithm
- Initialize weights randomly.
- Forward: Compute prediction.
- Loss: Compute error.
- Backward: Compute gradients.
- Update: w = w - α × gradient.
Python Code (NumPy)
```python
import numpy as np

# Sigmoid activation and its derivative.
# Note: sigmoid_derivative expects a value that has ALREADY been passed
# through sigmoid, since sigma'(z) = sigma(z) * (1 - sigma(z)).
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    return s * (1 - s)

# Input dataset (XOR problem)
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
expected_output = np.array([[0], [1], [1], [0]])

epochs = 10000
lr = 0.1
inputLayerNeurons, hiddenLayerNeurons, outputLayerNeurons = 2, 2, 1

# Random weight and bias initialization
hidden_weights = np.random.uniform(size=(inputLayerNeurons, hiddenLayerNeurons))
hidden_bias = np.random.uniform(size=(1, hiddenLayerNeurons))
output_weights = np.random.uniform(size=(hiddenLayerNeurons, outputLayerNeurons))
output_bias = np.random.uniform(size=(1, outputLayerNeurons))

for _ in range(epochs):
    # --- Forward Propagation ---
    hidden_layer_activation = np.dot(inputs, hidden_weights) + hidden_bias
    hidden_layer_output = sigmoid(hidden_layer_activation)
    output_layer_activation = np.dot(hidden_layer_output, output_weights) + output_bias
    predicted_output = sigmoid(output_layer_activation)

    # --- Backpropagation ---
    # 1. Error at the output
    error = expected_output - predicted_output
    # 2. Derivative of loss w.r.t. the output pre-activation (Chain Rule, part 1)
    d_predicted_output = error * sigmoid_derivative(predicted_output)
    # 3. Error propagated back to the hidden layer
    error_hidden_layer = d_predicted_output.dot(output_weights.T)
    # 4. Derivative at the hidden layer (Chain Rule, part 2)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

    # --- Update Weights and Biases ---
    output_weights += hidden_layer_output.T.dot(d_predicted_output) * lr
    output_bias += np.sum(d_predicted_output, axis=0, keepdims=True) * lr
    hidden_weights += inputs.T.dot(d_hidden_layer) * lr
    hidden_bias += np.sum(d_hidden_layer, axis=0, keepdims=True) * lr

print("Final hidden weights:\n", hidden_weights)
print("Final output weights:\n", output_weights)
print("Predicted output:\n", predicted_output)
```
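As a sanity check (not part of the original listing), the analytic hidden-layer gradient computed by the backward pass can be compared against a finite-difference estimate on a tiny fixed setup. The names W1, b1, W2, b2 and the loss 0.5 · (y − out)² are illustrative:

```python
import numpy as np

# Tiny fixed setup: one input sample, 2 hidden units, 1 output unit
rng = np.random.default_rng(0)
X = np.array([[1.0, 0.5]])
y = np.array([[1.0]])
W1 = rng.uniform(size=(2, 2)); b1 = rng.uniform(size=(1, 2))
W2 = rng.uniform(size=(2, 1)); b2 = rng.uniform(size=(1, 1))

def forward(W1_):
    h = 1 / (1 + np.exp(-(X @ W1_ + b1)))
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))
    return h, out

# Analytic gradient of L = 0.5 * (y - out)^2 w.r.t. W1[0, 0],
# using the same backward-pass formulas as the training loop
h, out = forward(W1)
d_out = (out - y) * out * (1 - out)
d_hidden = (d_out @ W2.T) * h * (1 - h)
analytic = (X.T @ d_hidden)[0, 0]

# Finite-difference estimate of the same gradient
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
Wm = W1.copy(); Wm[0, 0] -= eps
numeric = ((0.5 * (y - forward(Wp)[1])**2).item()
           - (0.5 * (y - forward(Wm)[1])**2).item()) / (2 * eps)
print(abs(analytic - numeric) < 1e-5)  # backprop matches the numeric gradient
```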
[!TIP] Vanishing Gradients: With sigmoid activations in deep networks, each layer multiplies the gradient by the sigmoid derivative, which never exceeds 0.25, so gradients shrink exponentially and early layers stop learning. This is why ReLU is preferred for hidden layers.
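A short sketch illustrates the tip: the sigmoid derivative peaks at 0.25 (at z = 0), so chaining it through ten layers shrinks the gradient by roughly a factor of a million:

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

peak = sigmoid_deriv(0.0)  # maximum possible value of the sigmoid derivative
print(peak)                # 0.25
print(peak ** 10)          # best-case gradient factor after 10 sigmoid layers
```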
5. Summary
Backpropagation is the bridge between the error and the parameters. It tells each weight exactly how much it contributed to the error, allowing us to systematically improve the model.