Multivariable Calculus: The Gradient Vector
1. Introduction: Beyond y = f(x)
Real-world problems rarely depend on a single variable.
- House Price: Depends on (Size, Rooms, Location, Age).
- Neural Network Loss: Depends on millions of weights (w1, w2, …, wn).
We need calculus for functions with vector inputs: f(x).
2. Partial Derivatives
If z = f(x, y) = x² + y², how does z change? It depends on which direction you move!
- Partial with respect to x (∂f / ∂x): Treat y as a constant (like slicing the mountain along the East-West axis), then differentiate with respect to x.
- Partial with respect to y (∂f / ∂y): Treat x as a constant (slicing North-South), then differentiate with respect to y.
Example: f(x, y) = 3x²y
- ∂f / ∂x = 6xy (Treat y as a constant, like 5).
- ∂f / ∂y = 3x² (Treat x as a constant, like 5).
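To sanity-check results like these, a finite-difference approximation works: nudge one input while holding the other fixed. Below is a minimal sketch in plain Python (the step size h and the test point are arbitrary choices for illustration):

```python
def f(x, y):
    return 3 * x**2 * y

def partial_x(f, x, y, h=1e-6):
    # Nudge x only; y is held constant, exactly like the East-West slice.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # Nudge y only; x is held constant.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x, y = 2.0, 5.0
print(partial_x(f, x, y), 6 * x * y)   # ≈ 60.0 vs analytic 60.0
print(partial_y(f, x, y), 3 * x**2)    # ≈ 12.0 vs analytic 12.0
```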
3. The Gradient Vector (∇f)
If we collect all partial derivatives into a vector, we get the Gradient:
∇f(x, y) = [ ∂f / ∂x, ∂f / ∂y ]
Properties of the Gradient
- It is a Vector (has direction and magnitude).
- It points in the Direction of Steepest Ascent (Uphill).
- Its magnitude ‖∇f‖ tells you how steep the slope is.
[!TIP] Gradient Descent: To find the minimum (bottom of the valley), we go in the opposite direction of the gradient: -∇f.
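As a minimal illustration of that tip (not any particular library's optimizer), here is plain gradient descent on the bowl f(x, y) = x² + y² from Section 2; the learning rate and step count are arbitrary choices:

```python
def grad_f(x, y):
    # Analytic gradient of f(x, y) = x^2 + y^2.
    return (2 * x, 2 * y)

x, y = 3.0, -4.0     # start somewhere on the side of the bowl
lr = 0.1             # learning rate (step size)
for _ in range(50):
    gx, gy = grad_f(x, y)
    x -= lr * gx     # step *against* the gradient...
    y -= lr * gy     # ...i.e. downhill
print(x, y)          # both values shrink towards 0, the bottom of the valley
```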
4. Matrix Derivatives
In Deep Learning, we deal with layers of neurons, so we need matrices.
4.1 The Jacobian Matrix (J) - The Slope Map
If we have a function mapping a vector to a vector (f: ℝⁿ → ℝᵐ), the first derivative is an m × n matrix called the Jacobian.
- Shape: (Output Dim) × (Input Dim).
- Usage: Used for Backpropagating errors through a layer. It tells us how every output changes with respect to every input.
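A concrete toy case: for a linear layer f(x) = Wx (no activation), the Jacobian is just W. The sketch below is a generic finite-difference check, not any framework's API; W and the test point are made up for illustration:

```python
import numpy as np

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3): output dim x input dim

def f(x):
    return W @ x                     # f: R^3 -> R^2

def jacobian(f, x, h=1e-6):
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):               # perturb one input at a time
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, -1.0, 2.0])
print(jacobian(f, x))                # recovers W, shape (2, 3)
```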
4.2 The Hessian Matrix (H) - The Curvature Map
If we have a scalar function (L: ℝⁿ → ℝ), the second derivative is an n × n symmetric matrix called the Hessian.
- Hᵢⱼ = ∂²L / ∂wᵢ∂wⱼ
- Shape: (Input Dim) × (Input Dim).
- Usage: Determines Curvature (Bowl vs Saddle).
- Positive Definite H: Valley (Minimum). We want to be here.
- Indefinite H: Saddle Point. We want to escape this.
- Newton’s Method uses the inverse of the Hessian to jump to the minimum, but it’s too expensive (O(n³)) for Deep Learning.
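To make the curvature cases concrete, here is a small sketch using two toy quadratics whose Hessians are known in closed form (a bowl w₁² + w₂² and a saddle w₁² − w₂²). It classifies them by the signs of the eigenvalues and takes one Newton step on the bowl:

```python
import numpy as np

# Constant Hessians of two toy quadratics.
H_bowl   = np.array([[2.0, 0.0],
                     [0.0, 2.0]])    # L = w1^2 + w2^2  (valley)
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])   # L = w1^2 - w2^2  (saddle)

def classify(H):
    eig = np.linalg.eigvalsh(H)      # symmetric => real eigenvalues
    if np.all(eig > 0):
        return "positive definite: minimum (valley)"
    if np.any(eig > 0) and np.any(eig < 0):
        return "indefinite: saddle point"
    return "other (maximum or degenerate)"

print(classify(H_bowl))              # minimum
print(classify(H_saddle))            # saddle point

# One Newton step on the bowl jumps straight to its minimum:
w = np.array([3.0, -4.0])
grad = 2 * w                                # gradient of w1^2 + w2^2
print(w - np.linalg.solve(H_bowl, grad))    # [0. 0.]
```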
5. Interactive Visualizer: The Gradient Compass
The background shows a “Hill” function: z = 4 − (x² + y²).
- Bright Center: Peak (High Z).
- Dark Edges: Valley (Low Z).
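For this hill, the gradient is ∇z = (∂z/∂x, ∂z/∂y) = (−2x, −2y), which always points from the current position back toward the origin, i.e. uphill toward the bright center.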
Interaction:
- Move Mouse: The Red Arrow represents the Gradient ∇f. It always points Uphill (towards the center peak).
- Click: Spawn a “Ball” that rolls Downhill (opposite to the gradient).
- Unlike simple Gradient Descent, these balls have Momentum (Mass). They accelerate down the slope (v += a), oscillate, and eventually settle due to friction; see the sketch right after this list.
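The visualizer's own code isn't shown here, but the same momentum idea can be sketched in a few lines of plain Python on the bowl f(x, y) = x² + y² (the timestep, friction factor, and step count are made-up constants for this sketch): the negative gradient supplies the acceleration, velocity accumulates it, and friction damps the oscillation.

```python
def grad(x, y):
    return (2 * x, 2 * y)        # gradient of the bowl f(x, y) = x^2 + y^2

x, y = 3.0, -4.0                 # where the "ball" is dropped
vx, vy = 0.0, 0.0                # it starts at rest
dt, friction = 0.1, 0.95         # made-up constants for this sketch

for _ in range(500):
    gx, gy = grad(x, y)
    vx = (vx - gx * dt) * friction   # v += a (a = -gradient), then damp
    vy = (vy - gy * dt) * friction
    x += vx * dt                     # the ball overshoots, oscillates,
    y += vy * dt                     # and friction lets it settle
print(x, y)                          # both end up very close to 0
```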
6. Summary
- Partial Derivative: Slope along one axis (Sensitivity to one input).
- Gradient Vector: Combined direction of steepest ascent. We move against it to learn.
- Jacobian Matrix: The derivatives of a vector output (The Slope Map).
- Hessian Matrix: The derivatives of derivatives (The Curvature Map).