The Toolbox: Rules of Calculus
1. Introduction: Shortcuts for Machines
We don’t evaluate the limit definition by hand every time. Instead, we use Differentiation Rules: the shortcuts that allow computers to calculate gradients for massive Neural Networks efficiently.
Imagine trying to find the slope of a mountain by measuring every inch with a ruler. That’s the limit definition. Differentiation rules are like having a satellite map that instantly tells you the slope at any coordinate.
2. Basic Rules
2.1 Power Rule
If f(x) = x^n, then:
f’(x) = n · x^(n-1)
- Example: f(x) = x^2 → f’(x) = 2x
- Example: f(x) = x^3 → f’(x) = 3x^2
- Example: f(x) = 1/x = x^(-1) → f’(x) = -x^(-2) = -1/x^2
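A minimal sketch (not from the original text) that checks the power rule against a finite-difference approximation; the helper `numerical_derivative`, the test exponents, and the point x0 = 1.5 are illustrative choices:

```python
# Check f(x) = x^n  =>  f'(x) = n * x^(n-1) against a central difference quotient.

def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

for n in (2, 3, -1):
    f = lambda x, n=n: x ** n
    x0 = 1.5
    analytic = n * x0 ** (n - 1)          # power rule
    numeric = numerical_derivative(f, x0)  # limit-style approximation
    print(f"n={n:>2}:  power rule {analytic:.6f}  vs  numeric {numeric:.6f}")
```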
2.2 Constant Multiple & Sum Rule
- Constant: d/dx [c · f(x)] = c · f’(x)
- Sum: d/dx [f(x) + g(x)] = f’(x) + g’(x)
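A quick sketch of linearity in action, using f(x) = x^2, g(x) = x^3, c = 5, and x0 = 2.0 (all illustrative values, not from the text):

```python
# d/dx [c*f(x) + g(x)] should equal c*f'(x) + g'(x).

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

c, x0 = 5.0, 2.0
combined = lambda x: c * x**2 + x**3

lhs = numerical_derivative(combined, x0)  # derivative of the combination
rhs = c * (2 * x0) + 3 * x0**2            # combination of the derivatives
print(lhs, rhs)                           # both ~ 32.0
```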
3. Advanced Rules
3.1 Product Rule
If you have two functions multiplied: y = u · v
y’ = u’v + uv’
“Derivative of the first times the second, plus the first times derivative of the second.”
> [!TIP]
> Why? Imagine a rectangle with sides u and v. If both sides grow, the area uv grows by a strip of width u (height dv) and a strip of height v (width du).
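A small sketch verifying the product rule for u(x) = x^2 and v(x) = sin(x); the functions and the point x0 are illustrative choices, not from the text:

```python
# (u*v)' = u'v + u*v', checked against a finite difference.
import math

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.7
u, du = x0**2, 2 * x0                 # u and u'
v, dv = math.sin(x0), math.cos(x0)    # v and v'

product_rule = du * v + u * dv
numeric = numerical_derivative(lambda x: x**2 * math.sin(x), x0)
print(product_rule, numeric)          # should agree to ~6 decimal places
```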
3.2 Quotient Rule
If you have division: y = u / v
y’ = (u’v - uv’) / v^2
“Low d-High minus High d-Low, over Low Low.”
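The same kind of sketch for the quotient rule, with u(x) = x^2 and v(x) = x + 1 as illustrative choices:

```python
# (u/v)' = (u'v - u*v') / v^2, checked against a finite difference.

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.3
u, du = x0**2, 2 * x0     # numerator and its derivative
v, dv = x0 + 1, 1.0       # denominator and its derivative

quotient_rule = (du * v - u * dv) / v**2
numeric = numerical_derivative(lambda x: x**2 / (x + 1), x0)
print(quotient_rule, numeric)
```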
4. The Chain Rule (The Holy Grail)
This is the most important rule for Deep Learning because Neural Networks are just nested functions (layers feeding into layers).
If y = f(g(x)), then:
dy/dx = dy/dg · dg/dx
Or more simply:
Total Derivative = (Outer Derivative) × (Inner Derivative)
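A minimal chain-rule sketch for y = f(g(x)) with f(u) = u^2 and g(x) = 3x + 1 (illustrative functions and point, not from the text):

```python
# dy/dx = f'(g(x)) * g'(x) for y = (3x + 1)^2.

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.4
g, dg = 3 * x0 + 1, 3.0        # inner value and inner derivative
dy_dg = 2 * g                  # outer derivative, evaluated at g(x)

chain_rule = dy_dg * dg        # (outer derivative) * (inner derivative)
numeric = numerical_derivative(lambda x: (3 * x + 1) ** 2, x0)
print(chain_rule, numeric)     # both ~ 13.2
```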
4.1 Example: A Simple Neuron
Consider a simple node in a neural network:
L = (y - σ(w · x))^2
Here, we have a chain of operations:
- Linear: z = w · x
- Activation: a = σ(z)
- Loss: L = (y - a)^2
To find how Loss L changes with weight w (dL/dw), we chain backwards:
dL/dw = (dL/da) · (da/dz) · (dz/dw)
This process of multiplying local derivatives backwards is called Backpropagation.
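Here is a sketch of that backward chain in code. The formula follows the text; the scalar values for x, y, and w are made up for illustration:

```python
# Forward and backward pass for L = (y - sigmoid(w*x))^2.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, w = 2.0, 1.0, 0.5

# Forward pass: z -> a -> L
z = w * x
a = sigmoid(z)
L = (y - a) ** 2

# Backward pass: local derivatives, multiplied right to left
dL_da = -2.0 * (y - a)          # dL/da
da_dz = a * (1.0 - a)           # dσ/dz = σ(z)(1 - σ(z))
dz_dw = x                       # dz/dw

dL_dw = dL_da * da_dz * dz_dw   # chain rule
print(L, dL_dw)
```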
4.2 Computational Graphs & Backpropagation
Deep Learning frameworks represent this as a graph. During the Backward Pass, gradients flow from the loss back toward the inputs (right to left, if the graph is drawn with inputs on the left).
- Node: Represents an operation (e.g., Multiply, Add, Sigmoid).
- Edge: Carries the data flow.
- Gradient Flow: The gradient dL/dOutput comes in from the right. The node computes its local gradient dOutput/dInput. It multiplies them and passes the result dL/dInput to the left.
Rule: Gradient Out = (Gradient In) × (Local Gradient).
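A toy illustration of that rule at a single multiply node. This is my own sketch, not any framework’s API:

```python
# "Gradient Out = (Gradient In) × (Local Gradient)" at one node of the graph.

class MultiplyNode:
    def forward(self, u, v):
        self.u, self.v = u, v      # cache inputs for the backward pass
        return u * v

    def backward(self, grad_in):
        # Local gradients: d(u*v)/du = v, d(u*v)/dv = u
        grad_u = grad_in * self.v
        grad_v = grad_in * self.u
        return grad_u, grad_v

node = MultiplyNode()
out = node.forward(3.0, 4.0)         # forward pass: 12.0
grad_u, grad_v = node.backward(1.0)  # backward pass, starting from dL/dout = 1
print(out, grad_u, grad_v)           # 12.0, 4.0, 3.0
```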
5. Interactive Visualizer: Chain Rule Gears
Imagine 3 gears connected.
- Gear A (Input x) turns.
- Gear B (Hidden u) turns 2× faster than A (du/dx = 2).
- Gear C (Output y) turns 3× faster than B (dy/du = 3).
What is the total ratio dy/dx? It’s 2 × 3 = 6. Adjust the speed slider to see how the motion (sensitivity) amplifies through the chain.
6. Summary
- Power Rule: Differentiating simple powers (n · x^(n-1)).
- Chain Rule: The engine of Backpropagation. It allows us to calculate gradients for deep networks by multiplying local derivatives layer by layer.
- Computational Graphs: The modern way we represent and differentiate these complex functions automatically.