The Toolbox: Rules of Calculus
1. Introduction: Shortcuts for Machines
We don’t evaluate the limit definition by hand every time. Instead, we use Differentiation Rules: the shortcuts that allow computers to calculate gradients for massive Neural Networks efficiently.
Imagine trying to find the slope of a mountain by measuring every inch with a ruler. That’s the limit definition. Differentiation rules are like having a satellite map that instantly tells you the slope at any coordinate.
2. Basic Rules
2.1 Power Rule
If f(x) = x^n, then:
f’(x) = n · x^(n-1)
- Example: f(x) = x^2 → f’(x) = 2x
- Example: f(x) = x^3 → f’(x) = 3x^2
- Example: f(x) = 1/x = x^(-1) → f’(x) = -x^(-2) = -1/x^2
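A minimal sketch (not from the original text) that checks the power rule against a finite-difference approximation; the helper `numerical_derivative`, the test exponents, and the point x0 = 1.5 are illustrative choices:

```python
# Check f(x) = x^n  =>  f'(x) = n * x^(n-1) against a central difference quotient.

def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

for n in (2, 3, -1):
    f = lambda x, n=n: x ** n
    x0 = 1.5
    analytic = n * x0 ** (n - 1)          # power rule
    numeric = numerical_derivative(f, x0)  # limit-style approximation
    print(f"n={n:>2}:  power rule {analytic:.6f}  vs  numeric {numeric:.6f}")
```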
2.2 Constant Multiple & Sum Rule
- Constant: d/dx [c · f(x)] = c · f’(x)
- Sum: d/dx [f(x) + g(x)] = f’(x) + g’(x)
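A quick sketch of linearity in action, using f(x) = x^2, g(x) = x^3, c = 5, and x0 = 2.0 (all illustrative values, not from the text):

```python
# d/dx [c*f(x) + g(x)] should equal c*f'(x) + g'(x).

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

c, x0 = 5.0, 2.0
combined = lambda x: c * x**2 + x**3

lhs = numerical_derivative(combined, x0)  # derivative of the combination
rhs = c * (2 * x0) + 3 * x0**2            # combination of the derivatives
print(lhs, rhs)                           # both ~ 32.0
```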
3. Advanced Rules
3.1 Product Rule
If you have two functions multiplied: y = u · v
y’ = u’v + uv’
“Derivative of the first times the second, plus the first times derivative of the second.”
> [!TIP]
> Why? Imagine a rectangle with sides u and v. If both sides grow, the area uv grows by a strip of width u (height dv) and a strip of height v (width du).
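A small sketch verifying the product rule for u(x) = x^2 and v(x) = sin(x); the functions and the point x0 are illustrative choices, not from the text:

```python
# (u*v)' = u'v + u*v', checked against a finite difference.
import math

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.7
u, du = x0**2, 2 * x0                 # u and u'
v, dv = math.sin(x0), math.cos(x0)    # v and v'

product_rule = du * v + u * dv
numeric = numerical_derivative(lambda x: x**2 * math.sin(x), x0)
print(product_rule, numeric)          # should agree to ~6 decimal places
```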
3.2 Quotient Rule
If you have division: y = u / v
y’ = (u’v - uv’) / v^2
“Low d-High minus High d-Low, over Low Low.”
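The same kind of sketch for the quotient rule, with u(x) = x^2 and v(x) = x + 1 as illustrative choices:

```python
# (u/v)' = (u'v - u*v') / v^2, checked against a finite difference.

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.3
u, du = x0**2, 2 * x0     # numerator and its derivative
v, dv = x0 + 1, 1.0       # denominator and its derivative

quotient_rule = (du * v - u * dv) / v**2
numeric = numerical_derivative(lambda x: x**2 / (x + 1), x0)
print(quotient_rule, numeric)
```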
4. The Chain Rule (The Holy Grail)
This is the most important rule for Deep Learning because Neural Networks are just nested functions (layers feeding into layers).
If y = f(g(x)), then:
dy/dx = dy/dg · dg/dx
Or more simply:
Total Derivative = (Outer Derivative) × (Inner Derivative)
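A minimal chain-rule sketch for y = f(g(x)) with f(u) = u^2 and g(x) = 3x + 1 (illustrative functions and point, not from the text):

```python
# dy/dx = f'(g(x)) * g'(x) for y = (3x + 1)^2.

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.4
g, dg = 3 * x0 + 1, 3.0        # inner value and inner derivative
dy_dg = 2 * g                  # outer derivative, evaluated at g(x)

chain_rule = dy_dg * dg        # (outer derivative) * (inner derivative)
numeric = numerical_derivative(lambda x: (3 * x + 1) ** 2, x0)
print(chain_rule, numeric)     # both ~ 13.2
```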
4.1 Example: A Simple Neuron
Consider a simple node in a neural network:
L = (y - σ(w · x))^2
Here, we have a chain of operations:
- Linear: z = w · x
- Activation: a = σ(z)
- Loss: L = (y - a)^2
To find how Loss L changes with weight w (dL/dw), we chain backwards:
dL/dw = (dL/da) · (da/dz) · (dz/dw)
This process of multiplying local derivatives backwards is called Backpropagation.
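Here is a sketch of that backward chain in code. The formula follows the text; the scalar values for x, y, and w are made up for illustration:

```python
# Forward and backward pass for L = (y - sigmoid(w*x))^2.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, w = 2.0, 1.0, 0.5

# Forward pass: z -> a -> L
z = w * x
a = sigmoid(z)
L = (y - a) ** 2

# Backward pass: local derivatives, multiplied right to left
dL_da = -2.0 * (y - a)          # dL/da
da_dz = a * (1.0 - a)           # dσ/dz = σ(z)(1 - σ(z))
dz_dw = x                       # dz/dw

dL_dw = dL_da * da_dz * dz_dw   # chain rule
print(L, dL_dw)
```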
4.2 Computational Graphs & Backpropagation
Deep Learning frameworks represent this as a graph. During the Backward Pass, gradients flow from the loss back toward the inputs (right to left, if the graph is drawn with inputs on the left).
- Node: Represents an operation (e.g., Multiply, Add, Sigmoid).
- Edge: Carries the data flow.
- Gradient Flow: The gradient dL/dOutput comes in from the right. The node computes its local gradient dOutput/dInput. It multiplies them and passes the result dL/dInput to the left.
Rule: Gradient Out = (Gradient In) × (Local Gradient).
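A toy illustration of that rule at a single multiply node. This is my own sketch, not any framework’s API:

```python
# "Gradient Out = (Gradient In) × (Local Gradient)" at one node of the graph.

class MultiplyNode:
    def forward(self, u, v):
        self.u, self.v = u, v      # cache inputs for the backward pass
        return u * v

    def backward(self, grad_in):
        # Local gradients: d(u*v)/du = v, d(u*v)/dv = u
        grad_u = grad_in * self.v
        grad_v = grad_in * self.u
        return grad_u, grad_v

node = MultiplyNode()
out = node.forward(3.0, 4.0)         # forward pass: 12.0
grad_u, grad_v = node.backward(1.0)  # backward pass, starting from dL/dout = 1
print(out, grad_u, grad_v)           # 12.0, 4.0, 3.0
```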
5. Interactive Visualizer: Chain Rule Gears
Imagine 3 gears connected.
- Gear A (Input x) turns.
- Gear B (Hidden u) turns 2× faster than A (du/dx = 2).
- Gear C (Output y) turns 3× faster than B (dy/du = 3).
What is the total ratio dy/dx? It’s 2 × 3 = 6. Adjust the speed slider to see how the motion (sensitivity) amplifies through the chain.
6. Summary
- Power Rule: Differentiating simple powers (n · x^(n-1)).
- Chain Rule: The engine of Backpropagation. It allows us to calculate gradients for deep networks by multiplying local derivatives layer by layer.
- Computational Graphs: The modern way we represent and differentiate these complex functions automatically.