Approximation: The Taylor Series

1. Introduction: Complexity from Simplicity

How does a calculator compute sin(37°)? It doesn’t draw a triangle. It uses Polynomials.

Calculus tells us that any smooth function (like eˣ, sin(x), or a Neural Network Loss Function) can be approximated locally by a sum of simpler terms:

  1. A Constant (Start point).
  2. A Line (Slope).
  3. A Parabola (Curve).
  4. …and so on.

This is the basis of Gradient Descent (First-Order Approximation) and Newton’s Method (Second-Order Approximation).
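
To make this concrete, here is a small Python sketch (an added illustration, not part of the original text): near x = 0, eˣ is tracked by a constant, then a line, then a parabola. The coefficients 1, 1, 1/2 are exactly what the formula in the next section produces.

```python
import math

# Illustrative sketch: near x = 0, e^x is approximated by a constant,
# then a line, then a parabola (coefficients from the Taylor formula).
x = 0.1
approximations = {
    "constant: 1":             1.0,
    "line:     1 + x":         1.0 + x,
    "parabola: 1 + x + x^2/2": 1.0 + x + x**2 / 2,
}
truth = math.exp(x)
for name, value in approximations.items():
    print(f"{name:<25}  error = {abs(truth - value):.6f}")
```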


2. The Formula

The Taylor Series of f(x) centered at a is:

f(x) ≈ f(a) + f'(a)(x-a) + f''(a)/2! (x-a)² + f'''(a)/3! (x-a)³ + …

Why “Factorials”?

Imagine we want to approximate f(x) with a polynomial P(x) = c₀ + c₁x + c₂x² + c₃x³. We want the derivatives of P(x) to match those of f(x) at x=0.

  1. P'(x) = c₁ + 2c₂x + 3c₃x². At x=0, P'(0) = c₁. So c₁ = f'(0).
  2. P''(x) = 2c₂ + 3×2c₃x. At x=0, P''(0) = 2c₂. So c₂ = f''(0)/2.
  3. P'''(x) = 3×2c₃. At x=0, P'''(0) = 3×2c₃ = 6c₃. So c₃ = f'''(0)/6 = f'''(0)/3!.

The factorials (n!) appear because differentiating xⁿ repeatedly brings down n, n-1, n-2, … as multipliers. We divide by them to normalize.
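
A minimal numerical check of this derivation (maclaurin_poly is just an illustrative name, not an established API): supply the derivatives of f at 0, divide by n!, and compare the resulting polynomial against the true function. For f(x) = eˣ, every derivative at 0 equals 1.

```python
import math

def maclaurin_poly(derivs_at_zero, x):
    """Evaluate P(x) = sum over n of f^(n)(0) / n! * x^n from a list of derivatives at 0."""
    return sum(d / math.factorial(n) * x**n
               for n, d in enumerate(derivs_at_zero))

# f(x) = e^x: f(0) = f'(0) = f''(0) = f'''(0) = 1, so P(x) = 1 + x + x^2/2! + x^3/3!.
x = 0.5
print(maclaurin_poly([1, 1, 1, 1], x))  # ~1.6458 (cubic approximation)
print(math.exp(x))                      # ~1.6487 (true value)
```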


3. Example: Approximating sin(x) at a=0

At x=0, sin(0)=0, cos(0)=1.

  • f(x) = sin(x) → f(0) = 0
  • f'(x) = cos(x) → f'(0) = 1
  • f''(x) = -sin(x) → f''(0) = 0
  • f'''(x) = -cos(x) → f'''(0) = -1

Putting it together (Maclaurin Series):
sin(x) ≈ x - x³/3! + x⁵/5! - x⁷/7! + …
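
As a sanity check, a few terms of this series already reproduce the sin(37°) from the introduction. The snippet below is an added sketch; sin_taylor is just an illustrative name.

```python
import math

def sin_taylor(x, terms=4):
    """Partial sum of the Maclaurin series x - x^3/3! + x^5/5! - x^7/7! + ..."""
    return sum((-1)**k * x**(2*k + 1) / math.factorial(2*k + 1)
               for k in range(terms))

x = math.radians(37)           # the series works in radians
print(sin_taylor(x, terms=4))  # ~0.601815
print(math.sin(x))             # ~0.601815
```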


4. Connection to Optimization

In Machine Learning, we want to find the minimum of a Loss Function L(w).

4.1 First-Order: Gradient Descent

We use the first two terms (Line Approximation):
L(w+h) ≈ L(w) + L'(w)h
We step downhill along this line.
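
A minimal 1-D sketch of this update rule (the toy loss, learning rate, and iteration count below are illustrative choices, not from the text):

```python
# Toy loss L(w) = (w - 3)^2 + 1 and its first derivative.
def L(w):       return (w - 3) ** 2 + 1
def L_prime(w): return 2 * (w - 3)

w, lr = 0.0, 0.1          # start away from the minimum, small learning rate
for _ in range(50):
    w -= lr * L_prime(w)  # step downhill along the line approximation
print(w, L(w))            # w approaches the minimizer 3; L(w) approaches 1
```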

4.2 Second-Order: Newton’s Method

We use the first three terms (Parabola Approximation):
L(w+h) ≈ L(w) + L'(w)h + 0.5 L''(w)h²
We jump directly to the bottom of this parabola.
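
Minimizing that parabola over h gives the Newton step h = -L'(w)/L''(w). On the same toy quadratic loss used in the sketch above (which is an exact parabola), a single step lands on the minimum:

```python
# Same toy loss L(w) = (w - 3)^2 + 1; its second derivative is the constant 2.
def L_prime(w):  return 2 * (w - 3)
def L_second(w): return 2.0

w = 0.0
w += -L_prime(w) / L_second(w)  # Newton step: jump to the bottom of the parabola
print(w)                        # 3.0 -- the exact minimizer, reached in one step
```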

[!TIP] Trade-off: Newton’s Method converges in fewer steps, but each step is far more expensive: the second derivative of a model with millions of parameters is a Hessian Matrix with millions × millions of entries, which is infeasible to compute, store, or invert. That’s why we stick to Gradient Descent.


5. Interactive Visualizer: Polynomial Builder

See how adding more terms (N) makes the polynomial P_N(x) hug the target curve tightly near the center (x=0), but diverge far away.

  • N=1: Line (Linear Model). Good only near center.
  • N=3: Cubic curve.
  • Show Error: Visualize the difference (Residuals) between truth and approximation.

The plot overlays the target f(x) and the approximation P_N(x). Notice how the approximation "explodes" far from the center (x=0). This is why local approximations only work locally!
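
A rough numerical stand-in for the "Show Error" view (illustrative only), reusing the sin(x) approximations from Section 3: the N=1 and N=3 polynomials are nearly exact close to x=0 and blow up as x grows.

```python
import math

def P1(x): return x              # N=1: line
def P3(x): return x - x**3 / 6   # N=3: cubic

for x in (0.1, 0.5, 1.0, 2.0, 4.0):
    truth = math.sin(x)
    print(f"x={x:4.1f}   |error N=1|={abs(truth - P1(x)):8.4f}   "
          f"|error N=3|={abs(truth - P3(x)):8.4f}")
```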

6. Summary

  • Taylor Series: Converting hard functions to easy polynomials.
  • Gradient Descent: A Linear approximation (Taylor N=1). We step down the slope.
  • Newton’s Method: A Quadratic approximation (Taylor N=2). We jump to the minimum of the parabola.
  • Trust Region: We must restrict our steps to the region where the Taylor approximation is valid.

Next: Gradient Descent Case Study