Module Review: Advanced Optimization

Note

Training a neural network is like navigating a rugged, alien mountain range in pitch darkness, where you know only the slope of the ground beneath your feet. This module review consolidates how we define this landscape mathematically and how advanced optimizers (like Adam) descend toward a low valley (low loss, ideally the global minimum) without stalling on flat plateaus.

1. Key Takeaways

  • Loss Landscape: The geometry of the loss function determines training difficulty. Convex functions are easy (bowl-shaped, so any local minimum is the global one); neural network losses are non-convex (rugged), and in high dimensions the main obstacle is Saddle Points rather than local minima.
  • Optimizers:
      • SGD: The baseline. Struggles in ravines and gets stuck on plateaus.
      • Momentum: Adds “velocity” to the optimizer, allowing it to plow through flat regions and dampen oscillations.
      • Adam: The gold standard. Combines Momentum (First Moment) and RMSProp (Second Moment) to adapt learning rates for each parameter.
  • Constrained Optimization: To optimize an objective \(f(x)\) subject to a constraint \(g(x)=0\), we use Lagrange Multipliers (\(\nabla f = \lambda \nabla g\)), finding points where the objective and constraint gradients are parallel.
  • AutoDiff: Modern frameworks use Reverse Mode AutoDiff (Backpropagation), which computes the gradient of a scalar loss with respect to millions of inputs (parameters) in a single backward pass.
  • Backpropagation: The Chain Rule applied to the computational graph. Deep networks with Sigmoid activations suffer from Vanishing Gradients because each sigmoid derivative is at most \(0.25\), so the product of many such factors shrinks exponentially toward zero.

2. Interactive Flashcards

What is a Saddle Point?

A point where the gradient is zero ($$\nabla L = 0$$), but which is a minimum along some directions and a maximum along others. It is the main obstacle in high-dimensional optimization, because the gradient becomes tiny near it and progress stalls.

Analogy: Think of a Pringles potato chip—flat in the middle, curving up on the sides, and curving down on the front and back.
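
To make the definition concrete, here is a minimal sketch using the hypothetical surface f(x, y) = x² − y² (chosen only for illustration; it is not from this module). The gradient vanishes at the origin, which is a minimum along x and a maximum along y. Plain gradient descent started exactly on the x-axis walks straight into the saddle, while a tiny offset in y eventually escapes.

```python
def grad(x, y):
    """Gradient of the illustrative surface f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y


def descend(x, y, lr=0.1, steps=50):
    """Run plain gradient descent and return the final point."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Started exactly on the x-axis: y never changes, x shrinks toward the saddle.
on_axis = descend(1.0, 0.0)

# Started a hair off the axis: the y-coordinate grows by a factor 1.2 per step.
off_axis = descend(1.0, 1e-6)

print("on axis :", on_axis)
print("off axis:", off_axis)
```

The second run shows why saddles slow training rather than stop it: escape is possible, but it takes many steps while the gradient is tiny.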

Why use Reverse Mode AutoDiff for ML?

Because ML models have millions of inputs (parameters) but only one output (Loss). Reverse mode computes all gradients in a single backward pass, whereas Forward mode would require millions of passes.

Analogy: Forward mode is like asking "How does changing this one parameter affect the loss?" millions of times. Reverse mode asks "How does the loss depend on all parameters?" just once.
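
The single backward pass can be sketched with a toy scalar autodiff. The `Var` class and `backward` function below are hypothetical helpers written for this review, not the API of any framework; they support only addition and multiplication.

```python
class Var:
    """Scalar node in a computational graph (illustrative helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule once.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


w1, w2, w3 = Var(2.0), Var(3.0), Var(4.0)
loss = w1 * w2 + w2 * w3
backward(loss)
print(w1.grad, w2.grad, w3.grad)
```

One call to `backward` fills in the gradient for all three parameters. Forward mode would need one pass per parameter to get the same information.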

What does Adam do?

It combines Momentum (First Moment) and RMSProp (Second Moment) to adapt learning rates individually for each parameter.

Analogy: Like a heavy ball (Momentum) rolling down a hill with friction that adapts to how bumpy each direction is (RMSProp).
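
As a rough sketch of how the two moments combine, the function below performs one Adam update in plain Python. It includes the bias-correction terms from the standard Adam formulation, which this review does not spell out. The loss L = 0.5 · (100 · w0² + w1²) and the hyperparameter values are assumptions chosen for illustration.

```python
import math


def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter list w with gradients g at step t (1-based)."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # first moment (Momentum)
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second moment (RMSProp)
        m_hat = mi / (1 - beta1 ** t)             # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


def grad(w):
    """Gradient of the illustrative loss 0.5 * (100 * w0**2 + w1**2)."""
    return [100.0 * w[0], w[1]]


w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = adam_step(w, grad(w), m, v, t=1)
first_step = list(w)
print(first_step)
```

The two gradients differ by a factor of 100, yet both parameters move by roughly `lr` on the first step. That is the per-parameter adaptation at work.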

What causes Vanishing Gradients?

Multiplying many small derivatives (e.g., the Sigmoid derivative is at most $$0.25$$) during backpropagation, causing gradients at early layers to shrink exponentially toward zero.

Analogy: Like playing a game of Telephone across 50 people where the message gets quieter and quieter until the first person hears nothing at all.
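
The arithmetic behind this is easy to check. The sketch below multiplies the sigmoid derivative across layers in the best case, where every pre-activation is 0 and the derivative is at its maximum. It deliberately ignores the weight matrices, which is a simplifying assumption; `chain_factor` is a hypothetical helper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def chain_factor(depth, z=0.0):
    """Product of `depth` sigmoid derivatives, all evaluated at z."""
    factor = 1.0
    for _ in range(depth):
        factor *= sigmoid_grad(z)
    return factor


for depth in (1, 5, 10, 50):
    print(depth, chain_factor(depth))
```

Even in this best case the factor for 50 layers is 0.25 to the 50th power, far too small to drive a useful update at the first layer.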

What is Jensen's Inequality?

For a convex function $$f$$, the function of the average is less than or equal to the average of the function values: $$f(\mathbb{E}[x]) \le \mathbb{E}[f(x)]$$.

Analogy: If you walk in a straight line between two points on opposite slopes of a valley, your path never dips below the valley floor beneath you.
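
A quick numeric check, assuming the convex function f(x) = x² and four equally likely samples (both chosen only for illustration):

```python
def mean(values):
    return sum(values) / len(values)


def f(x):
    """A convex function used for the check."""
    return x * x


samples = [1.0, 2.0, 3.0, 4.0]

f_of_mean = f(mean(samples))                # f(E[x])
mean_of_f = mean([f(x) for x in samples])   # E[f(x)]

print(f_of_mean, mean_of_f)
```

Here f(E[x]) = 6.25 and E[f(x)] = 7.5, so the inequality holds with room to spare.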

What is the Tangency Condition?

In constrained optimization, the optimal point occurs where the constraint boundary is tangent to one of the objective's contour lines, which means the two gradients are parallel ($$\nabla f = \lambda \nabla g$$).

Analogy: Walking along a fence (constraint) on a hill until you reach the highest point you can without crossing it. At that point, the fence runs tangent to the hill's contour line.
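
A small worked example, under assumptions made up for illustration: maximize f(x, y) = x · y subject to g(x, y) = x + y − 10 = 0. The sketch scans along the constraint, finds the best point by brute force, and then checks that the gradients there satisfy ∇f = λ∇g.

```python
def f(x, y):
    return x * y


def grad_f(x, y):
    return (y, x)


def grad_g(x, y):
    # g(x, y) = x + y - 10, so the gradient is constant.
    return (1.0, 1.0)


# Walk along the constraint x + y = 10 in steps of 0.01.
candidates = [(i / 100, 10 - i / 100) for i in range(1001)]
best = max(candidates, key=lambda p: f(*p))

gf, gg = grad_f(*best), grad_g(*best)
lam = gf[0] / gg[0]  # multiplier implied by the first component

print("best point:", best, "lambda:", lam)
```

The search lands on x = y = 5 with λ = 5, which is the same answer obtained by solving y = λ, x = λ, x + y = 10 by hand.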


3. Cheat Sheet: Optimizers

| Optimizer | Formula (Simplified) | Pros | Cons |
|---|---|---|---|
| SGD | \(w = w - \eta \nabla L\) | Simple, low memory | Slow, stalls near saddle points, oscillates in ravines |
| Momentum | \(v = \beta v + (1-\beta)\nabla L\); \(w = w - \eta v\) | Fast in ravines, dampens oscillation | Introduces new hyperparameter \(\beta\) |
| RMSProp | \(v = \beta v + (1-\beta)(\nabla L)^2\); \(w = w - \eta \frac{\nabla L}{\sqrt{v}}\) | Adaptive learning rate per parameter | No momentum, can get stuck in local minima |
| Adam | Momentum + RMSProp | Fast, robust, de facto standard | Can generalize slightly worse than SGD on some problems |
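
The first three update rules in the cheat sheet can be sketched in a few lines of plain Python. The ravine-shaped loss L = 0.5 · (w0² + 100 · w1²), the learning rate, and the small `eps` added to the RMSProp denominator for numerical safety are all assumptions for illustration, not part of the cheat sheet.

```python
import math


def loss(w):
    """Illustrative ravine: steep along w[1], shallow along w[0]."""
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def grad(w):
    return [w[0], 100.0 * w[1]]


def sgd_step(w, g, state, lr):
    # w = w - eta * grad; SGD keeps no state, so it is passed through.
    return [wi - lr * gi for wi, gi in zip(w, g)], state


def momentum_step(w, g, v, lr, beta=0.9):
    # v = beta * v + (1 - beta) * grad;  w = w - eta * v
    v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
    return [wi - lr * vi for wi, vi in zip(w, v)], v


def rmsprop_step(w, g, v, lr, beta=0.9, eps=1e-8):
    # v = beta * v + (1 - beta) * grad**2;  w = w - eta * grad / sqrt(v)
    v = [beta * vi + (1 - beta) * gi * gi for vi, gi in zip(v, g)]
    new_w = [wi - lr * gi / (math.sqrt(vi) + eps)
             for wi, gi, vi in zip(w, g, v)]
    return new_w, v


def run(step, steps=100, lr=0.015):
    """Apply one update rule repeatedly from w = (1, 1); return the final loss."""
    w, state = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        w, state = step(w, grad(w), state, lr)
    return loss(w)


initial_loss = loss([1.0, 1.0])
final_losses = {
    "SGD": run(sgd_step),
    "Momentum": run(momentum_step),
    "RMSProp": run(rmsprop_step),
}
for name, value in final_losses.items():
    print(f"{name:9s} {initial_loss:.2f} -> {value:.6f}")
```

All three rules share the same call signature here so they can be swapped into the same loop. Real optimizer libraries organize state differently.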

4. Next Steps

Now that you understand how to train networks, let’s explore the advanced linear algebra that powers them.

Math ML Glossary