Accelerating Descent: Momentum & Adam
1. Introduction: SGD is Slow
Stochastic Gradient Descent (SGD) has problems:
- Zig-Zagging: In ravines (surfaces much steeper in one direction than another), it bounces back and forth between the steep walls instead of moving along the valley floor (see the sketch below).
- Saddle Points: It crawls across saddle points and flat plateaus, where the gradient is near zero.
We need optimizers that add “velocity” (momentum) and “intelligence” (per-parameter learning rates).
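A tiny NumPy sketch of the zig-zag effect (the quadratic “ravine” and the step size are made-up illustrations, not taken from any particular model): with the step size pushed near the stability limit of the steep direction, the iterate bounces across the valley while barely progressing along it.

```python
import numpy as np

# Toy "ravine": much steeper along w[1] than along w[0].
grad_f = lambda w: np.array([w[0], 50.0 * w[1]])

w = np.array([2.0, 1.0])
lr = 0.039  # near the stability limit of the steep direction (2 / 50)
for t in range(8):
    w = w - lr * grad_f(w)
    # w[1] flips sign every step (bouncing between the walls),
    # while w[0] creeps toward 0 very slowly.
    print(t, w)
```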
2. Momentum
Imagine a heavy ball rolling down a hill.
- Physics: It gains speed. If it hits a small bump, its momentum carries it over.
- Math: \(v_{t+1} = \beta v_t + (1-\beta)\nabla L(w_t)\) and \(w_{t+1} = w_t - \alpha v_{t+1}\). The velocity \(v\) is an exponential moving average of past gradients, which smooths out the path.
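A minimal NumPy sketch of this update (the function name, toy loss, and hyperparameters are illustrative, not from a specific framework), run on the same ravine as above:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.02, beta=0.9):
    """One momentum update: v is an exponential moving average of gradients."""
    v = beta * v + (1 - beta) * grad   # accumulate past gradients
    w = w - lr * v                     # step along the smoothed direction
    return w, v

# Same toy ravine as above: steep along w[1], shallow along w[0].
grad_f = lambda w: np.array([w[0], 50.0 * w[1]])

w = np.array([2.0, 1.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = momentum_step(w, v, grad_f(w))
print(w)  # the oscillation along w[1] is damped and w approaches [0, 0]
```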
3. RMSProp & Adam (Adaptive Learning Rates)
Different parameters need different learning rates.
- Sparse features: Parameters for rare features (e.g., rare words in NLP) receive few gradient signals, so each update should be larger.
- Adam (Adaptive Moment Estimation): Combines Momentum (a running first-moment estimate of the gradient, i.e. velocity) with RMSProp-style scaling (dividing the step by the square root of a running average of squared gradients).
It is the de facto default optimizer in most deep learning work today.
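A plain-NumPy sketch of the Adam update rule from Kingma & Ba's paper, using the commonly cited defaults (lr = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the toy loss and training loop are illustrative:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment m) + RMSProp-style scaling (second moment v)."""
    m = beta1 * m + (1 - beta1) * grad           # EMA of gradients (the "velocity")
    v = beta2 * v + (1 - beta2) * grad ** 2      # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter effective step size
    return w, m, v

# Same toy ravine as above.
grad_f = lambda w: np.array([w[0], 50.0 * w[1]])

w = np.array([2.0, 1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    # A larger-than-default lr suits this tiny problem.
    w, m, v = adam_step(w, m, v, grad_f(w), t, lr=0.05)
print(w)  # both coordinates head to ~0 at a similar rate despite the 50x curvature gap
```

Because the step is divided by the square root of the second-moment estimate, each parameter gets an effective step size that is roughly independent of how steep its direction is, which is exactly the per-parameter adaptivity motivated above.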
4. Interactive Visualizer: The Great Race
Watch three balls race to the center (the minimum); a rough text-only stand-in for the race is sketched at the end of this section.
- Red (SGD): Slow, gets confused by the noise.
- Blue (Momentum): Builds speed, overshoots slightly but corrects.
- Green (Adam): Fast and precise.
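If you cannot run the interactive demo, the sketch below is a rough stand-in: it runs the three update rules side by side on the same toy ravine with a bit of shared gradient noise and prints each optimizer's distance to the minimum. The surface, noise level, and hyperparameters are illustrative choices, so the exact ranking depends on how you tune them.

```python
import numpy as np

def race(steps=300, lr=0.02, adam_lr=0.05, noise=0.5, seed=0):
    """Run SGD, Momentum, and Adam on the same noisy toy ravine and print their progress."""
    rng = np.random.default_rng(seed)
    grad_f = lambda w: np.array([w[0], 50.0 * w[1]])
    start = np.array([2.0, 1.0])

    w_sgd = start.copy()
    w_mom, v_mom = start.copy(), np.zeros(2)
    w_adam, m, v = start.copy(), np.zeros(2), np.zeros(2)

    for t in range(1, steps + 1):
        eps = rng.normal(0.0, noise, size=2)          # shared "minibatch" noise
        # SGD: step straight down the noisy gradient.
        w_sgd -= lr * (grad_f(w_sgd) + eps)
        # Momentum: an EMA of noisy gradients smooths the path.
        v_mom = 0.9 * v_mom + 0.1 * (grad_f(w_mom) + eps)
        w_mom -= lr * v_mom
        # Adam: momentum plus per-parameter scaling, with bias correction.
        g = grad_f(w_adam) + eps
        m = 0.9 * m + 0.1 * g
        v = 0.999 * v + 0.001 * g ** 2
        w_adam -= adam_lr * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)

        if t % 50 == 0:
            print(f"step {t:3d}  SGD {np.linalg.norm(w_sgd):.3f}  "
                  f"Momentum {np.linalg.norm(w_mom):.3f}  Adam {np.linalg.norm(w_adam):.3f}")

race()
```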
5. Summary
- SGD: Baseline, can be slow.
- Momentum: Adds velocity to plow through noise and valleys.
- Adam: Adapts the step size per parameter, taking larger effective steps where gradients are small (flat regions) and smaller steps where they are large (steep cliffs).