Accelerating Descent: Momentum & Adam
1. Introduction: SGD is Slow
Stochastic Gradient Descent (SGD) has problems:
- Zig-Zagging: In ravines (surfaces much steeper in one direction than another), it bounces back and forth between the steep walls instead of moving along the valley floor (see the sketch below).
- Saddle Points: It crawls across saddle points and flat plateaus, where the gradient is near zero.
We need optimizers that add “velocity” (momentum) and “intelligence” (per-parameter learning rates).
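A tiny NumPy sketch of the zig-zag effect (the quadratic “ravine” and the step size are made-up illustrations, not taken from any particular model): with the step size pushed near the stability limit of the steep direction, the iterate bounces across the valley while barely progressing along it.

```python
import numpy as np

# Toy "ravine": much steeper along w[1] than along w[0].
grad_f = lambda w: np.array([w[0], 50.0 * w[1]])

w = np.array([2.0, 1.0])
lr = 0.039  # near the stability limit of the steep direction (2 / 50)
for t in range(8):
    w = w - lr * grad_f(w)
    # w[1] flips sign every step (bouncing between the walls),
    # while w[0] creeps toward 0 very slowly.
    print(t, w)
```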
2. Momentum
Imagine a heavy ball rolling down a hill.
- Physics: It gains speed. If it hits a small bump, its momentum carries it over.
- Math: \(v_{t+1} = \beta v_t + (1-\beta)\nabla L(w_t)\) and \(w_{t+1} = w_t - \alpha v_{t+1}\). The velocity \(v\) is an exponential moving average of past gradients, which smooths out the path.
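A minimal NumPy sketch of this update (the function name, toy loss, and hyperparameters are illustrative, not from a specific framework), run on the same ravine as above:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.02, beta=0.9):
    """One momentum update: v is an exponential moving average of gradients."""
    v = beta * v + (1 - beta) * grad   # accumulate past gradients
    w = w - lr * v                     # step along the smoothed direction
    return w, v

# Same toy ravine as above: steep along w[1], shallow along w[0].
grad_f = lambda w: np.array([w[0], 50.0 * w[1]])

w = np.array([2.0, 1.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = momentum_step(w, v, grad_f(w))
print(w)  # the oscillation along w[1] is damped and w approaches [0, 0]
```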
3. RMSProp & Adam (Adaptive Learning Rates)
Different parameters need different learning rates.
- Sparse features: Parameters for rare features (e.g., rare words in NLP) receive few gradient signals, so each update should be larger.
- Adam (Adaptive Moment Estimation): Combines Momentum (a running first-moment estimate of the gradient, i.e. velocity) with RMSProp-style scaling (dividing the step by the square root of a running average of squared gradients).
It is the de facto default optimizer in most deep learning work today.
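A plain-NumPy sketch of the Adam update rule from Kingma & Ba's paper, using the commonly cited defaults (lr = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the toy loss and training loop are illustrative:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment m) + RMSProp-style scaling (second moment v)."""
    m = beta1 * m + (1 - beta1) * grad           # EMA of gradients (the "velocity")
    v = beta2 * v + (1 - beta2) * grad ** 2      # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter effective step size
    return w, m, v

# Same toy ravine as above.
grad_f = lambda w: np.array([w[0], 50.0 * w[1]])

w = np.array([2.0, 1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    # A larger-than-default lr suits this tiny problem.
    w, m, v = adam_step(w, m, v, grad_f(w), t, lr=0.05)
print(w)  # both coordinates head to ~0 at a similar rate despite the 50x curvature gap
```

Because the step is divided by the square root of the second-moment estimate, each parameter gets an effective step size that is roughly independent of how steep its direction is, which is exactly the per-parameter adaptivity motivated above.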
4. Interactive Visualizer: The Great Race
Watch three balls race to the center (the minimum); a rough text-only stand-in for the race is sketched at the end of this section.
- Red (SGD): Slow, gets confused by the noise.
- Blue (Momentum): Builds speed, overshoots slightly but corrects.
- Green (Adam): Fast and precise.
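If you cannot run the interactive demo, the sketch below is a rough stand-in: it runs the three update rules side by side on the same toy ravine with a bit of shared gradient noise and prints each optimizer's distance to the minimum. The surface, noise level, and hyperparameters are illustrative choices, so the exact ranking depends on how you tune them.

```python
import numpy as np

def race(steps=300, lr=0.02, adam_lr=0.05, noise=0.5, seed=0):
    """Run SGD, Momentum, and Adam on the same noisy toy ravine and print their progress."""
    rng = np.random.default_rng(seed)
    grad_f = lambda w: np.array([w[0], 50.0 * w[1]])
    start = np.array([2.0, 1.0])

    w_sgd = start.copy()
    w_mom, v_mom = start.copy(), np.zeros(2)
    w_adam, m, v = start.copy(), np.zeros(2), np.zeros(2)

    for t in range(1, steps + 1):
        eps = rng.normal(0.0, noise, size=2)          # shared "minibatch" noise
        # SGD: step straight down the noisy gradient.
        w_sgd -= lr * (grad_f(w_sgd) + eps)
        # Momentum: an EMA of noisy gradients smooths the path.
        v_mom = 0.9 * v_mom + 0.1 * (grad_f(w_mom) + eps)
        w_mom -= lr * v_mom
        # Adam: momentum plus per-parameter scaling, with bias correction.
        g = grad_f(w_adam) + eps
        m = 0.9 * m + 0.1 * g
        v = 0.999 * v + 0.001 * g ** 2
        w_adam -= adam_lr * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)

        if t % 50 == 0:
            print(f"step {t:3d}  SGD {np.linalg.norm(w_sgd):.3f}  "
                  f"Momentum {np.linalg.norm(w_mom):.3f}  Adam {np.linalg.norm(w_adam):.3f}")

race()
```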
5. Summary
- SGD: Baseline, can be slow.
- Momentum: Adds velocity to plow through noise and valleys.
- Adam: Adapts the step size per parameter, taking larger effective steps where gradients are small (flat regions) and smaller steps where they are large (steep cliffs).