DL App: Neural Network Layers
1. Introduction: The Building Block
A Neural Network is just a chain of Linear Algebra operations interspersed with non-linear functions. The core component is the Dense Layer (or Fully Connected Layer).
Mathematically, a layer transforms an input vector x into an output vector y = σ(Wx + b), where:
- x: Input Vector (Shape: N × 1).
- W: Weight Matrix (Shape: M × N). This rotates and stretches the input space.
- b: Bias Vector (Shape: M × 1). This shifts the space (translation).
- σ: Activation Function (e.g., ReLU). This bends or folds the space.
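To make the shapes concrete, here is a minimal NumPy sketch of one dense layer's forward pass. The function name dense_forward and the example sizes (N = 3 inputs, M = 2 neurons) are ours, not part of the app.

```python
import numpy as np

def dense_forward(x, W, b, activation):
    """Compute y = activation(W @ x + b) for one dense layer."""
    z = W @ x + b          # linear step: rotate/stretch, then shift
    return activation(z)   # non-linear step: bend/fold the space

# Example with N = 3 inputs and M = 2 neurons (shapes match the list above).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))   # input vector, shape (N, 1)
W = rng.normal(size=(2, 3))   # weight matrix, shape (M, N)
b = rng.normal(size=(2, 1))   # bias vector, shape (M, 1)

relu = lambda z: np.maximum(z, 0.0)
y = dense_forward(x, W, b, relu)
print(y.shape)  # (2, 1) -> the output lives in M-dimensional space
```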
The Manifold Hypothesis
Why does this work? Real-world data (like images of cats) lies on a low-dimensional “manifold” (a crumpled sheet) inside a high-dimensional space. The goal of the neural network is to uncrumple this sheet so that the classes (cats vs. dogs) can be separated by a simple linear boundary.
2. The Activation Function (Non-Linear)
Without σ, a deep network would collapse into a single linear transformation, since W₂(W₁x) = (W₂W₁)x, i.e., one matrix W_new = W₂W₁. The activation function introduces the non-linearity that prevents this collapse.
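A quick numerical check of this collapse, as a small NumPy sketch (the shapes are arbitrary examples): two stacked linear layers give exactly the same output as a single layer with weights W₂W₁.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # first "layer"
W2 = rng.normal(size=(2, 4))   # second "layer"
x  = rng.normal(size=(3, 1))

two_layers = W2 @ (W1 @ x)     # "deep" network without activations
one_layer  = (W2 @ W1) @ x     # equivalent single linear map W_new = W2 @ W1

print(np.allclose(two_layers, one_layer))  # True
```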
A. ReLU (Rectified Linear Unit)
- Effect: Folds the space along the axes. Any negative coordinate is clipped to zero, so points with all-negative coordinates collapse onto the origin.
- Pros: Cheap to compute; mitigates the Vanishing Gradient problem.
- Cons: “Dead ReLU” (if a neuron's pre-activation is always negative, its gradient is zero and it stops learning).
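A minimal sketch of ReLU and its gradient (helper names are ours), showing why a neuron stuck in the negative region stops learning:

```python
import numpy as np

def relu(z):
    """max(0, z): negative coordinates are folded onto zero."""
    return np.maximum(z, 0.0)

def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, 0 elsewhere."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.] -> zero gradient for negative inputs ("Dead ReLU")
```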
B. Leaky ReLU
- Effect: Similar to ReLU, but negative values keep a small non-zero slope (the “leak”) instead of being zeroed out.
- Pros: Fixes the “Dead ReLU” problem.
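For comparison, a Leaky ReLU sketch; α = 0.01 is used only as an example leak value:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative values keep a small slope alpha instead of being zeroed."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha (not 0) for negative ones."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(z))       # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_grad(z))  # [0.01  0.01  1.    1.  ] -> never exactly zero, so the neuron can recover
```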
C. Sigmoid / Tanh
- Effect: Squashes space into a bounded range [0, 1] or [-1, 1].
- Pros: Smooth, probability-like.
- Cons: Vanishing Gradient. Notice in the visualizer how large inputs get squashed into a tiny region where the slope is almost zero? That kills learning.
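A small sketch of why Sigmoid saturates: its derivative σ(z)(1 − σ(z)) never exceeds 0.25 and collapses toward zero for large |z|.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # maximum value 0.25, reached at z = 0

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.4f}  slope = {sigmoid_grad(z):.6f}")
# At z = 10 the slope is ~0.000045: almost no gradient flows back through the layer.
```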
3. Interactive Visualizer: The Neural Fold v3.0
Below, we visualize a single layer with 2 inputs and 2 neurons. We start with a grid of points (Blue).
- Linear Step: Apply Wx. (Shear/Rotate).
- Activation Step: Apply σ(z) element-wise, where z is the result of the linear step.
Task: Switch between ReLU, Leaky ReLU, and Sigmoid. Observe how ReLU folds the space like a piece of paper, Leaky ReLU bends it slightly, and Sigmoid squashes the whole grid into a bounded square.
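If you want to reproduce the visualizer's two steps offline, here is a hedged NumPy/Matplotlib sketch. It builds a grid of 2D points, applies a linear step, then ReLU; the particular W, b, and plotting choices are illustrative and not taken from the app.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of 2D points (the blue starting grid).
xs, ys = np.meshgrid(np.linspace(-1, 1, 15), np.linspace(-1, 1, 15))
points = np.stack([xs.ravel(), ys.ravel()])          # shape (2, 225)

W = np.array([[1.0, 0.5],                            # example weights: a shear plus a stretch
              [0.2, 1.0]])
b = np.array([[0.3], [-0.2]])                        # example bias: shifts the grid

z = W @ points + b                                   # Linear Step
y = np.maximum(z, 0.0)                               # Activation Step (ReLU folds the grid onto the axes)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, data, title in zip(axes, [points, z, y], ["Input grid", "After Wx + b", "After ReLU"]):
    ax.scatter(data[0], data[1], s=5)
    ax.set_title(title)
    ax.set_aspect("equal")
plt.show()
```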
4. Summary
- W (Weights): Linearly transforms the space (Rotate/Scale/Shear).
- b (Bias): Translates the space.
- Activation: Non-linearly warps the space.
- ReLU: Folds space. Good for Deep Learning.
- Sigmoid: Squashes space. Good for probability output, bad for deep layers (Vanishing Gradient).