Activation Functions

> [!IMPORTANT]
> Without activation functions, a neural network, no matter how many layers it has, would just be a big linear regression model. Activation functions introduce non-linearity, allowing the network to learn complex patterns.

1. Why Non-Linearity?

If we only stack linear operations (weighted sums), the entire network collapses into a single linear transformation:

Output = W₂(W₁x) = (W₂W₁)x = W_new · x

The same holds with bias terms: a composition of affine maps is still a single affine map.

To approximate any function (Universal Approximation Theorem), we need to bend and twist the decision boundaries. This is what activation functions do.

The Paper Folding Analogy: Imagine trying to separate red and blue dots drawn on a flat sheet of paper by drawing a single straight line (a linear function). If the dots are arranged in a circle, one line won’t work. However, if you fold and crumple the paper (apply non-linearity) so all the red dots are elevated, you can then slice horizontally to separate them. Activation functions are what “fold” the geometric space.
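The collapse above is easy to verify numerically. Here is a minimal sketch (the shapes and random values are arbitrary, chosen only for illustration) showing that two stacked linear layers compute exactly the same function as one merged layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two "layers" with no activation function in between
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
deep_output = W2 @ (W1 @ x)

# The equivalent single linear layer
W_new = W2 @ W1
shallow_output = W_new @ x

print(np.allclose(deep_output, shallow_output))  # True
```

Inserting any non-linear function between `W1` and `W2` breaks this equivalence, which is precisely the point.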

2. Common Activation Functions

For each function below, pay attention to both its shape and its derivative: the derivative is what backpropagation actually propagates, so it determines how well gradients flow.

2.1 Sigmoid

  • Formula: σ(x) = 1 / (1 + e⁻ˣ)
  • Range: (0, 1)
  • Pros: smooth gradient; outputs can be read as probabilities.
  • Cons:
      • Vanishing gradient: the derivative shrinks toward 0 at the tails (x → ±∞), stalling learning in earlier layers.
      • Not zero-centered: outputs are always positive, which can make gradient updates zigzag.
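The vanishing-gradient claim follows from the derivative σ'(x) = σ(x)(1 − σ(x)), which peaks at 0.25 and decays rapidly at the tails. A quick sketch to see the numbers:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # σ'(x) = σ(x) * (1 - σ(x)); maximum value is 0.25, at x = 0
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(x))  # tiny at the tails, 0.25 in the middle
```

Since each sigmoid layer multiplies the backpropagated gradient by at most 0.25, a deep stack of them shrinks gradients geometrically.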

2.2 Tanh (Hyperbolic Tangent)

  • Formula: tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
  • Range: (-1, 1)
  • Pros: Zero-centered (stronger gradients than Sigmoid).
  • Cons: still suffers from the vanishing gradient problem (its derivative also decays to 0 at the tails).

2.3 ReLU (Rectified Linear Unit)

  • Formula: f(x) = max(0, x)
  • Analogy: Think of a light dimmer that doesn’t turn on until the dial passes zero. Before zero, it’s completely dark (0). After zero, the brightness increases linearly with the dial.
  • Range: [0, ∞)
  • Pros:
      • Computationally efficient: just a threshold at zero.
      • Mitigates vanishing gradients: the gradient is exactly 1 for positive inputs.
      • Sparsity: outputs a true 0 for negative inputs.
  • Cons:
      • Dying ReLU: a neuron can “die” if its weights update such that its input is always negative; its gradient is then 0 forever and it stops learning.
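A short sketch of ReLU and its gradient makes the dying-ReLU failure mode concrete: wherever the pre-activation is negative, the gradient is exactly zero, so a neuron whose inputs are always negative receives no weight updates at all.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Gradient is 1 where x > 0 and 0 elsewhere
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # negative inputs are clipped to 0
print(relu_grad(x))  # zero gradient on the entire negative half-line
```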

2.4 Softmax

Softmax is typically used in the output layer for multi-class classification. It converts raw logits into probabilities that sum to 1.

P(y=j) = e^(z_j) / Σ_k e^(z_k)

  • Analogy: Think of taking an unbaked pie of arbitrary size (the raw output) and dividing it into perfectly scaled slices for your guests, where every slice is positive and the total pie always equals exactly 100%.

3. Implementation in Python

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # Subtract the max for numerical stability (prevents overflow in exp)
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Test
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"Sigmoid: {sigmoid(x)}")
print(f"ReLU: {relu(x)}")
print(f"Softmax: {softmax(x)}")  # entries sum to 1
```
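The max-subtraction trick in `softmax` deserves a quick demonstration. The logits below are chosen deliberately large so that a naive `np.exp(x)` would overflow to infinity; subtracting the max shifts them into a safe range without changing the result, since softmax is invariant to adding a constant to all logits:

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Large logits that would overflow a naive exp
logits = np.array([1000.0, 1001.0, 1002.0])
probs = softmax(logits)
print(probs)        # finite, well-behaved probabilities
print(probs.sum())  # 1.0
```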

4. Which One to Use?

> [!TIP]
> Rule of Thumb:
>
>   • Start with ReLU for hidden layers.
>   • If you run into dying-ReLU issues, try Leaky ReLU.
>   • Use Sigmoid for binary classification outputs.
>   • Use Softmax for multi-class classification outputs.
>   • Avoid Sigmoid/Tanh in hidden layers of deep networks.
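Leaky ReLU, mentioned above as the fallback for dying neurons, is a one-line change to ReLU: negative inputs keep a small slope instead of being zeroed out, so the gradient never vanishes entirely. A minimal sketch (the slope `alpha=0.01` is a common default, assumed here):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs are scaled by alpha rather than clipped,
    # so the gradient on the negative side is alpha instead of 0.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(leaky_relu(x))  # negatives become -0.02 and -0.01, not 0
```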