Measuring Surprise: Information & Entropy

1. Introduction: The Bit

Information Theory is the mathematical study of Surprise. Imagine I send you a message. How much information did you actually receive?

  • Scenario A: I tell you, “The sun rose today.”
    • Surprise: Zero.
    • Information: Zero. You already knew that with 100% certainty.
  • Scenario B: I tell you, “I just won the lottery.”
    • Surprise: Massive.
    • Information: High. This drastically changes your knowledge of the world.

Claude Shannon, the father of the Digital Age, defined the fundamental unit of information as the Bit. One bit is the amount of information required to distinguish between two equally likely alternatives (e.g., a Fair Coin flip).

Surprisal (Self-Information)

The Surprisal of an event x is defined as:

I(x) = -log2 P(x)
  • If P(x) = 1 (Certainty), I(x) = -log2(1) = 0 bits.
  • If P(x) = 0.5 (Fair Coin), I(x) = -log2(0.5) = 1 bit.
  • If P(x) = 0.1 (Rare), I(x) ≈ 3.32 bits.

[!TIP] Interview Tip: Why log base 2? Because computers use binary (0 and 1). If we used log base 10, the unit would be “Hartleys” (decimal digits). If we used natural log (ln), the unit would be “nats”. In Deep Learning, we often use ‘nats’ because derivatives of $e^x$ are cleaner, but we still call the loss “Cross-Entropy”.
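To make those numbers concrete, here is a minimal sketch in plain Python (the `surprisal` helper is just an illustrative name, not a standard library function):

```python
import math

def surprisal(p: float, base: float = 2.0) -> float:
    """Self-information I(x) = -log P(x), written as log(1/P(x)). Base 2 -> bits, base e -> nats."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log(1 / p, base)

print(surprisal(1.0))          # 0.0   -> certainty carries no information
print(surprisal(0.5))          # 1.0   -> a fair coin flip is worth exactly 1 bit
print(surprisal(0.1))          # ~3.32 -> rare events are highly surprising
print(surprisal(0.5, math.e))  # ~0.693 nats -> the same flip, measured in nats
```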


2. Entropy (H)

Entropy is the average surprisal (or expected information) of a random variable. It measures the total uncertainty, chaos, or impurity in a system.

For a probability distribution P, Entropy H is:

H(P) = - Σ P(x) · log2 P(x)

The Coin Example

Let P(Heads) = p.

  1. Fair Coin (p = 0.5):
    • The outcome is completely unpredictable.
    • H = -(0.5 · log2 0.5 + 0.5 · log2 0.5) = 1 bit.
    • Maximum Entropy.
  2. Rigged Coin (p = 0.9):
    • You can guess “Heads” and be right 90% of the time.
    • There is less uncertainty.
    • H ≈ 0.47 bits.
  3. Double Headed Coin (p = 1.0):
    • Zero uncertainty. You know it’s Heads.
    • H = 0 bits.
    • Minimum Entropy.
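A short sketch that reproduces all three coin cases (the `entropy` and `binary_entropy` helpers are illustrative names, not library functions):

```python
import math

def entropy(probs, base: float = 2.0) -> float:
    """H(P) = -sum P(x) * log P(x); outcomes with zero probability contribute nothing."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)

def binary_entropy(p: float) -> float:
    """Entropy of a coin with P(Heads) = p, in bits."""
    return entropy([p, 1 - p])

print(binary_entropy(0.5))  # 1.0   -> fair coin, maximum entropy
print(binary_entropy(0.9))  # ~0.47 -> rigged coin, less uncertainty
print(binary_entropy(1.0))  # 0.0   -> double-headed coin, minimum entropy
```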

3. Interactive Visualizer: The Surprise Game

Goal: Minimize your “Surprise Score”. You are betting on coin flips.

  • If you bet on the likely outcome (e.g., Heads when P(Heads)=0.9), and you win, your surprise (cost) is low.
  • If the rare outcome happens, your surprise is HUGE.

Adjust the Coin Fairness slider. Notice that:

  1. High Entropy (0.5): You are always moderately surprised (1 bit).
  2. Low Entropy (0.9): You are usually not surprised (0.15 bits), but occasionally shocked (3.3 bits).

[Interactive widget: a Coin Fairness slider with live readouts for H(p), the flip result, and the per-flip surprisal in bits.]
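If the widget isn't available, a quick simulation makes the same point: the average surprisal over many flips converges to the entropy of the coin. (The bias of 0.9 matches the rigged coin above; the 10,000-flip count is an arbitrary choice.)

```python
import math
import random

random.seed(0)
p_heads = 0.9          # the "Coin Fairness" slider
flips = 10_000

total_surprise = 0.0
for _ in range(flips):
    heads = random.random() < p_heads
    prob_of_outcome = p_heads if heads else 1 - p_heads
    total_surprise += -math.log2(prob_of_outcome)  # surprisal of this flip, in bits

entropy = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))
print(f"average surprise per flip: {total_surprise / flips:.3f} bits")  # ~0.47
print(f"entropy H(p):              {entropy:.3f} bits")                 # ~0.47
```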

4. Huffman Coding: Practical Compression

How do we use Entropy to compress files? If we know that the letter ‘e’ is very common (High Probability, Low Surprisal), and ‘z’ is rare (Low Probability, High Surprisal), we should give ‘e’ a short code and ‘z’ a long code.

Huffman Coding builds a binary tree based on frequency.

  • e: 0 (1 bit)
  • z: 10110 (5 bits)

The average code length of a Huffman code approaches the Entropy of the source. Shannon's source coding theorem makes this a hard limit: no lossless code can achieve an average length below the Entropy.
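Here is a compact sketch of the standard heap-based construction (the sample sentence and the `huffman_codes` helper are just for illustration; a real compressor also has to store the code table):

```python
import heapq
from collections import Counter

def huffman_codes(freqs: dict) -> dict:
    """Build a Huffman code: frequent symbols end up with shorter codewords."""
    # Heap entries: (total frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # pop the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "entropy is the essence of everything"   # made-up sample text
codes = huffman_codes(Counter(text))
print(codes["e"], codes["y"])   # 'e' (frequent) never gets a longer codeword than 'y' (rare)
```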


5. Cross-Entropy Loss

In Machine Learning (Classification), we don’t just care about the entropy of one distribution. We care about the difference between two. Imagine an AI trying to classify an image.

The “Dog vs Wolf” Example

  • True Label (P): The image is definitely a Dog.
    • P = [Dog: 1.0, Wolf: 0.0]
  • Model Prediction (Q): The model thinks it’s likely a Dog, but isn’t sure.
    • Q = [Dog: 0.8, Wolf: 0.2]

We want Q to look like P. We minimize Cross-Entropy:

H(P, Q) = - Σ P(x) · log(Q(x))

Since P is a “One-Hot” vector (1 for Dog, 0 for Wolf), the terms where P(x)=0 vanish. The formula simplifies dramatically to just the negative log-probability of the correct class:

Loss = -log(Q(Dog))
  • Case 1 (Good Model): Prediction Q(Dog) = 0.9.
    • Loss = -log(0.9) ≈ 0.10 nats. (Low Penalty)
  • Case 2 (Bad Model): Prediction Q(Dog) = 0.1.
    • Loss = -log(0.1) ≈ 2.30 nats. (High Penalty)

This is why Neural Networks punish confident wrong answers so heavily.
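A minimal sketch of that calculation in plain Python (natural log, so the values are in nats and match the numbers above; Q is assumed to already be a probability distribution, e.g. the output of a softmax):

```python
import math

def cross_entropy(p_true, q_pred) -> float:
    """H(P, Q) = -sum P(x) * log Q(x), in nats (natural log)."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

P = [1.0, 0.0]                       # one-hot truth: [Dog, Wolf]
print(cross_entropy(P, [0.9, 0.1]))  # ~0.105 -> confident and right: low penalty
print(cross_entropy(P, [0.1, 0.9]))  # ~2.303 -> confident and wrong: high penalty
```

The `if p > 0` filter is exactly the “one-hot terms vanish” simplification described above.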


6. KL Divergence (Relative Entropy)

KL Divergence measures “how much information is lost” when we use distribution Q to approximate P.

It is defined as:

DKL(P || Q) = Σ P(x) · log( P(x) / Q(x) ) = H(P, Q) - H(P)
  • Distance Analogy: Think of it as the “distance” between two distributions (though it’s not symmetric, i.e., distance A→B ≠ B→A).
  • If P = Q, then DKL = 0.
  • This is the core of Variational Autoencoders (VAEs), where we force the learned latent distribution to match a Standard Normal Distribution (Gaussian).
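A self-contained sketch that checks the identity numerically (P and Q are made-up distributions; natural log, so everything is in nats):

```python
import math

def kl_divergence(p, q) -> float:
    """D_KL(P || Q) = sum P(x) * log(P(x) / Q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.8, 0.2]   # made-up "true" distribution
Q = [0.5, 0.5]   # made-up approximation

cross_entropy = -sum(px * math.log(qx) for px, qx in zip(P, Q) if px > 0)
entropy = -sum(px * math.log(px) for px in P if px > 0)

print(kl_divergence(P, Q))      # ~0.193 nats lost by using Q in place of P
print(cross_entropy - entropy)  # same value: D_KL(P || Q) = H(P, Q) - H(P)
print(kl_divergence(P, P))      # 0.0 -> identical distributions, nothing lost
```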

7. Summary

  • Surprisal: Rare events carry more information (high bit count).
  • Entropy: The average uncertainty of a system. The lower bound for compression (Huffman).
  • Cross-Entropy: The standard loss function for classification (minimizes surprise of the truth).
  • Bit: The currency of information.

Next: Graph Theory Basics →