Mutual Information
While Entropy measures the uncertainty of a single variable, Mutual Information (MI) measures the relationship between two random variables. It tells us how much knowing one variable reduces uncertainty about the other.
1. Joint and Conditional Entropy
Before defining Mutual Information, we need two building blocks:
Joint Entropy H(X, Y)
The uncertainty associated with a pair of random variables (X, Y).
Conditional Entropy H(X|Y)
The average uncertainty remaining in X after we know Y.
> [!NOTE]
> The Chain Rule for Entropy: $H(X, Y) = H(X) + H(Y \mid X)$
> The total uncertainty of X and Y is the uncertainty of X plus the remaining uncertainty of Y once X is known.
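As a quick numerical check of the chain rule, here is a short Python sketch. The joint distribution below is made up for illustration; any valid P(X, Y) would work.

```python
from math import log2

# Hypothetical joint distribution P(X, Y) over X in {0,1}, Y in {0,1}
# (illustrative numbers, not from the text)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal P(X) and joint entropy H(X, Y)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
H_x = H(p_x.values())
H_xy = H(p_xy.values())

# Conditional entropy H(Y|X) = sum_x P(x) * H(Y | X = x)
H_y_given_x = sum(
    p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1)])
    for x in (0, 1)
)

# Chain rule: H(X, Y) == H(X) + H(Y|X)
print(f"H(X,Y) = {H_xy:.4f}, H(X) + H(Y|X) = {H_x + H_y_given_x:.4f}")
```

The two printed values agree, which is exactly what the chain rule promises.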
2. Mutual Information (MI)
Mutual Information I(X; Y) is the reduction in uncertainty of X due to the knowledge of Y:

$I(X; Y) = H(X) - H(X \mid Y)$

It is symmetric: $I(X; Y) = I(Y; X)$. It can also be written as:

$I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$

If X and Y are independent, knowing Y gives no information about X, so $H(X \mid Y) = H(X)$ and $I(X; Y) = 0$.
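The independence case is easy to verify directly from the definition. The sketch below builds an (arbitrary, illustrative) independent joint distribution, so every ratio inside the logarithm is 1 and the sum collapses to zero:

```python
from math import log2

# Hypothetical independent variables: P(x, y) = P(x) * P(y) by construction
p_x = {0: 0.3, 1: 0.7}
p_y = {0: 0.5, 1: 0.5}
p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# I(X;Y) = sum P(x,y) * log2( P(x,y) / (P(x) * P(y)) )
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(f"I(X;Y) = {mi:.4f} bits")  # 0 for independent variables
```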
The Information Venn Diagram
The relationship between these quantities is best visualized as a Venn Diagram.
- Circle 1 (Left): Entropy of X, H(X).
- Circle 2 (Right): Entropy of Y, H(Y).
- Intersection: Mutual Information I(X; Y).
- Union: Joint Entropy H(X, Y).
3. Intuition: Information Gain
Think of MI as Information Gain.
- Scenario A: Correlation.
- X = “It is raining”.
- Y = “Ground is wet”.
- Knowing Y significantly reduces the uncertainty about X. MI is High.
- Scenario B: Independence.
- X = “It is raining”.
- Y = “Stock market is up”.
- Knowing Y tells you nothing about X. MI is Zero.
In Decision Trees (e.g., ID3, C4.5 algorithms), we choose the split feature that maximizes Information Gain (Mutual Information) with the target variable.
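The split-selection idea behind ID3 can be sketched in a few lines of Python. The feature and label values below are invented for illustration; the score is H(target) minus the weighted entropy of each branch, i.e. the mutual information between the feature and the target:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    """H(target) - H(target | feature): the ID3 split score,
    equal to the mutual information between feature and target."""
    n = len(target)
    h_before = entropy(target)
    h_after = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        h_after += (len(subset) / n) * entropy(subset)
    return h_before - h_after

# Hypothetical toy data: an 'outlook' feature vs. a 'play' target
outlook = ["sun", "sun", "rain", "rain", "sun", "rain"]
play    = ["no",  "no",  "yes",  "yes",  "no",  "yes"]
print(f"Information gain: {information_gain(outlook, play):.4f} bits")
```

In this toy data the feature perfectly predicts the target, so the gain equals the full entropy of the target (1 bit); a real tree would compute this score for every candidate feature and split on the maximum.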
Case Study: Deciphering the Enigma (A Historical War Story)
During World War II, Alan Turing and the team at Bletchley Park faced the monumental task of breaking the German Enigma code. At its core, the problem was one of Information Theory (even before Claude Shannon formalized the field!).
- The Problem: The intercepted messages ($Y$) appeared as random noise. The true message ($X$) was unknown.
- The Goal: Maximize the Mutual Information $I(X; Y)$. Since $Y$ alone gave no information about $X$ (because $H(X \mid Y)$ was very high due to the encryption key), they needed to find the key.
- The Solution: Turing didn’t just guess keys randomly. He looked for “cribs” (known plaintexts like “WETTER” - weather). By knowing a small piece of $X$, and seeing the corresponding $Y$, he drastically reduced the conditional entropy $H(\text{Key} \mid X, Y)$. This meant the Mutual Information between the intercepted message and the potential keys skyrocketed, allowing the Bombe machine to rule out millions of incorrect settings in minutes.
Pedagogy Note: This is why Mutual Information is so powerful. It’s not just about correlation; it’s about how much the uncertainty (Entropy) of a system collapses when you gain a new piece of data.
4. Code Implementation
Calculate Mutual Information from discrete data in Java, Go, and Python.
Java
```java
import java.util.HashMap;
import java.util.Map;

public class MutualInformation {

    public static double calculateMI(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Double> pX = new HashMap<>();
        Map<Integer, Double> pY = new HashMap<>();
        Map<String, Double> pXY = new HashMap<>();

        // 1. Count occurrences of each value and each (x, y) pair
        for (int i = 0; i < n; i++) {
            pX.put(x[i], pX.getOrDefault(x[i], 0.0) + 1.0);
            pY.put(y[i], pY.getOrDefault(y[i], 0.0) + 1.0);
            String key = x[i] + "," + y[i];
            pXY.put(key, pXY.getOrDefault(key, 0.0) + 1.0);
        }

        // Normalize counts into probabilities
        pX.replaceAll((k, v) -> v / n);
        pY.replaceAll((k, v) -> v / n);
        pXY.replaceAll((k, v) -> v / n);

        // 2. I(X;Y) = Sum P(x,y) * log2(P(x,y) / (P(x)*P(y)))
        double mi = 0.0;
        for (Map.Entry<String, Double> entry : pXY.entrySet()) {
            String[] parts = entry.getKey().split(",");
            int valX = Integer.parseInt(parts[0]);
            int valY = Integer.parseInt(parts[1]);
            double jointP = entry.getValue();
            double marginalP = pX.get(valX) * pY.get(valY);
            if (jointP > 0) {
                mi += jointP * (Math.log(jointP / marginalP) / Math.log(2));
            }
        }
        return mi;
    }

    public static void main(String[] args) {
        int[] X = {0, 0, 1, 1, 0, 1, 0, 1, 1, 1};
        int[] Y = {0, 0, 1, 1, 0, 1, 1, 1, 1, 0};
        System.out.printf("Mutual Information: %.4f bits%n", calculateMI(X, Y));
    }
}
```
Go
```go
package main

import (
	"fmt"
	"math"
)

func CalculateMI(x, y []int) float64 {
	n := float64(len(x))
	pX := make(map[int]float64)
	pY := make(map[int]float64)
	pXY := make(map[string]float64)

	// 1. Count occurrences of each value and each (x, y) pair
	for i := 0; i < len(x); i++ {
		pX[x[i]]++
		pY[y[i]]++
		key := fmt.Sprintf("%d,%d", x[i], y[i])
		pXY[key]++
	}

	// 2. I(X;Y) = Sum P(x,y) * log2(P(x,y) / (P(x)*P(y)))
	mi := 0.0
	for key, count := range pXY {
		jointP := count / n
		var valX, valY int
		fmt.Sscanf(key, "%d,%d", &valX, &valY)
		px := pX[valX] / n
		py := pY[valY] / n
		if jointP > 0 {
			mi += jointP * math.Log2(jointP/(px*py))
		}
	}
	return mi
}

func main() {
	X := []int{0, 0, 1, 1, 0, 1, 0, 1, 1, 1}
	Y := []int{0, 0, 1, 1, 0, 1, 1, 1, 1, 0}
	fmt.Printf("Mutual Information: %.4f bits\n", CalculateMI(X, Y))
}
```
Python
```python
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.stats import entropy

# Example: X and Y are discrete variables
# X: 0 = Rain, 1 = No Rain
# Y: 0 = Umbrella, 1 = No Umbrella
X = [0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
Y = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0]  # Y is correlated with X

# 1. Using sklearn (works on raw labels)
# Note: mutual_info_score uses natural log (nats)
mi_nats = mutual_info_score(X, Y)
mi_bits = mi_nats / np.log(2)
print(f"Mutual Information: {mi_bits:.4f} bits")

# 2. Manual calculation from the joint distribution
def calc_mutual_information(x, y):
    # Joint probabilities P(x, y) from pair counts
    coords = np.stack((x, y), axis=-1)
    unique, counts = np.unique(coords, axis=0, return_counts=True)
    P_xy = counts / len(x)
    # Marginals P(x) and P(y)
    P_x = np.array([np.sum(P_xy[unique[:, 0] == val]) for val in np.unique(x)])
    P_y = np.array([np.sum(P_xy[unique[:, 1] == val]) for val in np.unique(y)])
    # H(X), H(Y), and H(X, Y)
    H_x = entropy(P_x, base=2)
    H_y = entropy(P_y, base=2)
    H_xy = entropy(P_xy, base=2)
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return H_x + H_y - H_xy

mi_manual = calc_mutual_information(X, Y)
print(f"Manual MI Calculation: {mi_manual:.4f} bits")
```
Key Takeaways
- MI measures dependency: It quantifies how much information X and Y share.
- Non-linear: Unlike correlation (which measures linear relationships), MI captures non-linear dependencies.
- Feature Selection: In Machine Learning, MI is often used to select features that are most informative about the target variable.
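The non-linearity point is easy to demonstrate. In the sketch below (a contrived example: Y = X² over a symmetric range), the Pearson correlation is exactly zero, yet Y is fully determined by X, so the mutual information is positive:

```python
from math import log2
from collections import Counter

# Y = X**2 over symmetric X: zero linear correlation,
# but Y is a deterministic function of X, so MI > 0.
X = [-2, -1, 1, 2]
Y = [x * x for x in X]

# Pearson correlation, computed by hand
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
sx = (sum((x - mx) ** 2 for x in X) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in Y) / n) ** 0.5
corr = cov / (sx * sy)

# Mutual information from empirical counts
p_x, p_y = Counter(X), Counter(Y)
p_xy = Counter(zip(X, Y))
mi = sum((c / n) * log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
         for (x, y), c in p_xy.items())

print(f"correlation = {corr:.4f}, MI = {mi:.4f} bits")
```

Here the correlation is 0 while the MI is 1 bit: knowing X pins down Y completely, and correlation is blind to it.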