Mutual Information
While Entropy measures the uncertainty of a single variable, Mutual Information (MI) measures the relationship between two random variables. It tells us how much knowing one variable reduces uncertainty about the other.
1. Joint and Conditional Entropy
Before defining Mutual Information, we need two building blocks:
Joint Entropy H(X, Y)
The uncertainty associated with a pair of random variables (X, Y).
Conditional Entropy H(X|Y)
The average uncertainty remaining in X after we know Y.
> [!NOTE]
> The Chain Rule for Entropy: $H(X, Y) = H(X) + H(Y \mid X)$
> The total uncertainty of X and Y is the uncertainty of X plus the remaining uncertainty of Y once X is known.
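As a quick numerical check of the chain rule, here is a short Python sketch. The joint distribution below is made up for illustration; any valid P(X, Y) would work.

```python
from math import log2

# Hypothetical joint distribution P(X, Y) over X in {0,1}, Y in {0,1}
# (illustrative numbers, not from the text)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal P(X) and joint entropy H(X, Y)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
H_x = H(p_x.values())
H_xy = H(p_xy.values())

# Conditional entropy H(Y|X) = sum_x P(x) * H(Y | X = x)
H_y_given_x = sum(
    p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1)])
    for x in (0, 1)
)

# Chain rule: H(X, Y) == H(X) + H(Y|X)
print(f"H(X,Y) = {H_xy:.4f}, H(X) + H(Y|X) = {H_x + H_y_given_x:.4f}")
```

The two printed values agree, which is exactly what the chain rule promises.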
2. Mutual Information (MI)
Mutual Information I(X; Y) is the reduction in uncertainty of X due to the knowledge of Y:

$I(X; Y) = H(X) - H(X \mid Y)$

It is symmetric: $I(X; Y) = I(Y; X)$. It can also be written as:

$I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$

If X and Y are independent, knowing Y gives no information about X, so $H(X \mid Y) = H(X)$ and $I(X; Y) = 0$.
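The independence case is easy to verify directly from the definition. The sketch below builds an (arbitrary, illustrative) independent joint distribution, so every ratio inside the logarithm is 1 and the sum collapses to zero:

```python
from math import log2

# Hypothetical independent variables: P(x, y) = P(x) * P(y) by construction
p_x = {0: 0.3, 1: 0.7}
p_y = {0: 0.5, 1: 0.5}
p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# I(X;Y) = sum P(x,y) * log2( P(x,y) / (P(x) * P(y)) )
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(f"I(X;Y) = {mi:.4f} bits")  # 0 for independent variables
```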
The Information Venn Diagram
The relationship between these quantities is best visualized as a Venn Diagram.
- Circle 1 (Left): Entropy of X, H(X).
- Circle 2 (Right): Entropy of Y, H(Y).
- Intersection: Mutual Information I(X; Y).
- Union: Joint Entropy H(X, Y).
3. Intuition: Information Gain
Think of MI as Information Gain.
- Scenario A: Correlation.
- X = “It is raining”.
- Y = “Ground is wet”.
- Knowing Y significantly reduces the uncertainty about X. MI is High.
- Scenario B: Independence.
- X = “It is raining”.
- Y = “Stock market is up”.
- Knowing Y tells you nothing about X. MI is Zero.
In Decision Trees (e.g., ID3, C4.5 algorithms), we choose the split feature that maximizes Information Gain (Mutual Information) with the target variable.
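The split-selection idea behind ID3 can be sketched in a few lines of Python. The feature and label values below are invented for illustration; the score is H(target) minus the weighted entropy of each branch, i.e. the mutual information between the feature and the target:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    """H(target) - H(target | feature): the ID3 split score,
    equal to the mutual information between feature and target."""
    n = len(target)
    h_before = entropy(target)
    h_after = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        h_after += (len(subset) / n) * entropy(subset)
    return h_before - h_after

# Hypothetical toy data: an 'outlook' feature vs. a 'play' target
outlook = ["sun", "sun", "rain", "rain", "sun", "rain"]
play    = ["no",  "no",  "yes",  "yes",  "no",  "yes"]
print(f"Information gain: {information_gain(outlook, play):.4f} bits")
```

In this toy data the feature perfectly predicts the target, so the gain equals the full entropy of the target (1 bit); a real tree would compute this score for every candidate feature and split on the maximum.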
Case Study: Deciphering the Enigma (A Historical War Story)
During World War II, Alan Turing and the team at Bletchley Park faced the monumental task of breaking the German Enigma code. At its core, the problem was one of Information Theory (even before Claude Shannon formalized the field!).
- The Problem: The intercepted messages ($Y$) appeared as random noise. The true message ($X$) was unknown.
- The Goal: Maximize the Mutual Information $I(X; Y)$. Since $Y$ alone gave no information about $X$ (because $H(X \mid Y)$ was very high due to the encryption key), they needed to find the key.
- The Solution: Turing didn’t just guess keys randomly. He looked for “cribs” (known plaintexts like “WETTER” - weather). By knowing a small piece of $X$, and seeing the corresponding $Y$, he drastically reduced the conditional entropy $H(\text{Key} \mid X, Y)$. This meant the Mutual Information between the intercepted message and the potential keys skyrocketed, allowing the Bombe machine to rule out millions of incorrect settings in minutes.
Pedagogy Note: This is why Mutual Information is so powerful. It’s not just about correlation; it’s about how much the uncertainty (Entropy) of a system collapses when you gain a new piece of data.
4. Code Implementation
Calculate Mutual Information from discrete data in Java, Go, and Python.
Java
```java
import java.util.HashMap;
import java.util.Map;

public class MutualInformation {

    public static double calculateMI(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Double> pX = new HashMap<>();
        Map<Integer, Double> pY = new HashMap<>();
        Map<String, Double> pXY = new HashMap<>();

        // 1. Count occurrences of each value and each (x, y) pair
        for (int i = 0; i < n; i++) {
            pX.put(x[i], pX.getOrDefault(x[i], 0.0) + 1.0);
            pY.put(y[i], pY.getOrDefault(y[i], 0.0) + 1.0);
            String key = x[i] + "," + y[i];
            pXY.put(key, pXY.getOrDefault(key, 0.0) + 1.0);
        }

        // Normalize counts into probabilities
        pX.replaceAll((k, v) -> v / n);
        pY.replaceAll((k, v) -> v / n);
        pXY.replaceAll((k, v) -> v / n);

        // 2. I(X;Y) = Sum P(x,y) * log2(P(x,y) / (P(x)*P(y)))
        double mi = 0.0;
        for (Map.Entry<String, Double> entry : pXY.entrySet()) {
            String[] parts = entry.getKey().split(",");
            int valX = Integer.parseInt(parts[0]);
            int valY = Integer.parseInt(parts[1]);
            double jointP = entry.getValue();
            double marginalP = pX.get(valX) * pY.get(valY);
            if (jointP > 0) {
                mi += jointP * (Math.log(jointP / marginalP) / Math.log(2));
            }
        }
        return mi;
    }

    public static void main(String[] args) {
        int[] X = {0, 0, 1, 1, 0, 1, 0, 1, 1, 1};
        int[] Y = {0, 0, 1, 1, 0, 1, 1, 1, 1, 0};
        System.out.printf("Mutual Information: %.4f bits%n", calculateMI(X, Y));
    }
}
```
Go
```go
package main

import (
	"fmt"
	"math"
)

func CalculateMI(x, y []int) float64 {
	n := float64(len(x))
	pX := make(map[int]float64)
	pY := make(map[int]float64)
	pXY := make(map[string]float64)

	// 1. Count occurrences of each value and each (x, y) pair
	for i := 0; i < len(x); i++ {
		pX[x[i]]++
		pY[y[i]]++
		key := fmt.Sprintf("%d,%d", x[i], y[i])
		pXY[key]++
	}

	// 2. I(X;Y) = Sum P(x,y) * log2(P(x,y) / (P(x)*P(y)))
	mi := 0.0
	for key, count := range pXY {
		jointP := count / n
		var valX, valY int
		fmt.Sscanf(key, "%d,%d", &valX, &valY)
		px := pX[valX] / n
		py := pY[valY] / n
		if jointP > 0 {
			mi += jointP * math.Log2(jointP/(px*py))
		}
	}
	return mi
}

func main() {
	X := []int{0, 0, 1, 1, 0, 1, 0, 1, 1, 1}
	Y := []int{0, 0, 1, 1, 0, 1, 1, 1, 1, 0}
	fmt.Printf("Mutual Information: %.4f bits\n", CalculateMI(X, Y))
}
```
Python
```python
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.stats import entropy

# Example: X and Y are discrete variables
# X: 0 = Rain, 1 = No Rain
# Y: 0 = Umbrella, 1 = No Umbrella
X = [0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
Y = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0]  # Y is correlated with X

# 1. Using sklearn (works on raw labels)
# Note: mutual_info_score uses natural log (nats)
mi_nats = mutual_info_score(X, Y)
mi_bits = mi_nats / np.log(2)
print(f"Mutual Information: {mi_bits:.4f} bits")

# 2. Manual calculation from the joint distribution
def calc_mutual_information(x, y):
    # Joint probabilities P(x, y) from pair counts
    coords = np.stack((x, y), axis=-1)
    unique, counts = np.unique(coords, axis=0, return_counts=True)
    P_xy = counts / len(x)
    # Marginals P(x) and P(y)
    P_x = np.array([np.sum(P_xy[unique[:, 0] == val]) for val in np.unique(x)])
    P_y = np.array([np.sum(P_xy[unique[:, 1] == val]) for val in np.unique(y)])
    # H(X), H(Y), and H(X, Y)
    H_x = entropy(P_x, base=2)
    H_y = entropy(P_y, base=2)
    H_xy = entropy(P_xy, base=2)
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return H_x + H_y - H_xy

mi_manual = calc_mutual_information(X, Y)
print(f"Manual MI Calculation: {mi_manual:.4f} bits")
```
Key Takeaways
- MI measures dependency: It quantifies how much information X and Y share.
- Non-linear: Unlike correlation (which measures linear relationships), MI captures non-linear dependencies.
- Feature Selection: In Machine Learning, MI is often used to select features that are most informative about the target variable.
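The non-linearity point is easy to demonstrate. In the sketch below (a contrived example: Y = X² over a symmetric range), the Pearson correlation is exactly zero, yet Y is fully determined by X, so the mutual information is positive:

```python
from math import log2
from collections import Counter

# Y = X**2 over symmetric X: zero linear correlation,
# but Y is a deterministic function of X, so MI > 0.
X = [-2, -1, 1, 2]
Y = [x * x for x in X]

# Pearson correlation, computed by hand
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
sx = (sum((x - mx) ** 2 for x in X) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in Y) / n) ** 0.5
corr = cov / (sx * sy)

# Mutual information from empirical counts
p_x, p_y = Counter(X), Counter(Y)
p_xy = Counter(zip(X, Y))
mi = sum((c / n) * log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
         for (x, y), c in p_xy.items())

print(f"correlation = {corr:.4f}, MI = {mi:.4f} bits")
```

Here the correlation is 0 while the MI is 1 bit: knowing X pins down Y completely, and correlation is blind to it.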