Expectation, Variance, and Covariance

1. Introduction: Summarizing the Chaos

A dataset might have millions of points. We cannot look at all of them. We need Summary Statistics to describe the shape of the data.

Think of probability distributions as physical objects.

  1. Expectation: The Center of Mass. Where would the object balance on your finger?
  2. Variance: The Moment of Inertia. How hard is it to spin the object? (How spread out is the mass?)
  3. Covariance: The Rotation. How is the object tilted?

2. Expectation (E[X])

Also known as the Mean (μ). It is the sum of all possible outcomes, each weighted by its probability.

E[X] = ∑ x · P(X=x)
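
As a minimal sketch of this weighted sum, here is the expectation of a fair six-sided die computed in Python (a hypothetical example, not from the text):

```python
import numpy as np

# Fair six-sided die: outcomes 1..6, each with probability 1/6 (illustrative example).
outcomes = np.arange(1, 7)
probs = np.full(6, 1 / 6)

# E[X] = sum over x of x * P(X = x)
expectation = np.sum(outcomes * probs)
print(expectation)  # 3.5
```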

Linearity of Expectation

This is a superpower in math proofs. It holds even if X and Y are dependent!

E[aX + bY] = aE[X] + bE[Y]
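
A quick numerical check of this property with deliberately dependent variables (a sketch on made-up data; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# X and Y are deliberately dependent: Y is a noisy function of X (hypothetical data).
X = rng.normal(loc=2.0, size=100_000)
Y = X ** 2 + rng.normal(size=100_000)

a, b = 3.0, -1.5

lhs = np.mean(a * X + b * Y)           # E[aX + bY], estimated from samples
rhs = a * np.mean(X) + b * np.mean(Y)  # aE[X] + bE[Y]

print(lhs, rhs)  # agree up to sampling noise, despite the dependence
```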

  • ML Application: In Stochastic Gradient Descent (SGD), the gradient of a mini-batch is an unbiased estimator of the true (full-batch) gradient:

    E[∇L_batch] = ∇L_full

    By linearity of expectation, the average mini-batch gradient equals the true gradient, so on average SGD moves in the correct direction even though individual steps are noisy (a numerical check follows below).
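
Below is a minimal numerical sketch of this unbiasedness claim, assuming a toy linear-regression problem with an MSE loss (the data, batch size, and helper names are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (hypothetical).
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)  # current parameters

def grad(Xb, yb, w):
    """Gradient of the MSE loss (1/n) * ||Xb @ w - yb||^2 with respect to w."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)

# Average the gradient over many random mini-batches of size 32.
batch_grads = []
for _ in range(5000):
    idx = rng.choice(len(X), size=32, replace=False)
    batch_grads.append(grad(X[idx], y[idx], w))

print(full_grad)
print(np.mean(batch_grads, axis=0))  # ≈ full_grad: the mini-batch gradient is unbiased
```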


3. Variance (Var(X))

Variance measures the “spread” or “dispersion” from the mean.

Var(X) = E[(X - μ)²]

Standard Deviation (σ)

σ = √(Var(X))

We prefer σ because it has the same units as the data (e.g., “meters” instead of “meters squared”).
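
A small sketch of both definitions on hypothetical height data (values in meters, chosen purely for illustration):

```python
import numpy as np

# Hypothetical heights in meters (illustrative data).
heights = np.array([1.62, 1.70, 1.75, 1.80, 1.68])

mu = heights.mean()
var = np.mean((heights - mu) ** 2)  # Var(X) = E[(X - μ)²], in meters squared
sigma = np.sqrt(var)                # standard deviation, back in meters

print(var, sigma)
print(np.isclose(sigma, heights.std()))  # matches NumPy's (population) std
```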

  • ML Application:
    • Regression: Minimizing the Mean Squared Error (MSE) loss minimizes the variance of the residuals plus the squared mean residual (see the sketch below).
    • Regularization: High variance in a model’s predictions is a symptom of Overfitting. We want a balance (the Bias-Variance Tradeoff).
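
As a sketch of the residuals point above: the MSE of the residuals equals their variance plus the squared mean residual (toy numbers, purely illustrative):

```python
import numpy as np

# Hypothetical targets and predictions (illustrative numbers).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.5, 9.4])

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)
decomposed = residuals.var() + residuals.mean() ** 2  # Var(residuals) + (mean residual)²

print(mse, decomposed)  # identical: MSE = residual variance + squared mean residual
```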

4. Covariance & Correlation

How do two variables X and Y relate?

  • Covariance: Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)].
    • Positive: As X goes up, Y tends to go up.
    • Negative: As X goes up, Y tends to go down.
    • Zero: No linear relationship.
  • Correlation (r or ρ): Covariance normalized to lie between -1 and 1 (a worked sketch follows below).

    ρ_XY = Cov(X, Y) / (σ_X σ_Y)
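
A minimal sketch computing covariance and correlation on synthetic data (all values made up; np.cov uses the sample (n − 1) convention, so the standard deviations use ddof=1 to match):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two positively related variables (hypothetical data).
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)

cov_xy = np.cov(x, y)[0, 1]                     # off-diagonal entry of the 2x2 covariance matrix
rho = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # ρ = Cov(X, Y) / (σ_X σ_Y)

print(cov_xy, rho)
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in correlation agrees with ρ
```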

The Covariance Matrix (Σ)

In high dimensions (e.g., an image with 1000 pixels), we have a matrix where Σ_ij = Cov(X_i, X_j). This matrix defines the shape of the data cloud.

  • ML Application: Principal Component Analysis (PCA) computes the Eigenvectors of the Covariance Matrix. These vectors point in the directions of greatest variance, allowing us to compress data by ignoring the “flat” directions (a minimal sketch follows below).
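
Here is a minimal sketch of that idea on a small synthetic 2-D data cloud (names and numbers are illustrative, not the author's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2-D data cloud, stretched along one direction (illustrative).
data = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
centered = data - data.mean(axis=0)

# Covariance matrix Σ and its eigendecomposition.
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]       # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component (the direction of greatest variance).
pc1 = eigvecs[:, 0]
compressed = centered @ pc1  # 2-D -> 1-D compression

print(eigvals)  # variance along each principal axis
print(pc1)      # direction of PC1
```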

5. Interactive Visualizer: Correlation Tuner

Visualize how Correlation (r) changes the shape of a Bivariate Normal Distribution. PCA Mode: toggle to see the Eigenvectors (Principal Components); the longer vector (PC1) points in the direction of maximum variance.


[Interactive widget: correlation slider from -1.0 to +1.0; legend: ↑ Eigenvectors (PCA), — Regression Line, • Outlier (Draggable)]

6. Summary

  • Expectation: The “Center of Gravity”.
  • Variance: The “Moment of Inertia” (Spread).
  • Correlation: The “Linear Relationship”. Sensitive to Outliers!
  • PCA (Principal Component Analysis): Uses the Covariance Matrix to find the “main axes” of the data, allowing us to reduce dimensions (e.g., 100D → 2D).

Next: Sampling & Hypothesis Testing →