The Zoo of Distributions

1. Introduction: Describing Randomness

In the previous chapter, we learned how to calculate probabilities. Now we learn what to calculate. Most real-world randomness follows recurring patterns, and these patterns are called Probability Distributions.

A Random Variable X is a function that maps outcomes to numbers.

  • Discrete: X ∈ {0, 1, 2, ...} (e.g., Number of emails).
  • Continuous: X ∈ ℝ (e.g., Height, Temperature, Weights in a Neural Network).

PDF vs PMF

  • PMF (Probability Mass Function): For discrete variables. It gives the probability that a discrete random variable is exactly equal to some value.

    P(X=k)

  • PDF (Probability Density Function): For continuous variables. The probability at a single point is technically 0. We measure probability as the Area under the curve.

    P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
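
To make the distinction concrete, here is a minimal sketch using SciPy (an assumed dependency; any stats library works). A PMF is evaluated at a single point, while for a continuous variable we integrate the PDF over an interval, which the CDF does for us:

  from scipy import stats

  # Discrete: the PMF gives P(X = k) directly.
  # e.g. P(X = 3) for a Binomial(n=10, p=0.5)
  print(stats.binom.pmf(3, n=10, p=0.5))        # ≈ 0.117

  # Continuous: P(X = x) is 0; probability lives in intervals.
  # P(a ≤ X ≤ b) = F(b) - F(a), the area under the PDF.
  a, b = -1.0, 1.0
  print(stats.norm.cdf(b) - stats.norm.cdf(a))  # ≈ 0.683 for N(0, 1)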


2. Common Discrete Distributions

2.1 Bernoulli (p)

The “atom” of probability. A single trial with two outcomes: Success (1) or Failure (0).

  • Generative Story: You flip a biased coin once.
  • Parameter: p (probability of success).
  • ML Application: Logistic Regression outputs a Bernoulli probability P(Y=1|X). It models binary classification tasks like “Spam vs Ham”.
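
A quick sketch (NumPy assumed) of both ideas: the raw coin flip, and the logistic-regression connection, where a sigmoid squashes a score into the Bernoulli parameter. The score value is made up for illustration:

  import numpy as np

  rng = np.random.default_rng(42)
  p = 0.3  # probability of success

  # A Bernoulli(p) draw is just a Binomial(n=1, p) draw.
  flips = rng.binomial(n=1, p=p, size=10_000)
  print(flips.mean())  # ≈ 0.3, the empirical success rate

  # Logistic regression: a sigmoid turns a real-valued score
  # into a Bernoulli parameter P(Y=1|X).
  score = 0.8                            # hypothetical model output
  p_spam = 1.0 / (1.0 + np.exp(-score))  # ≈ 0.69
  label = rng.binomial(n=1, p=p_spam)    # sample Spam (1) vs Ham (0)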

2.2 Binomial (n, p)

The sum of n independent Bernoulli(p) trials, i.e., the number of successes in n tries.

  • Generative Story: You flip the same coin n times. How many heads do you get?
  • Formula:

    P(X=k) = C(n, k) · pᵏ · (1-p)ⁿ⁻ᵏ

  • ML Application: Predicting the number of conversions from n ad impressions.
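
As a sanity check, the closed-form PMF can be computed by hand and compared against SciPy (the values are illustrative: 20 impressions at a 10% conversion rate):

  from math import comb
  from scipy import stats

  n, p, k = 20, 0.1, 3  # 20 ad impressions, 10% conversion rate

  # The closed-form PMF...
  manual = comb(n, k) * p**k * (1 - p)**(n - k)
  print(manual)                    # ≈ 0.190

  # ...matches the library implementation.
  print(stats.binom.pmf(k, n, p))  # same value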

2.3 Poisson (λ)

Models the number of events happening in a fixed interval of time or space.

  • Generative Story: Events happen independently at a constant average rate.
  • Parameter: λ (lambda, average rate).
  • Example: Number of API requests per second to your server.
  • ML Application: Modeling count data (e.g., predicting call center volume or server load).
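
A short SciPy sketch for the server-load example (the rate λ = 5 requests per second is an assumption):

  from scipy import stats

  lam = 5.0  # average of 5 API requests per second

  # P(exactly 8 requests in one second)
  print(stats.poisson.pmf(8, mu=lam))       # ≈ 0.065

  # P(more than 10 requests), useful for capacity planning
  print(1 - stats.poisson.cdf(10, mu=lam))  # ≈ 0.014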

3. Continuous Distributions

3.1 The Gaussian (Normal) Distribution (μ, σ²)

The “King of Distributions”. It is bell-shaped, symmetric, and defined by:

  1. Mean (μ): The center (Expectation).
  2. Variance (σ²): The spread (Uncertainty).

f(x) = [1 / (σ√(2π))] · e^(-(x - μ)² / (2σ²))

  • Why is it everywhere?: The Central Limit Theorem says that the sum (or average) of many independent random variables with finite variance is approximately Gaussian, regardless of their original distribution.
  • ML Application:
    • Weight Initialization: We initialize Neural Network weights from a zero-mean Gaussian; schemes like Xavier/He Normal scale the variance to the layer size to keep training stable.
    • Error Analysis: In Linear Regression, we assume the noise is Gaussian: y = mx + b + ε, where ε ~ N(0, σ²).
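
Both claims are easy to check numerically. The sketch below (NumPy assumed; the layer sizes are hypothetical) first sums uniform draws to watch the Central Limit Theorem kick in, then draws He-initialized weights for a ReLU layer:

  import numpy as np

  rng = np.random.default_rng(0)

  # CLT demo: one Uniform(0, 1) draw is flat, but the sum of 30
  # draws is approximately Gaussian.
  sums = rng.uniform(0, 1, size=(100_000, 30)).sum(axis=1)
  print(sums.mean())  # ≈ 15.0  (theory: 30 · 0.5)
  print(sums.std())   # ≈ 1.58  (theory: √(30/12))

  # He initialization: zero-mean Gaussian with variance 2/fan_in,
  # scaled to the layer size to keep activations stable.
  fan_in, fan_out = 256, 128
  W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
  print(W.std())      # ≈ 0.088 = √(2/256)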

3.2 Exponential Distribution (λ)

Models the time between events in a Poisson process.

  • Generative Story: How long do you have to wait for the next bus (if buses arrive randomly)?
  • Parameter: λ (rate parameter, the same λ as the underlying Poisson process).
  • Memoryless Property: P(T > t+s | T > s) = P(T > t). Past waiting time doesn’t affect future waiting time.
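
The memoryless property can be verified numerically with SciPy's survival function, sf(x) = P(T > x). Note that SciPy parameterizes the Exponential by scale = 1/λ; the rate here is an assumption:

  from scipy import stats

  lam = 0.5                       # one bus every 2 minutes on average
  T = stats.expon(scale=1 / lam)  # SciPy uses scale = 1/λ

  t, s = 3.0, 4.0
  lhs = T.sf(t + s) / T.sf(s)     # P(T > t+s | T > s)
  rhs = T.sf(t)                   # P(T > t)
  print(lhs, rhs)                 # both ≈ 0.223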

4. Interactive Visualizer: The Distribution Explorer

Select a distribution and tweak its parameters to see how the shape changes. Use the Toggle CDF control to switch between the Density/Mass view (PDF/PMF) and the Cumulative Distribution Function (CDF).
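
If you are reading a static copy of this chapter, the sketch below (Matplotlib and SciPy assumed) mimics the explorer's two views for one arbitrary choice of distribution; swap in any of the distributions above:

  import numpy as np
  from scipy import stats
  import matplotlib.pyplot as plt

  dist = stats.norm(loc=0, scale=1)  # or stats.expon(scale=2.0), etc.
  x = np.linspace(-4, 4, 400)

  fig, (ax_pdf, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))
  ax_pdf.plot(x, dist.pdf(x))
  ax_pdf.set_title("PDF")
  ax_cdf.plot(x, dist.cdf(x))
  ax_cdf.set_title("CDF")
  plt.show()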

5. Summary

  • Bernoulli: 1 coin flip.
  • Binomial: n coin flips.
  • Poisson: Counts per hour.
  • Gaussian: The Bell Curve (sums of many independent things).
  • Exponential: Waiting time.

Next: Expectation & Variance →