Capstone: Transformers & VAEs
1. Introduction: The Modern Titans
This entire module leads to this moment. The two architectures that define modern AI:
- Transformers (GPT, BERT, Claude): The masters of Sequence and Context.
- Variational Autoencoders (VAEs, Stable Diffusion): The masters of Generation and Latent Space.
They rely entirely on the math we just covered: Dot Products, Softmax, Entropy, and Gaussian Distributions.
2. Transformers: Attention is All You Need
Before 2017, we processed text sequentially with RNNs: “Read word 1, then word 2, then word 3…” This was slow and tended to lose long-range context.
The Transformer reads the entire sentence at once (in parallel). But how does it know the order? And how does it know “it” refers to “animal”?
The Core Mechanism: Self-Attention
Imagine a database lookup.
- Query (Q): What am I looking for? (e.g., “it”)
- Key (K): What defines this word? (e.g., “animal”, “street”)
- Value (V): The actual content.
We compute the similarity between the Query and every Key using a Dot Product.
- $QK^T$: Compute similarity scores (Dot Product).
- Divide by $\sqrt{d_k}$: Scale the scores down so gradients don’t explode.
- Softmax: Convert scores to probabilities (sum to 1).
- Multiply by V: Get the weighted sum of meanings.
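To make these four steps concrete, here is a minimal NumPy sketch of single-head self-attention. The `Q`, `K`, `V` matrices here are random toy inputs; in a real Transformer they come from learned linear projections of the token embeddings, so the shapes, names, and seed are illustrative assumptions rather than reference code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1-2: similarity, then scaling
    weights = softmax(scores, axis=-1)   # step 3: each row sums to 1
    return weights @ V, weights          # step 4: weighted sum of the values

# Toy example: 4 tokens, key/value dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = self_attention(Q, K, V)
print(weights.round(2))  # row i = how much token i attends to each token
```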
3. Interactive Visualizer: The Attention Graph
See how Self-Attention links words. Sentence: “The animal didn’t cross the street because it was too tired.”
Instructions:
- Hover over any word (e.g., “it”).
- See which other words light up.
- “it” attends strongly to “animal” (because animals get tired).
- It attends weakly to “street”.
This “link” is the Attention Weight. The thicker the line, the higher the attention weight, i.e. the softmax of the scaled dot product $QK^T / \sqrt{d_k}$.
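If you want to reproduce this kind of attention map yourself, one option (an assumption on my part, not something the original visualizer is tied to) is to pull the attention weights out of a pretrained BERT model with the Hugging Face `transformers` library. The exact pattern varies by model, layer, and head, so treat this as a way to inspect the weights rather than a guarantee that “it” → “animal” is always the strongest link.

```python
import torch
from transformers import AutoModel, AutoTokenizer

sentence = "The animal didn't cross the street because it was too tired."
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq).
# Average over the heads of the last layer and read off the row for "it".
attn = outputs.attentions[-1][0].mean(dim=0)
row = attn[tokens.index("it")]
for tok, w in sorted(zip(tokens, row.tolist()), key=lambda p: -p[1])[:5]:
    print(f"{tok:10s} {w:.3f}")
```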
4. VAE: The Reparameterization Trick
Variational Autoencoders try to compress data into a Latent Space ($z$). This latent space is a map of concepts (e.g., “Smile Vector”, “Age Vector”).
The Problem
We need to sample $z$ from a distribution $N(\mu, \sigma^2)$ to generate new images. But we can’t backpropagate gradients through a random sampling node: the randomness breaks the chain rule.
The Solution: Reparameterization
Move the randomness aside. Instead of sampling $z$ directly, we define:

$z = \mu + \sigma \odot \epsilon$

where $\epsilon \sim N(0, 1)$ (Standard Noise). Now gradients can flow through $\mu$ and $\sigma$ to update the encoder, treating $\epsilon$ as a constant input.
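A minimal NumPy sketch of the trick, assuming (as is common in VAE implementations, though the text doesn’t fix this) that the encoder predicts $\mu$ and $\log\sigma^2$ for each latent dimension:

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """Sample z = mu + sigma * epsilon with epsilon ~ N(0, 1).

    All the randomness lives in epsilon, so gradients can flow
    through mu and sigma (here sigma = exp(0.5 * log_var)).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(0.5 * log_var)
    epsilon = rng.normal(size=mu.shape)   # standard Gaussian noise
    return mu + sigma * epsilon

# Toy encoder output for a 2-dimensional latent space.
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -2.0])           # sigma ~ [1.0, 0.37]
print(reparameterize(mu, log_var))
```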
The Loss Function (ELBO)
We minimize two things:
- Reconstruction Loss: Does the output look like the input? (MSE or Cross-Entropy).
- KL Divergence: Does the latent space look like a standard Gaussian $N(0, 1)$? (Regularization).
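Putting the two terms together, here is a hedged NumPy sketch of the (negative) ELBO using an MSE reconstruction term and the closed-form KL divergence between a diagonal Gaussian and $N(0, 1)$; real implementations often average or re-weight these terms, so the exact scaling is an assumption.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO = reconstruction loss + KL(N(mu, sigma^2) || N(0, 1))."""
    # Reconstruction term: squared error between input and reconstruction.
    recon = np.sum((x - x_hat) ** 2)
    # Closed-form KL divergence for a diagonal Gaussian vs. the standard normal.
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

x = np.array([0.9, 0.1, 0.4])      # input
x_hat = np.array([0.8, 0.2, 0.5])  # decoder output
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -2.0])
print(vae_loss(x, x_hat, mu, log_var))
```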
5. Summary
- Transformers: Use dot products to find relationships between words.
- Multi-Head Attention: Learning multiple relationships (Grammar, Meaning) in parallel.
- VAEs: Use Gaussian tricks to generate new data from a structured latent space.