The Transformer Architecture

[!NOTE] This module explores the core principles of The Transformer Architecture, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. “Attention Is All You Need”

In 2017, Google researchers published a paper that changed everything. Before Transformers, the dominant models were RNNs (Recurrent Neural Networks), which process text sequentially, one word at a time. This made them slow to train and prone to forgetting the beginning of long sequences.

The Transformer processes the entire sentence at once (Parallelism) and uses a mechanism called Self-Attention to understand relationships between words regardless of how far apart they are.

2. The GPT Architecture (Decoder-Only)

GPT stands for Generative Pre-trained Transformer. Unlike the original Transformer (which translated text using an Encoder and Decoder), GPT uses only the Decoder stack.

  1. Input Embedding: Convert tokens to vectors.
  2. Positional Encoding: Add info about word order (since the model processes everything in parallel).
  3. Self-Attention Layers: The magic happens here. Words “look at” each other.
  4. Feed-Forward Networks: Process the information.
  5. Output: Predict the next token probability.
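The five steps above can be sketched as a tiny, hypothetical decoder (the class name, sizes, and layers here are illustrative; real GPT models add residual connections, layer normalization, a causal mask, and many stacked blocks):

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=100, d_model=32, max_len=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # 1. Input Embedding
        self.pos = nn.Embedding(max_len, d_model)        # 2. Positional Encoding (learned)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)  # 3. Self-Attention
        self.ffn = nn.Sequential(                        # 4. Feed-Forward Network
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.out = nn.Linear(d_model, vocab_size)        # 5. Output projection

    def forward(self, tokens):
        x = self.embed(tokens) + self.pos(torch.arange(tokens.size(1)))
        x, _ = self.attn(x, x, x)   # every token "looks at" every other token
        x = self.ffn(x)
        return self.out(x)          # logits over the vocabulary, per position

tokens = torch.randint(0, 100, (1, 8))   # a batch of 1 "sentence" with 8 tokens
logits = TinyDecoder()(tokens)
print(logits.shape)  # torch.Size([1, 8, 100]): next-token scores at each position
```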

3. Understanding Self-Attention

Imagine reading the sentence: “The animal didn’t cross the street because it was too tired.”

To understand what “it” refers to, you subconsciously link “it” back to “animal”.

  • A simpler model might mistakenly link “it” to “street”.
  • In a Transformer, the “Attention Mechanism” calculates a score linking “it” to every other word, and the link between “it” and “animal” receives a high score (weight).

The Analogy: A Filing System

For every word, the model generates three vectors:

  1. Query (Q): What am I looking for? (e.g., “I am a pronoun looking for my noun”)
  2. Key (K): What do I contain? (e.g., “I am a noun, specifically an animal”)
  3. Value (V): What content should I pass along?

The attention score is essentially the dot product Query · Key. When a Query and a Key match (a high dot product), that word’s Value is weighted heavily in the output.
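The matching can be shown with tiny, hypothetical vectors (the 4-dimensional values below are made up purely to illustrate the dot product; real models use learned vectors with hundreds of dimensions):

```python
import numpy as np

# Hypothetical 4-dimensional vectors for three words in the sentence.
q_it     = np.array([1.0, 0.0, 1.0, 0.0])  # Query from "it": looking for a noun/animal
k_animal = np.array([1.0, 0.0, 1.0, 0.0])  # Key of "animal": matches that query
k_street = np.array([0.0, 1.0, 0.0, 1.0])  # Key of "street": points elsewhere

print(q_it @ k_animal)  # 2.0 -> high score, so "it" attends to "animal"
print(q_it @ k_street)  # 0.0 -> low score, so "street" is mostly ignored
```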


4. 🕹️ Interactive: Attention Visualizer

Hover over words to see which other words they are “attending” to (paying attention to).

  • Notice how “bank” in the first sentence attends to “money”.
  • Notice how “bank” in the second sentence attends to “river”.
  • This is how LLMs resolve ambiguity (Polysemy).

5. Code Example: Scaled Dot Product Attention

Here is a simplified Python implementation of the core mathematical operation.

```python
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """
    Compute 'Scaled Dot Product Attention'.
    Args:
        query: (batch_size, seq_len, d_k)
        key:   (batch_size, seq_len, d_k)
        value: (batch_size, seq_len, d_v)
    """
    d_k = query.size(-1)

    # 1. MatMul: calculate scores (Q * K^T).
    # The higher the dot product, the more related the words are.
    scores = torch.matmul(query, key.transpose(-2, -1))

    # 2. Scale: divide by sqrt(d_k) for numerical stability.
    scores = scores / math.sqrt(d_k)

    # 3. Softmax: convert scores to probabilities (each row sums to 1).
    attention_weights = F.softmax(scores, dim=-1)

    # 4. MatMul: weight the Values by the attention probabilities.
    output = torch.matmul(attention_weights, value)

    return output, attention_weights
```
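As a quick sanity check, you can run the function on random tensors (the shapes below are arbitrary; the function is repeated here in condensed form so the snippet runs on its own) and verify that every row of attention weights is a probability distribution:

```python
import math

import torch
import torch.nn.functional as F

# Condensed version of the function above.
def scaled_dot_product_attention(query, key, value):
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights

torch.manual_seed(0)
q = torch.randn(2, 5, 16)   # batch of 2, sequence of 5 tokens, d_k = 16
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 32)   # d_v = 32 need not equal d_k

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape)         # torch.Size([2, 5, 32])
print(weights.shape)        # torch.Size([2, 5, 5]): one row of weights per query token
print(weights.sum(dim=-1))  # every row sums to 1.0
```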

[!TIP] Why Scaled?: As the dimension d_k grows, the dot products grow in magnitude (their variance scales with d_k). This pushes the Softmax function into regions where it has extremely small gradients (vanishing gradients). Dividing by sqrt(d_k) brings the variance of the scores back to roughly 1.
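The saturation effect is easy to see numerically (random scores; the sizes here are chosen just for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)
k = torch.randn(10, d_k)        # keys for 10 candidate words

raw = k @ q                     # unscaled scores: std is roughly sqrt(d_k) ~ 22.6
scaled = raw / d_k ** 0.5       # scaled scores: std is roughly 1

# Unscaled, the largest score dominates and the softmax saturates
# (its max is typically very close to 1.0); scaled, the distribution is softer.
print(F.softmax(raw, dim=-1).max())
print(F.softmax(scaled, dim=-1).max())
```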