[!NOTE] This module explores the core principles of The Transformer Architecture, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Bottleneck of Sequence Models: A War Story

Before Transformers took over the world, Natural Language Processing (NLP) relied heavily on RNNs (Recurrent Neural Networks) and their gated variant, LSTMs (Long Short-Term Memory networks). Let’s look at a real-world problem we faced around 2016.

Imagine trying to translate a 50-page legal document using an RNN. An RNN reads a sentence sequentially, word by word. It takes the first word, processes it, passes a “hidden state” to the next step, takes the second word, and so on.

This created two massive problems in production:

  1. The Information Bottleneck: By the time the RNN reached word 100, the “hidden state” (a fixed-size vector) was so compressed it essentially “forgot” what the first word was. For long-range dependencies—like linking a pronoun at the end of a paragraph to a noun at the beginning—RNNs failed miserably.
  2. The Hardware Nightmare (Sequential Processing): Because step N depended on the output of step N-1, we couldn’t parallelize the processing. Even with a cluster of powerful GPUs, the GPU cores sat idle waiting for the previous word’s computation to finish. We were bottlenecked by sequential processing.
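To make the sequential bottleneck concrete, here is a toy RNN step loop (the sizes and random weights are illustrative, not a trained model). Notice that each iteration needs the previous hidden state before it can even start:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy RNN: hidden size 4, a sequence of 6 input tokens of dimension 4.
W_h = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights
W_x = rng.normal(size=(4, 4)) * 0.1   # input-to-hidden weights
inputs = rng.normal(size=(6, 4))      # one vector per token

h = np.zeros(4)  # the fixed-size hidden state: the information bottleneck
for x in inputs:
    # Step N cannot start until step N-1 has produced h: no parallelism,
    # and everything seen so far must be squeezed into this one vector.
    h = np.tanh(W_h @ h + W_x @ x)
```

However long the sequence gets, `h` stays the same size, which is exactly why early context gets crushed.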

In 2017, Google researchers published the paper “Attention Is All You Need”, introducing the Transformer. It solved both problems simultaneously by completely discarding recurrence and processing the entire sequence at once (Parallelism), using a mechanism called Self-Attention to understand relationships between words regardless of how far apart they are.

2. The GPT Architecture (Decoder-Only)

GPT stands for Generative Pre-trained Transformer. The original Transformer (used for translation) had two halves: an Encoder (which reads the input) and a Decoder (which generates the output). OpenAI realized that if your only goal is to predict the next word (autoregression), you can throw away the Encoder entirely. GPT uses only a stack of Decoder blocks.
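The autoregressive loop itself can be sketched in a few lines. The `toy_model` below is a stand-in that returns made-up scores, not a real decoder stack; the point is the shape of the loop: score the whole prefix, append one token, repeat.

```python
# A minimal sketch of autoregressive (next-token) generation. The "model"
# here is a toy stand-in over a 5-token vocabulary; a real GPT would run
# the token sequence through its full decoder stack.
def toy_model(tokens):
    # Made-up logits that favor token (last + 1) % 5.
    last = tokens[-1]
    return [1.0 if i == (last + 1) % 5 else 0.0 for i in range(5)]

def generate(model, tokens, max_new_tokens=4):
    for _ in range(max_new_tokens):
        logits = model(tokens)                  # score every vocab entry
        next_token = logits.index(max(logits))  # greedy: pick the argmax
        tokens = tokens + [next_token]          # output becomes input
    return tokens

print(generate(toy_model, [0]))  # → [0, 1, 2, 3, 4]
```

Each generated token is fed back in, which is why a decoder-only model never needs a separate Encoder for this task.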

Let’s break down the anatomy of a request flowing through the GPT architecture:

  1. Tokenization & Input Embedding: Text is split into chunks (tokens). Each token is mapped to a high-dimensional vector (e.g., 768 dimensions in the original GPT; several thousand in modern models). This vector represents the “meaning” of the token.
  2. Positional Encoding: Because the Transformer reads everything in parallel, it doesn’t know the order of the words. “The dog bit the man” looks identical to “The man bit the dog.” We inject sinusoidal signals (sine and cosine waves at different frequencies) or learned position vectors into the embeddings to give the model a sense of sequence and distance.
  3. Self-Attention Layers: This is the core engine. Here, words dynamically “look at” each other to build context. We will deep dive into this shortly.
  4. Feed-Forward Networks (FFN): After attention has mixed the context across words, each word’s representation passes through a massive Multi-Layer Perceptron (MLP) independently. If Self-Attention is about communication between tokens, the FFN is about computation and recalling memorized facts.
  5. Output Projection & Softmax: The final vectors are projected back into a vocabulary-sized vector of scores (logits). We apply a Softmax function to get a probability distribution over the vocabulary, then either pick the most likely next token (greedy decoding) or sample from the distribution.
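The sinusoidal variant of step 2 can be sketched directly from the formulas in the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the sizes below are illustrative:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # the 2i values
    angles = positions / np.power(10000, dims / d_model)  # (seq_len, d/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_positions(seq_len=50, d_model=16)
# These vectors are simply added to the token embeddings before layer 1.
```

Because each dimension oscillates at a different frequency, every position gets a unique “fingerprint”, and nearby positions get similar ones — which is what gives the model a sense of distance.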


3. Understanding Self-Attention: The Matrix “Filing System”

Imagine reading the sentence: “The animal didn’t cross the street because it was too tired.”

To understand what “it” refers to, you subconsciously link “it” back to “animal”.

  • A naive model might link “it” to the nearest noun, “street”.
  • In a Transformer, the Attention Mechanism calculates a score linking “it” to every other word.

The Analogy: A Key-Value Store

Think of Self-Attention as a fuzzy dictionary or filing system in a massive library. For every token, the model projects the input embedding into three distinct vectors using learned weight matrices (W_Q, W_K, W_V):

  1. Query (Q): What am I looking for? (e.g., the word “it” says: “I am a pronoun, I am looking for a singular noun that can be tired.”)
  2. Key (K): What do I contain? (e.g., the word “animal” says: “I am a singular noun, I am a living creature.”)
  3. Value (V): What content should I pass along if selected? (e.g., the exact semantic features of “animal”.)

The Computation: The attention score between two tokens is essentially the dot product of one token’s Query with the other token’s Key: Query • Key.

  • If the Query of “it” aligns perfectly with the Key of “animal”, the dot product is high.
  • We apply a Softmax function across all scores so they sum to 1 (e.g., “animal” gets 0.8, “street” gets 0.1).
  • We multiply these probabilities by the Value vectors and sum them up. The new representation for “it” is now an 80% blend of “animal” and a 10% blend of “street”. This is how LLMs resolve ambiguity and maintain context.
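Here is that walkthrough in miniature, with made-up 2-dimensional vectors chosen so that the Key of “animal” aligns with the Query of “it” far better than the Key of “street” does:

```python
import numpy as np

# Toy resolution of "it": hypothetical 2-D Keys/Values for illustration.
query_it = np.array([1.0, 0.0])
keys = {"animal": np.array([2.0, 0.0]), "street": np.array([0.0, 1.0])}
values = {"animal": np.array([1.0, 1.0]), "street": np.array([-1.0, 0.5])}

scores = np.array([query_it @ k for k in keys.values()])  # dot products
weights = np.exp(scores) / np.exp(scores).sum()           # softmax
output = sum(w * v for w, v in zip(weights, values.values()))

# "animal" gets ~0.88 of the weight, so the new representation of "it"
# is dominated by the Value vector of "animal".
```

The exact numbers are invented, but the mechanics — dot product, softmax, weighted sum of Values — are precisely the computation described above.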


4. 🕹️ Interactive: Attention Visualizer

Hover over words to see which other words they are “attending” to (paying attention to).

  • Notice how “bank” in the first sentence attends to “money”.
  • Notice how “bank” in the second sentence attends to “river”.
  • This is how LLMs resolve ambiguity (Polysemy).

5. Code Example: Scaled Dot Product Attention

Here is a simplified Python implementation of the core mathematical operation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """
    Compute 'Scaled Dot Product Attention'.
    Args:
        query: (batch_size, seq_len, d_k)
        key:   (batch_size, seq_len, d_k)
        value: (batch_size, seq_len, d_v)
    """
    d_k = query.size(-1)

    # 1. MatMul: Calculate scores (Q * K^T)
    # The higher the dot product, the more related the words are.
    scores = torch.matmul(query, key.transpose(-2, -1))

    # 2. Scale: Divide by sqrt(d_k) for numerical stability
    scores = scores / np.sqrt(d_k)

    # 3. Softmax: Convert scores to probabilities (each row sums to 1)
    attention_weights = F.softmax(scores, dim=-1)

    # 4. MatMul: Weighted sum of the Value vectors
    output = torch.matmul(attention_weights, value)

    return output, attention_weights
```
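One detail worth noting: a GPT-style decoder additionally applies a causal mask before the softmax, so each token can only attend to positions at or before it — otherwise the model could cheat by looking at the very token it is supposed to predict. A sketch of how the mask plugs into the same steps, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 5, 8
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)
v = torch.randn(1, seq_len, d_k)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # same scaled dot products

# Causal mask: position i may only attend to positions <= i.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))  # -inf -> weight 0

weights = F.softmax(scores, dim=-1)
output = weights @ v

# The first token can only attend to itself.
print(weights[0, 0])  # tensor([1., 0., 0., 0., 0.])
```

Since PyTorch 2.0 this whole operation (including the mask) also ships as a fused built-in, `torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)`.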

[!TIP] Why Scaled?: As the dimension d_k grows, the raw dot products grow with it (their standard deviation scales with sqrt(d_k)). Large scores push the Softmax function into saturated regions where it has extremely small gradients (vanishing gradients). Dividing by sqrt(d_k) brings the variance of the scores back to roughly 1.
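That growth is easy to verify numerically (the choice of d_k and sample count below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 1000 dot products between random vectors with unit-variance components.
qs = rng.normal(size=(1000, d_k))
ks = rng.normal(size=(1000, d_k))
raw = (qs * ks).sum(axis=1)

print(raw.std())                    # ≈ sqrt(512) ≈ 22.6: scores blow up
print((raw / np.sqrt(d_k)).std())   # ≈ 1.0 after scaling
```

A softmax over scores with a spread of ~22 is essentially one-hot, which is where the gradients die; after scaling, the spread is back near 1.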