Pre-Training

[!IMPORTANT] Much of the “magic” of modern language models comes from Self-Supervised Learning. Pre-training needs no hand-labeled data, because the text itself supplies the prediction targets.

1. The Paradigm Shift

Before 2018, training a language model was like teaching a toddler a highly specialized skill—like performing calculus—before they even knew how to speak. We trained models from scratch for every task.

  • Old Way: Initialize random weights → Train on specific task (e.g., Sentiment Analysis). The model has to learn both the English language AND what “sentiment” means at the exact same time.
  • New Way (Transfer Learning): Pre-train on a very large corpus of unlabeled text (much of it scraped from the web) → Fine-tune on specific task. This is like sending a child to elementary school to learn general reading, writing, and logic, and then teaching them calculus.

This process allows the model to learn grammar, facts, and reasoning during pre-training, which it then adapts to downstream tasks.

2. Encoder-Only: BERT (Auto-Encoding)

BERT (Bidirectional Encoder Representations from Transformers) is designed to understand text. It is the Proofreader. It sees the entire sentence at once, looking both left and right simultaneously.

Objective: Masked Language Modeling (MLM)

We randomly hide 15% of the tokens and ask the model to guess them.

  • Input: The [MASK] sat on the mat.
  • Target: cat

This forces the model to use bidirectional context (left and right) to infer the missing word.
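
To make the input/target pairing concrete, here is a minimal sketch in plain Python. The helper `make_mlm_example` is a hypothetical name used only for illustration, and it takes the positions to hide as an argument instead of sampling 15% of them at random.

```python
def make_mlm_example(tokens, mask_positions, mask_token="[MASK]"):
    """Hide the tokens at `mask_positions`; return the masked input and the targets."""
    hidden = set(mask_positions)
    masked_input = [mask_token if i in hidden else tok for i, tok in enumerate(tokens)]
    # The model is only asked to predict the hidden positions.
    targets = {i: tokens[i] for i in sorted(hidden)}
    return masked_input, targets


tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
masked_input, targets = make_mlm_example(tokens, [1])
print(" ".join(masked_input))  # The [MASK] sat on the mat .
print(targets)                 # {1: 'cat'}
```

Note that the tokens on both sides of the hidden position stay visible, which is what makes the context bidirectional.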

3. Decoder-Only: GPT (Auto-Regressive)

GPT (Generative Pre-trained Transformer) is designed to generate text. It is the Improviser. It cannot look ahead; it can only build upon what has already been said.

Objective: Causal Language Modeling (CLM)

Predict the next token based strictly on all previous tokens.

  • Input: The cat sat on the
  • Target: mat

Each prediction is harder than in BERT’s Masked Language Modeling because the model sees only the left context (it is blind to the future). However, this constraint is exactly what enables text generation: at generation time there is no future text to look at.
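
The two ideas above, "predict the next token" and "cannot look ahead", can be sketched in a few lines of plain Python. The function names `causal_lm_pairs` and `causal_mask` are illustrative assumptions; real implementations usually build the mask as a tensor inside the attention layer.

```python
def causal_lm_pairs(tokens):
    """Return (context, next_token) training pairs for causal language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = causal_lm_pairs(tokens)
for context, target in pairs:
    print(" ".join(context), "->", target)

mask = causal_mask(4)
for row in mask:
    print(row)
```

A single sentence yields one training pair per position, and the zeros above the diagonal of the mask are what hide the future from each position.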

The Auto-Regressive Engine

“Auto-regressive” simply means that the output of one step becomes the input of the next step. It is a loop:

  1. Input: The → Predict: cat
  2. Input: The cat → Predict: sat
  3. Input: The cat sat → Predict: on

Every time GPT generates a word, it appends that word to its own prompt and feeds the entire sequence back into itself.
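
The loop can be sketched as follows. The lookup table `NEXT_TOKEN` is a hypothetical stand-in for a trained model and only looks at the last token; a real GPT conditions on the whole context and samples from a probability distribution.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next one.
NEXT_TOKEN = {
    "The": "cat", "cat": "sat", "sat": "on",
    "on": "the", "the": "mat", "mat": "<eos>",
}


def predict_next(context):
    """Toy 'model': picks the next token from the last token of the context."""
    return NEXT_TOKEN.get(context[-1], "<eos>")


def generate(prompt, max_new_tokens=10):
    """Greedy auto-regressive loop: each output is appended and fed back in."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = predict_next(context)
        if token == "<eos>":
            break
        context.append(token)  # the output of this step becomes input to the next
    return context


generated = generate(["The"])
print(" ".join(generated))  # The cat sat on the mat
```

The `max_new_tokens` cap is a design choice worth keeping even in a toy: without an end-of-sequence token, the loop would otherwise never stop.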

4. Encoder-Decoder: T5 (Seq2Seq)

T5 (Text-to-Text Transfer Transformer) is the Multi-Tool. It treats absolutely every NLP problem as a “text-to-text” problem. You feed it a text string, and it spits out a text string.

Objective: Span Corruption

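During pre-training, T5 uses a variant of MLM called Span Corruption. Instead of masking individual words, it replaces each consecutive span of text with a unique sentinel token and asks the decoder to generate the missing spans in order, each introduced by its sentinel.

  • Input: The <X> sat on the <Y>.
  • Target: <X> cat <Y> mat <Z>

Here each span happens to be a single word, and the final sentinel <Z> marks the end of the target.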
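
A minimal sketch of how such an input/target pair could be built, assuming the spans are given explicitly instead of being sampled at random. The function `span_corrupt` and the sentinel names `<X>`, `<Y>`, `<Z>` follow the example above and are illustrative; they are not taken from any particular library.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each span with a sentinel; the target lists the dropped spans.

    `spans` must be sorted, non-overlapping, half-open (start, end) index pairs.
    """
    if len(spans) >= len(sentinels):
        raise ValueError("need one sentinel per span plus a final closing sentinel")
    corrupted, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # the span collapses to one sentinel
        target.append(sentinel)
        target.extend(tokens[start:end])        # the decoder must reproduce the span
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(sentinels[len(spans)])        # closing sentinel ends the target
    return corrupted, target


tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
corrupted, target = span_corrupt(tokens, [(1, 2), (5, 6)])
print(" ".join(corrupted))  # The <X> sat on the <Y> .
print(" ".join(target))     # <X> cat <Y> mat <Z>
```

Passing a longer span such as `(1, 3)` collapses two words into a single sentinel, which is the difference from token-level masking.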

The Text-to-Text Framework

Because T5 is an Encoder-Decoder, it naturally handles sequence-to-sequence mappings. During fine-tuning, the task itself is embedded in the input as a text prefix:

  • Translation: Input: "translate English to German: That is good." → Output: "Das ist gut."
  • Summarization: Input: "summarize: The quick brown fox jumps over the lazy dog..." → Output: "A fox jumps."
  • Classification: Input: "cola sentence: The course is jumping well." → Output: "not acceptable"

This elegant framing means you can use the exact same model, loss function, and hyperparameters for radically different NLP tasks.
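
As a rough illustration, the framing amounts to string formatting before the text reaches the model. The dictionary keys and the helper `to_text_to_text` below are hypothetical names; only the prefix strings come from the examples above.

```python
# Prefix strings taken from the examples above; the keys are illustrative labels.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task, text):
    """Turn any supported task into a plain input string for a seq2seq model."""
    return TASK_PREFIXES[task] + text


examples = [
    to_text_to_text("translate_en_de", "That is good."),
    to_text_to_text("cola", "The course is jumping well."),
]
for line in examples:
    print(line)
```

Because every task ends up as a string in and a string out, even a classification label is produced by generating its text.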

5. Scaling Laws

Why do models keep getting bigger?

Researchers found that the model's loss falls roughly as a power law as each of the following grows, provided the other two are not the bottleneck:

  1. N: Number of parameters.
  2. D: Dataset size.
  3. C: Compute used.
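
Written out, a power law of this kind has the general form below. The symbols $X_c$ and $\alpha_X$ are fitted constants introduced here for illustration; their values depend on the model family and dataset and are not given in this text.

```latex
L(X) \approx \left( \frac{X_c}{X} \right)^{\alpha_X},
\qquad X \in \{N, D, C\}, \quad \alpha_X > 0
```

Since $\alpha_X$ is positive, multiplying $X$ by a constant factor multiplies the loss by a constant factor smaller than one, which appears as a straight line on a log-log plot.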

[!TIP] The “Chinchilla” scaling laws suggest that for every doubling of model size, you should also double the training data to be compute-optimal.
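
A back-of-the-envelope sketch of what this means for a compute budget. Two assumptions are not from the text above and are only common rules of thumb: training cost is approximated as C ≈ 6 · N · D FLOPs, and the compute-optimal ratio is taken to be about 20 training tokens per parameter.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumption: rough rule of thumb
FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C is approximately 6 * N * D


def compute_optimal(compute_flops):
    """Split a FLOP budget into parameters N and tokens D with D = 20 * N."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


n1, d1 = compute_optimal(1e21)
n2, d2 = compute_optimal(4e21)  # four times the compute
print(f"1e21 FLOPs: N = {n1:.3g} params, D = {d1:.3g} tokens")
print(f"4e21 FLOPs: N = {n2:.3g} params, D = {d2:.3g} tokens")
print(f"ratio N: {n2 / n1:.2f}, ratio D: {d2 / d1:.2f}")
```

Under these assumptions, quadrupling the compute doubles both the model size and the number of training tokens, so the two grow together.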

6. Implementation: Masking Logic

Here is how you might implement the MLM masking logic in PyTorch.

import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
  """
  Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
  """
  inputs = inputs.clone()  # work on a copy so the caller's tensor is not modified in place
  labels = inputs.clone()

  # Create a mask array
  probability_matrix = torch.full(labels.shape, mlm_probability)

  # Find special tokens (CLS, SEP, PAD) -> Do NOT mask them
  special_tokens_mask = [
    tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True)
    for val in labels.tolist()
  ]
  probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)

  # Select tokens to mask
  masked_indices = torch.bernoulli(probability_matrix).bool()
  # -100 is the default ignore_index of PyTorch's cross-entropy loss,
  # so only the masked positions contribute to the loss
  labels[~masked_indices] = -100

  # 80% of the time, replace with [MASK]
  indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
  inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

  # 10% of the time (half of the remaining 20%), replace with a random token
  indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
  random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
  inputs[indices_random] = random_words[indices_random]

  # The remaining 10% stay as the original word (but we still predict them)

  return inputs, labels
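
To show why the unmasked positions are labeled -100, here is a small sketch of the loss computation that would consume these labels. The random `logits` tensor stands in for real model output, and the vocabulary size and shapes are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(2, 4, vocab_size)  # stand-in model output: (batch, seq_len, vocab)

# Only one position per sequence was selected for masking; the rest are ignored.
labels = torch.tensor([[-100, 3, -100, -100],
                       [-100, -100, 7, -100]])

# Flatten to (batch * seq_len, vocab) and skip every position labeled -100.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)

# The same value computed by hand over the two masked positions only.
picked = torch.stack([logits[0, 1], logits[1, 2]])
manual = F.cross_entropy(picked, torch.tensor([3, 7]))

print(loss.item(), manual.item())
```

The two numbers match, which confirms that the ignored positions do not affect the average.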

7. Summary

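| Model | Architecture | Objective       | Best For                           |
|-------|--------------|-----------------|------------------------------------|
| BERT  | Encoder-Only | Masked LM       | Understanding, Classification, NER |
| GPT   | Decoder-Only | Causal LM       | Generation, Completion             |
| T5    | Enc-Dec      | Span Corruption | Translation, Summarization         |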