Tokenization

[!NOTE] This module explores the core principles of Tokenization, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Text vs. Numbers

Computers don’t understand words; they understand numbers. Tokenization is the process of breaking down text into smaller units called tokens and assigning each a unique integer ID.

When you feed text into ChatGPT, the very first step is converting your string into a list of integers.

Why not just use words?

  • Vocabulary Size: There are millions of distinct words in English (including names, medical terms, etc.). If every word were a token, the model’s “dictionary” would be too massive to compute efficiently.
  • Unknown Words: If the model sees a word it hasn’t seen before (e.g., “Uninstagrammable”), it would have no ID for it and would have to fall back to a generic “unknown” token (often written <UNK>), losing the word entirely.
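To make that failure mode concrete, here is a minimal sketch of a word-level tokenizer. The four-entry vocabulary is invented for illustration; the point is that every out-of-vocabulary word collapses to the same <UNK> ID:

```python
# Toy word-level tokenizer (the vocabulary here is made up for illustration).
vocab = {"the": 0, "cat": 1, "sat": 2, "<UNK>": 3}

def word_tokenize(text):
    # Any word outside the vocabulary collapses to the same <UNK> ID,
    # so the model loses all information about what the word actually was.
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(word_tokenize("the cat sat"))               # [0, 1, 2]
print(word_tokenize("the Uninstagrammable cat"))  # [0, 3, 1]
```

“Uninstagrammable” and any other unseen word map to the same ID 3, so the model cannot distinguish between them.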

Why not just use characters?

  • Efficiency: Character-level models must process much longer sequences. “Apple” is 1 word but 5 characters. Longer sequences make training far slower, and a context window measured in characters holds much less actual text.
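The length gap is easy to see directly. Assuming a naive whitespace split for the word-level view, the character-level view of the same sentence is several times longer:

```python
text = "Generative AI is transforming the world"

char_seq = list(text)     # character-level: one token per character (spaces included)
word_seq = text.split()   # word-level: one token per word

print(len(char_seq))  # 39 characters
print(len(word_seq))  # 6 words
```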

The Solution: Byte Pair Encoding (BPE)

Modern LLMs use Subword Tokenization (specifically BPE).

  • Common words are single tokens: “apple” → [1203]
  • Rare words are broken into chunks: “Uninstagrammable” → [“Un”, “inst”, “agram”, “mable”]
  • Result: Efficient vocabulary (~50k-100k tokens) that can represent any string.
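The training side of BPE can be sketched in a few lines: count adjacent pairs, merge the most frequent pair into a new token, repeat. This toy version works on characters and breaks frequency ties by first appearance; real implementations (such as the encodings tiktoken ships) are trained on huge corpora and operate on bytes:

```python
from collections import Counter

def merge_pair(tokens, pair):
    # Replace every adjacent occurrence of `pair` with one merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("banana banana")
for _ in range(3):  # three merge rounds
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    tokens = merge_pair(tokens, best)

print(tokens)  # ['banan', 'a', ' ', 'banan', 'a']
```

Each merge ("an", then "ban", then "banan") would also be added to the vocabulary as a new token, which is how BPE grows from raw characters toward whole common words.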

[!NOTE] Rule of Thumb: 1000 tokens ≈ 750 words.


2. 🕹️ Interactive: Tokenizer Playground

Visualize how different strategies break down text.

  • Character: Every letter is a token.
  • Word: Splits by space (naive).
  • BPE (Simulated): The smart way. Notice how “learning” is one token, but “tokenization” might be split.
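The playground’s three strategies can be sketched as plain functions. The BPE version here is only simulated, using a tiny hand-picked subword vocabulary with greedy longest-prefix matching; real BPE applies learned merge rules instead:

```python
def char_tokens(text):
    # Character strategy: every character is a token.
    return list(text)

def word_tokens(text):
    # Word strategy: naive split on spaces.
    return text.split(" ")

# Hand-picked subword vocabulary, purely for illustration.
SUBWORDS = {"learning", "learn", "ing", "token", "ization"}

def bpe_tokens(word):
    # Simulated BPE: greedy longest-prefix match against the vocabulary,
    # falling back to single characters for unmatched spans.
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])
            i += 1
    return out

print(bpe_tokens("learning"))      # ['learning']: common word, one token
print(bpe_tokens("tokenization"))  # ['token', 'ization']: split into chunks
```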

3. Code Example: Tokenizing with Python

OpenAI uses a library called tiktoken to handle tokenization efficiently.

import tiktoken

# Load the encoding for GPT-4
enc = tiktoken.get_encoding("cl100k_base")

text = "Generative AI is transforming the world!"

# Encode: Text -> Integers
ids = enc.encode(text)
print(ids)
# Example output: [34657, 15592, 374, 14594, 279, 1917, 0]
# (Actual IDs vary with the encoding and model version)

# Decode: Integers -> Text
decoded = enc.decode(ids)
print(decoded)
# Output: "Generative AI is transforming the world!"

# Count tokens
print(len(ids))
# Output: 7 (the length of the ID list above)

4. Why This Matters

  1. Cost: LLM APIs (like OpenAI) charge per token (input + output). Knowing how to count them helps estimate costs.
  2. Performance: LLMs struggle with tasks that depend on character-level manipulation (like “Reverse the word ‘Lollipop’”).
    • Why?: To the LLM, “Lollipop” may be a single token (e.g., [9921]). It never “sees” the individual letters, so reversing or spelling a word requires reasoning about characters it was never directly shown, which it often gets wrong.
  3. Math: LLMs often fail at simple math because numbers are tokenized inconsistently.
    • 100 might be one token.
    • 101 might be two tokens (e.g., “1” + “01”).
    • This makes learning arithmetic patterns purely from text very difficult.
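The cost point is easy to put into practice. Below is a back-of-the-envelope estimator built on the 1,000 tokens ≈ 750 words rule of thumb; the prices are placeholders, not real rates, so check your provider’s current pricing (or count tokens exactly with tiktoken, as in section 3):

```python
# Placeholder prices in USD per 1,000 tokens; real rates vary by model and provider.
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03

def estimate_tokens(text):
    # Rule of thumb: 1,000 tokens ~ 750 words, i.e. about 4/3 tokens per word.
    return round(len(text.split()) * 4 / 3)

def estimate_cost(prompt, expected_output_tokens):
    # Input and output tokens are billed at different rates.
    input_tokens = estimate_tokens(prompt)
    return (input_tokens * INPUT_PRICE_PER_1K
            + expected_output_tokens * OUTPUT_PRICE_PER_1K) / 1000

print(estimate_tokens("one two three"))               # 4
print(round(estimate_cost("one two three", 100), 5))  # 0.00304
```

Note that output tokens typically cost more than input tokens, so long completions dominate the bill even for short prompts.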