Tokenization
[!NOTE] This module explores the core principles of Tokenization, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. Text vs. Numbers
Computers don’t understand words; they understand numbers. Tokenization is the process of breaking down text into smaller units called tokens and assigning each a unique integer ID.
When you feed text into ChatGPT, the very first step is converting your string into a list of integers.
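As a minimal sketch of that first step, the toy tokenizer below maps each whole word to an integer ID. The tiny corpus, the `[UNK]` fallback name, and the helpers `build_vocab` and `encode` are illustrative assumptions, not how ChatGPT's tokenizer actually works.

```python
# Toy word-level tokenizer: one ID per whole word, plus a fallback for unseen words.
UNK = "[UNK]"  # hypothetical name for the unknown-word token

def build_vocab(corpus):
    """Assign an integer ID to every distinct lowercase word in the corpus."""
    vocab = {UNK: 0}
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each word to its ID; any word not in the vocabulary becomes UNK."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]

vocab = build_vocab("the photo is great and the photo is shareable")
ids = encode("the photo is uninstagrammable", vocab)
print(ids)  # → [1, 2, 3, 0]
```

The last word is missing from the toy vocabulary, so it collapses to ID 0 and its spelling is lost.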
Why not just use words?
- Vocabulary Size: There are millions of words in English (including names, medical terms, etc.). If every word were a token, the model’s “dictionary”, and the embedding and output layers that grow with it, would be too large to train and run efficiently.
- Unknown Words: If the model sees a word it hasn’t seen before (e.g., “Uninstagrammable”), it cannot represent it and has to fall back to a generic unknown-word token (commonly written [UNK]).
Why not just use characters?
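- Efficiency: Character-level models need to process much longer sequences. “Apple” is 1 word but 5 characters, so the same text takes roughly 5x as many tokens. That raises training and inference cost (self-attention cost grows faster than linearly with sequence length) and means a fixed context window holds far less text.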
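To make the efficiency point concrete, here is a rough sketch that compares sequence lengths for one sentence. The sentence is arbitrary, and the squared ratio is only a back-of-the-envelope proxy for self-attention cost, not a benchmark.

```python
text = "Apple pie is delicious"

word_tokens = text.split()  # word-level: one token per word
char_tokens = list(text)    # character-level: one token per character (spaces included)

length_ratio = len(char_tokens) / len(word_tokens)
# Self-attention compares every token with every other token, so its cost
# grows roughly with the square of the sequence length.
attention_ratio = length_ratio ** 2

print(len(word_tokens), len(char_tokens))  # → 4 22
print(f"{length_ratio:.1f}x longer sequence, ~{attention_ratio:.0f}x attention cost")
```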
[!TIP] Analogy: Think of tokenization like packing a suitcase.
- Character-level is like packing single threads. It takes forever to pack (long context) and you can’t see the clothes.
- Word-level is like packing entire outfits rigidly. If you have an outfit you’ve never worn before (“Uninstagrammable”), you don’t know how to pack it.
- Subword (BPE) is like packing separate shirts and pants. You can combine “Un”, “inst”, “agram”, “mable” into any new outfit efficiently.
The Solution: Byte Pair Encoding (BPE)
Modern LLMs use Subword Tokenization (specifically BPE).
- Common words are single tokens: "apple" → [1203]
- Rare words are broken into chunks: "Uninstagrammable" → "Un", "inst", "agram", "mable"
- Result: Efficient vocabulary (~50k-100k tokens) that can represent any string.
[!NOTE] Rule of Thumb: 1000 tokens ≈ 750 words.
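The core BPE training loop can be sketched in a few lines: start from characters, count adjacent pairs, and repeatedly merge the most frequent pair into a new symbol. The mini-corpus and the helpers `pair_counts` and `merge_pair` are hypothetical; production tokenizers work on bytes and add pre-tokenization rules that this sketch leaves out.

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical mini-corpus: word -> frequency. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # three merge steps; real vocabularies use tens of thousands
    counts = pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent pair (first one wins ties)
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

After three merges the frequent suffix "est" has become a single symbol, while rarer pieces are still split into characters.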
2. 🕹️ Interactive: Tokenizer Playground
Visualize how different strategies break down text.
- Character: Every letter is a token.
- Word: Splits by space (naive).
- BPE (Simulated): The smart way. Notice how “learning” is one token, but “tokenization” might be split.
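The three strategies can also be compared offline with a short sketch. The subword vocabulary below is hand-picked for illustration, and the greedy longest-match split only simulates what a trained BPE tokenizer would produce.

```python
# Hypothetical subword vocabulary; a real BPE vocabulary is learned from data.
SUBWORDS = {"learning", "token", "ization", "is", "fun"}

def char_tokens(text):
    """Character strategy: every non-space character is a token."""
    return [ch for ch in text if not ch.isspace()]

def word_tokens(text):
    """Word strategy: naive split on whitespace."""
    return text.split()

def subword_tokens(text):
    """Simulated subword strategy: greedy longest match, falling back to single characters."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in SUBWORDS or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens

text = "learning tokenization is fun"
print(len(char_tokens(text)))  # → 25
print(word_tokens(text))       # → ['learning', 'tokenization', 'is', 'fun']
print(subword_tokens(text))    # → ['learning', 'token', 'ization', 'is', 'fun']
```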
3. Code Example: Tokenizing with Python
OpenAI uses a library called tiktoken to handle tokenization efficiently.
import tiktoken
# Load cl100k_base, the encoding used by GPT-4 and GPT-3.5-turbo
enc = tiktoken.get_encoding("cl100k_base")
text = "Generative AI is transforming the world!"
# Encode: Text -> Integers
ids = enc.encode(text)
print(ids)
# Output (illustrative): [34657, 15592, 374, 14594, 279, 1917, 0]
# (Note: IDs are fixed for a given encoding but differ between encodings; run the code to see the exact values)
# Decode: Integers -> Text
decoded = enc.decode(ids)
print(decoded)
# Output: "Generative AI is transforming the world!"
# Count tokens
print(len(ids))
# Output: 7 (matches the illustrative IDs above; the exact count depends on the encoding)
4. Why This Matters
- Cost: LLM APIs (such as OpenAI’s) charge per token, counting both input and output. Knowing how to count tokens helps you estimate costs.
- Performance: LLMs struggle with tasks that depend on character-level manipulation (like “Reverse the word ‘Lollipop’”).
  - Why?: To the LLM, “Lollipop” can be a single token, e.g. [9921]. It doesn’t “see” the L or the p’s directly unless it breaks the word down, which it often fails to do for single words.
- Math: LLMs often fail at simple math because numbers are tokenized inconsistently.
  - 100 might be one token.
  - 101 might be two tokens (1, 01).
  - This makes learning arithmetic patterns purely from text very difficult.
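For the cost point in the list above, a rough estimate needs only the 1000 tokens ≈ 750 words rule of thumb. The helper names and the per-1000-token prices below are placeholders for illustration, not real quotes from any provider.

```python
def estimate_tokens(word_count, tokens_per_word=1000 / 750):
    """Apply the rough 1000-tokens-per-750-words ratio for English text."""
    return round(word_count * tokens_per_word)

def estimate_cost(input_words, output_words, price_in_per_1k, price_out_per_1k):
    """Estimate the cost of one request from word counts and per-1000-token prices."""
    tokens_in = estimate_tokens(input_words)
    tokens_out = estimate_tokens(output_words)
    return tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k

# Placeholder prices (currency units per 1000 tokens), chosen only for the arithmetic.
cost = estimate_cost(1500, 750, price_in_per_1k=0.01, price_out_per_1k=0.03)
print(estimate_tokens(1500), estimate_tokens(750))  # → 2000 1000
print(round(cost, 4))                               # → 0.05
```

For billing-grade numbers, count tokens with the actual tokenizer instead of the word-based approximation.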
5. Case Study: Why “100” vs “101” is Hard
Let’s apply the PEDALS framework to understand why LLMs struggle with math due to tokenization.
- Process Requirements: An LLM needs to predict the next token to complete a mathematical operation, e.g., 100 + 101 =.
- Estimate: Since tokenization breaks text into common chunks, numbers are not always split by digits.
- Data Model: The vocabulary may or may not contain separate tokens for 100 and 101. 100 might map to [405], while 101 maps to [406], or is split into [4] and [01].
- Architecture: The transformer receives these token embeddings. It sees [405] and [4], [01]. It doesn’t inherently “know” that [405] is related to [4] and [01] mathematically.
- Localized Details: Character-level (or digit-by-digit) tokenization would solve this, but it would lengthen sequences, using up more of the context window and slowing down training.
- Scale: When processing millions of mathematical operations, inconsistent token boundaries prevent the model from learning generalizable arithmetic rules.
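The inconsistent boundaries described above can be reproduced with a toy greedy tokenizer. The vocabulary and its IDs are made up for illustration (only the [405] for 100 echoes the example in the text), and greedy longest match stands in for a real BPE encoder.

```python
# Hypothetical vocabulary: "100" happens to be a single token, "101" does not.
VOCAB = {"100": 405, "1": 16, "01": 1721, "0": 15, "+": 10, "=": 28}

def tokenize(text):
    """Split text into vocabulary pieces using greedy longest match."""
    pieces = []
    for chunk in text.split():
        i = 0
        while i < len(chunk):
            for j in range(len(chunk), i, -1):
                if chunk[i:j] in VOCAB:
                    pieces.append(chunk[i:j])
                    i = j
                    break
            else:
                raise ValueError(f"no token covers {chunk[i]!r}")
    return pieces

pieces = tokenize("100 + 101 =")
ids = [VOCAB[p] for p in pieces]
print(pieces)  # → ['100', '+', '1', '01', '=']
print(ids)     # → [405, 10, 16, 1721, 28]
```

Two numbers that differ by one end up with different token counts, so the model never sees a consistent digit-by-digit layout to learn from.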
Anatomy of a Token
When a token like [34657] (Generative) is processed:
- String: "Generative"
- Integer ID: 34657 (the index in the model’s vocabulary)
- Embedding Vector: [0.12, -0.45, ..., 0.89] (a high-dimensional vector representing the token’s semantic meaning)
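The step from integer ID to embedding vector is a plain table lookup, sketched below. The vocabulary size, the 4-dimensional vectors, and the random values are assumptions for display only; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import random

VOCAB_SIZE = 50_000  # assumed vocabulary size
EMBED_DIM = 4        # tiny on purpose so the vector is easy to print

random.seed(0)
# One row per token ID. Random numbers stand in for learned weights.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

token_id = 34657                    # illustrative ID for "Generative" from the text
vector = embedding_table[token_id]  # the lookup is just row indexing
print(len(vector), vector)
```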
Because the model computes only over tokens, prompt engineering often asks it to write out intermediate steps (Chain of Thought): each extra token gives the transformer another computational step on the way to the correct answer.