Tokenization
[!NOTE] This module explores the core principles of Tokenization, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. Text vs. Numbers
Computers don’t understand words; they understand numbers. Tokenization is the process of breaking down text into smaller units called tokens and assigning each a unique integer ID.
When you feed text into ChatGPT, the very first step is converting your string into a list of integers.
Why not just use words?
- Vocabulary Size: There are millions of words in English (including names, medical terms, etc.). If every word were a token, the model’s “dictionary” would be too large to store and compute over efficiently.
- Unknown Words: If the model sees a word it hasn’t seen before (e.g., “Uninstagrammable”), it would fail, falling back to a special “unknown” (`<UNK>`) token.
Why not just use characters?
- Efficiency: Character-level models must process much longer sequences. “Apple” is 1 word but 5 characters, so sequences are roughly 5x longer. This slows training and wastes the context window.
The Solution: Byte Pair Encoding (BPE)
Modern LLMs use Subword Tokenization (specifically BPE).
- Common words are single tokens: `"apple"` → `[1203]`
- Rare words are broken into chunks: `"Uninstagrammable"` → `"Un"`, `"inst"`, `"agram"`, `"mable"`
- Result: Efficient vocabulary (~50k-100k tokens) that can represent any string.
[!NOTE] Rule of Thumb: 1000 tokens ≈ 750 words.
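To make the idea concrete, here is a toy sketch of the BPE training loop: start from characters, count adjacent pairs, and repeatedly merge the most frequent pair into a new token. This is a simplification for illustration, not the actual tiktoken implementation, and the sample text and merge count are made up:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    merged = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters and apply two merge steps.
tokens = list("low lower lowest")
for _ in range(2):
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair, "".join(pair))

# After merging ("l","o") and then ("lo","w"), "low" becomes a single token.
print(tokens)
```

Real tokenizers learn tens of thousands of such merges from a large corpus, which is why frequent words end up as single tokens while rare words split into chunks.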
2. 🕹️ Interactive: Tokenizer Playground
Visualize how different strategies break down text.
- Character: Every letter is a token.
- Word: Splits by space (naive).
- BPE (Simulated): The smart way. Notice how “learning” is one token, but “tokenization” might be split.
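The Character and Word strategies from the playground can be reproduced in a couple of lines (a real BPE split needs a trained vocabulary, so it is omitted here; the sample sentence is made up):

```python
text = "Deep learning models use tokenization"

# Character-level: every character (including spaces) is a token.
char_tokens = list(text)

# Word-level (naive): split on whitespace.
word_tokens = text.split()

print(len(char_tokens))  # 37 tokens
print(len(word_tokens))  # 5 tokens
```

The gap between the two counts is exactly the efficiency trade-off described above: characters guarantee coverage but inflate sequence length; words are compact but can’t handle anything outside the vocabulary.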
3. Code Example: Tokenizing with Python
OpenAI uses a library called tiktoken to handle tokenization efficiently.
```python
import tiktoken

# Load the encoding used by GPT-4
enc = tiktoken.get_encoding("cl100k_base")

text = "Generative AI is transforming the world!"

# Encode: Text -> Integers
ids = enc.encode(text)
print(ids)
# Output: [34657, 15592, 374, 14594, 279, 1917, 0]
# (Note: Actual IDs may vary slightly depending on exact model version)

# Decode: Integers -> Text
decoded = enc.decode(ids)
print(decoded)
# Output: "Generative AI is transforming the world!"

# Count tokens
print(len(ids))
# Output: 7
```
4. Why This Matters
- Cost: LLM APIs (like OpenAI) charge per token (input + output). Knowing how to count them helps estimate costs.
- Performance: LLMs struggle with tasks that depend on character-level manipulation (like “Reverse the word ‘Lollipop’”).
- Why?: To the LLM, “Lollipop” is a single token (e.g., `[9921]`). It doesn’t “see” the `L` or the `p`s directly unless it breaks the word down, which it often fails to do for single words.
- Math: LLMs often fail at simple math because numbers are tokenized inconsistently.
  - `100` might be one token; `101` might be two tokens (`1`, `01`).
  - This makes learning arithmetic patterns purely from text very difficult.
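The cost point above can be turned into a quick back-of-the-envelope estimator. The prices below are placeholder assumptions (check your provider’s current pricing page), and the helper names are made up for this sketch:

```python
# Placeholder prices -- NOT real rates; substitute your provider's pricing.
PRICE_PER_1K_INPUT = 0.01   # USD per 1000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.03  # USD per 1000 output tokens (assumed)

def words_to_tokens(word_count):
    """Apply the rule of thumb from above: 1000 tokens ~= 750 words."""
    return round(word_count * 1000 / 750)

def estimate_cost(input_tokens, output_tokens):
    """Estimate a request's cost in USD from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A 1500-word prompt that produces a 300-word answer:
inp = words_to_tokens(1500)   # 2000 tokens
out = words_to_tokens(300)    # 400 tokens
print(f"${estimate_cost(inp, out):.4f}")  # → $0.0320
```

For billing-accurate counts, encode the actual strings with the model’s tokenizer (as in the tiktoken example above) rather than relying on the words-to-tokens rule of thumb.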