What are LLMs?

[!NOTE] This module explores the core principles behind large language models (LLMs), working up from first principles and hardware constraints toward production-ready understanding.

1. Autocomplete on Steroids

At their core, Large Language Models (LLMs) like GPT-4, Claude, and Llama are probabilistic next-token predictors. They do not “know” facts, have consciousness, or reason like humans do.

Imagine the autocomplete on your phone. If you type “I am going to the…”, it suggests “store”, “park”, or “gym”. It does this because it has seen thousands of sentences where “store” follows that phrase.

An LLM is simply a scaled-up version of this:

  1. Scale: Instead of thousands of sentences, it has been trained on a large share of the public internet plus books and code (on the order of trillions of words).
  2. Context: Instead of looking at just the last 2 words, it can look at thousands of previous words (Context Window) to make a prediction.
  3. Complexity: Instead of a simple frequency count, it uses a neural network with billions of parameters to understand nuance, tone, and logic.
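
To make the "simple frequency count" baseline concrete, here is a minimal sketch of a phone-style autocomplete that only looks at the previous word. The corpus and the function names (`build_bigram_counts`, `suggest`) are made up for illustration; a real keyboard uses far more data and context.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for the sentences a phone keyboard has seen.
CORPUS = [
    "i am going to the store",
    "i am going to the store",
    "i am going to the park",
    "we walked to the gym",
]

def build_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def suggest(counts, prev_word, k=3):
    """Return the k most frequent followers of prev_word."""
    return [word for word, _ in counts[prev_word].most_common(k)]

counts = build_bigram_counts(CORPUS)
print(suggest(counts, "the"))  # ['store', 'park', 'gym']
```

An LLM replaces the lookup table with a neural network and the one-word context with thousands of tokens, but the task (rank candidate next tokens) is the same.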

[!NOTE] Key Concept: The model outputs a probability distribution over all possible next tokens (words or word pieces). It then selects one according to your sampling settings (such as Temperature).
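
As a rough illustration of where that distribution comes from: a model typically produces one raw score (a logit) per token, and a softmax turns those scores into probabilities. The logit values below are hypothetical, chosen only to show the mechanics.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical logits for the token after "I am going to the"
logits = {"store": 3.0, "park": 2.0, "gym": 1.0, "moon": -2.0}
probs = softmax(logits)

# Print tokens from most to least likely.
for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token:>6}: {p:.3f}")
```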


2. How it Works: The Probability Game

When you ask an LLM “What is the capital of France?”, it doesn’t look up a database row Country: France, Capital: Paris.

Instead, it calculates: “Given the sequence ‘What is the capital of France?’, what is the most likely next token?”

  • “Paris” (99.9%)
  • “The” (0.01%)
  • “London” (0.001%)

It picks “Paris”. Then it feeds “What is the capital of France? Paris” back into itself to predict the next token (maybe “.” or “is”). This is why it’s called Autoregressive Generation.
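
The loop described above can be sketched in a few lines. `TOY_MODEL` is a hypothetical lookup table standing in for a trained network, and it only looks at the last token, whereas a real LLM conditions on the whole sequence. The probabilities are illustrative.

```python
# Hypothetical stand-in for a trained model: maps the most recent token
# to a probability distribution over the next token.
TOY_MODEL = {
    "France?": {"Paris": 0.999, "The": 0.0009, "London": 0.0001},
    "Paris": {".": 0.9, "is": 0.1},
    ".": {"<eos>": 1.0},
}

def next_token_probs(tokens):
    """Stand-in for a forward pass; a real LLM uses all of `tokens`."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive generation: predict, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # greedy: take the most likely token
        if best == "<eos>":               # end-of-sequence marker stops the loop
            break
        tokens.append(best)               # feed the output back in as input
    return tokens

prompt = "What is the capital of France?".split()
print(" ".join(generate(prompt)))  # What is the capital of France? Paris .
```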



Controlling the Output: Sampling Strategies

While the model outputs probabilities, we don’t always want it to pick the single most likely token. Always taking the top choice (greedy decoding) tends to produce flat, repetitive text. We use sampling strategies to add controlled variety:

| Strategy | How it Works | Analogy | Use Case |
| :--- | :--- | :--- | :--- |
| **Temperature** | Divides the model's raw scores (logits) by *T* before they are turned into probabilities. Raising *T* flattens the distribution, giving low-probability tokens a better chance (values above 1.0 make it flatter than the model's unscaled output). Lowering *T* (e.g., 0.1) sharpens it, making the top token almost guaranteed. | **The Brainstormer**. High temp is a brainstorming session (wild ideas). Low temp is a math test (exact answers). | Creative writing (high), coding/math (low) |
| **Top-K** | Sorts the probabilities and only considers the top *K* tokens (e.g., K=40). It ignores the "long tail" of highly unlikely tokens. | **The Shortlist**. A recruiter only interviewing the top 40 candidates out of 1000 applicants. | General text generation, to avoid outright gibberish. |
| **Top-P (Nucleus)** | Sorts tokens by probability and keeps the smallest set whose cumulative probability reaches *P* (e.g., P=0.9). The pool shrinks when the model is confident and expands when it is uncertain. | **The Confidence Threshold**. A doctor only considering diagnoses that together make up 90% of the likely causes. | High-quality text generation (often preferred over Top-K). |
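
The three strategies in the table can be sketched in plain Python as follows. This is a simplified illustration under assumed inputs (the logits are hypothetical), not how any particular library implements them. Production code operates on tensors over the full vocabulary.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax into probabilities."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return renormalize(dict(ranked[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    kept, cumulative = {}, 0.0
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical logits for four candidate tokens.
logits = {"store": 3.0, "park": 2.0, "gym": 1.0, "moon": -2.0}

sharp = apply_temperature(logits, 0.1)   # top token dominates
flat = apply_temperature(logits, 2.0)    # probabilities are closer together
base = apply_temperature(logits, 1.0)    # unscaled softmax

shortlist = top_k_filter(base, 2)        # the two most likely tokens
nucleus = top_p_filter(base, 0.9)        # smallest set covering 90%

rng = random.Random(0)
print(sample(nucleus, rng))
```

In practice these filters are often combined: temperature first, then Top-K and/or Top-P, then a random draw from what remains.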

3. Code Example: Generating Text with Python

Using the Hugging Face transformers library, we can load a model like GPT-2 (a smaller ancestor of GPT-4) and see this in action.

from transformers import pipeline, set_seed

# Initialize the text generation pipeline with GPT-2
generator = pipeline('text-generation', model='gpt2')

# Set a seed for reproducibility
set_seed(42)

# The prompt (input text)
prompt = "The secret to happiness is"

# Generate text
# max_length=30: cap the total length (prompt + generated tokens) at 30 tokens
# num_return_sequences=1: Generate 1 possibility
response = generator(prompt, max_length=30, num_return_sequences=1)

print(response[0]['generated_text'])
# Example output (the exact text can differ between library versions):
# "The secret to happiness is to be happy. It is not to be happy to be happy. It is to be happy to be happy."
# (Note: GPT-2 is small and sometimes repetitive!)
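
To connect this example back to the sampling strategies, the same pipeline accepts generation settings as keyword arguments and forwards them to the model's `generate()` call. The sketch below assumes the `transformers` library is installed. The helper name `generate_with_sampling` and the specific values are illustrative assumptions, not recommendations.

```python
# Sampling settings forwarded to the model's generate() call.
# The values are illustrative assumptions, not tuned recommendations.
SAMPLING_KWARGS = {
    "do_sample": True,      # sample instead of always taking the top token
    "temperature": 0.7,     # below 1.0: somewhat sharper than the raw output
    "top_k": 40,            # consider only the 40 most likely tokens
    "top_p": 0.9,           # then keep the smallest set covering 90%
    "max_new_tokens": 25,   # counts generated tokens only, not the prompt
}

def generate_with_sampling(prompt, seed=42, **overrides):
    """Generate one continuation of `prompt` with GPT-2 using sampling."""
    # Imported here so the settings above can be inspected even when
    # transformers is not installed.
    from transformers import pipeline, set_seed

    set_seed(seed)
    generator = pipeline("text-generation", model="gpt2")
    kwargs = {**SAMPLING_KWARGS, **overrides}
    return generator(prompt, **kwargs)[0]["generated_text"]
```

Calling `generate_with_sampling("The secret to happiness is", temperature=1.2)` would download GPT-2 on first use and return a more varied continuation than a low-temperature call.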

4. Key Terminology

1. Parameters

Think of parameters as the “brain cells” or “synapses” of the model. They are numerical values (weights) that the model adjusts during training to minimize errors.

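  • GPT-2: up to 1.5 billion parameters (the default `gpt2` checkpoint used in the code example above is the smallest variant, at roughly 124 million).
  • GPT-3: 175 billion parameters.
  • GPT-4: not publicly disclosed; unofficial estimates put it at around 1.8 trillion parameters.

More parameters generally mean higher capability and broader knowledge, provided the training data and compute scale with them. They also mean more memory and slower, costlier inference.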
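
Parameter counts translate directly into hardware requirements. As a back-of-the-envelope sketch, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The figures below cover weights only and ignore activations, caches, and optimizer state.

```python
# Bytes used to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="fp16"):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB in fp16")
# GPT-2: ~3 GB in fp16
# GPT-3: ~350 GB in fp16
```

This is why a 1.5-billion-parameter model fits on a single consumer GPU while a 175-billion-parameter model has to be split across many accelerators.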

2. Training Data

These models are trained on massive datasets (Common Crawl, Books, Wikipedia, GitHub code).

  • The Goal: “Read” the internet and learn the statistical relationships between words.
  • The Result: The model learns grammar, facts, reasoning patterns, and even coding syntax implicitly.
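
One way to picture what "learning statistical relationships" means in practice: every position in a text becomes a training example of the form "given these tokens, predict the next one", and the model is penalized according to how little probability it gave the true token (cross-entropy loss). The sketch below uses whitespace-split words as tokens and hypothetical predictions. Real training uses subword tokenizers and gradient descent.

```python
import math

def next_token_pairs(tokens):
    """Turn a token sequence into (context, target) training examples."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss for one example: -log of the probability given to the true token."""
    return -math.log(predicted_probs[target])

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# Hypothetical model predictions for the context ["the", "cat"],
# where the true next token is "sat".
confident = {"sat": 0.9, "ran": 0.1}
unsure = {"sat": 0.2, "ran": 0.8}
print(cross_entropy(confident, "sat"))  # small loss: good prediction
print(cross_entropy(unsure, "sat"))     # larger loss: poor prediction
```

Training adjusts the parameters to lower this loss averaged over the whole dataset.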

3. Hallucination

Because the model is trained to produce likely text rather than verified facts, it can confidently state things that are false.

  • Why? If you ask “Who was the first person on Mars?”, the model has seen many sci-fi stories that answer exactly this kind of question. It might predict “John Boone” (a character from a novel) because that completion is highly probable given its training data, even though no one has actually been to Mars.
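
A toy sketch of the same failure mode: a predictor that returns the most frequent continuation it has seen has no notion of whether the source was fact or fiction. The training snippets below are made up for illustration, and `most_likely_completion` is a hypothetical helper, far simpler than a real model.

```python
from collections import Counter

# Made-up training snippets: fiction names a first person on Mars,
# and nothing in the data marks which lines are fiction.
TRAINING_TEXT = [
    "the first person on mars was john boone",
    "the first person on mars was john boone",
    "the first person on the moon was neil armstrong",
]

def most_likely_completion(prefix, corpus):
    """Return the most frequent continuation of `prefix` seen in the corpus."""
    continuations = Counter(
        line[len(prefix):].strip() for line in corpus if line.startswith(prefix)
    )
    if not continuations:
        return None
    return continuations.most_common(1)[0][0]

print(most_likely_completion("the first person on mars was", TRAINING_TEXT))
# john boone
```

The predictor answers fluently because the pattern is common in its data, not because the statement is true.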