What are LLMs?

[!NOTE] This module introduces what LLMs actually are, building intuition from first principles: next-token prediction, probability distributions, and scale.

1. Autocomplete on Steroids

At their core, Large Language Models (LLMs) like GPT-4, Claude, and Llama are probabilistic next-token predictors. They do not “know” facts, have consciousness, or reason like humans do.

Imagine the autocomplete on your phone. If you type “I am going to the…”, it suggests “store”, “park”, or “gym”. It does this because it has seen thousands of sentences where “store” follows that phrase.
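The phone-autocomplete analogy can be sketched in a few lines of Python. This is a toy frequency counter over a made-up mini corpus (the sentences below are illustrative, not real training data), predicting the next word purely by counting how often it followed the previous one:

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for "thousands of sentences" (hypothetical data)
corpus = [
    "i am going to the store",
    "i am going to the park",
    "i am going to the store",
    "i am going to the gym",
]

# Count which word follows each word
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

# The most frequent continuations of "the", ranked by count
print(follows["the"].most_common())
# [('store', 2), ('park', 1), ('gym', 1)]
```

An LLM replaces this raw frequency table with a neural network, but the objective is the same: rank candidate next tokens by likelihood.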

An LLM is simply a scaled-up version of this:

  1. Scale: Instead of thousands of sentences, it has read effectively the entire internet (trillions of words).
  2. Context: Instead of looking at just the last 2 words, it can look at thousands of previous words (Context Window) to make a prediction.
  3. Complexity: Instead of a simple frequency count, it uses a neural network with billions of parameters to understand nuance, tone, and logic.

[!NOTE] Key Concept: The model outputs a probability distribution over all possible next words (tokens). It then selects one based on your settings (Temperature).
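The note above can be made concrete. The sketch below applies the standard softmax-with-temperature formula to a hand-picked set of logits (the raw scores and candidate tokens are made up for illustration): lower temperature sharpens the distribution, higher temperature flattens it.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores (logits) into a probability distribution.

    Lower temperature -> sharper (more deterministic) distribution;
    higher temperature -> flatter (more random) distribution.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for candidate next tokens after "I am going to the"
tokens = ["store", "park", "gym"]
logits = [2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature=t)
    print(f"T={t}:", {tok: round(p, 3) for tok, p in zip(tokens, probs)})
```

At low temperature the top token ("store") dominates; at high temperature the three options approach equal probability, which is why high-temperature outputs feel more creative but less reliable.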


2. How it Works: The Probability Game

When you ask an LLM “What is the capital of France?”, it doesn’t look up a database row like `Country: France, Capital: Paris`.

Instead, it calculates: “Given the sequence ‘What is the capital of France?’, what is the most likely next token?”

  • “Paris” (99.9%)
  • “The” (0.01%)
  • “London” (0.001%)

It picks “Paris”. Then it feeds “What is the capital of France? Paris” back into itself to predict the next token (maybe “.” or “is”). This is why it’s called Autoregressive Generation.
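The feed-the-output-back-in loop can be sketched with a toy “model” — here, a hand-written lookup table of next-token probabilities standing in for the neural network (the table entries are illustrative, not real model outputs):

```python
# Toy next-token distributions (hypothetical, standing in for a real model)
model = {
    "the capital of France is": {"Paris": 0.999, "London": 0.001},
    "the capital of France is Paris": {".": 0.9, "is": 0.1},
    "the capital of France is Paris .": {"<end>": 1.0},
}

def generate(prompt, max_steps=5):
    sequence = prompt
    for _ in range(max_steps):
        dist = model.get(sequence)
        if dist is None:
            break
        # Greedy decoding: always pick the most probable token
        next_token = max(dist, key=dist.get)
        if next_token == "<end>":
            break
        sequence = sequence + " " + next_token  # feed the output back in
    return sequence

print(generate("the capital of France is"))
# "the capital of France is Paris ."
```

Each iteration appends one token and re-runs the prediction on the longer sequence — that loop is the “autoregressive” part of autoregressive generation.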

🕹️ Interactive: The Next Token Predictor

[Interactive widget — seed text: “The quick brown fox”. Click a suggested token to append it to the sequence and watch the top next-token probabilities change.]


3. Code Example: Generating Text with Python

Using the Hugging Face transformers library, we can load a model like GPT-2 (a much smaller predecessor of GPT-4) and see this in action.

```python
from transformers import pipeline, set_seed

# Initialize the text-generation pipeline with GPT-2
generator = pipeline('text-generation', model='gpt2')

# Set a seed for reproducibility
set_seed(42)

# The prompt (input text)
prompt = "The secret to happiness is"

# Generate text
# max_length=30: cap the total length (prompt + generated tokens) at 30 tokens
# num_return_sequences=1: return a single completion
response = generator(prompt, max_length=30, num_return_sequences=1)

print(response[0]['generated_text'])
# Output: "The secret to happiness is to be happy. It is not to be happy to be happy. It is to be happy to be happy."
# (Note: GPT-2 is small and sometimes repetitive!)
```

4. Key Terminology

1. Parameters

Think of parameters as the “brain cells” or “synapses” of the model. They are numerical values (weights) that the model adjusts during training to minimize errors.

  • GPT-2: 1.5 Billion parameters.
  • GPT-3: 175 Billion parameters.
  • GPT-4: Estimated 1.8 Trillion parameters.

More parameters generally mean higher reasoning capability and broader knowledge.
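As a back-of-envelope sanity check, the parameter count of the smallest GPT-2 variant can be estimated from its published architecture (12 layers, hidden size 768, ~50k-token vocabulary). The ~12·d² parameters per transformer block is a common rule of thumb that ignores biases and LayerNorm, so the figures are approximate:

```python
# Rough parameter count for GPT-2 small from its published architecture
n_layers = 12        # transformer blocks
d_model = 768        # hidden size
vocab_size = 50257   # BPE vocabulary size
context = 1024       # maximum positions

token_embeddings = vocab_size * d_model          # ~38.6M
position_embeddings = context * d_model          # ~0.8M
# Each block: attention (~4*d^2) + MLP (~8*d^2) ~= 12*d^2 parameters
transformer_blocks = n_layers * 12 * d_model**2  # ~84.9M

total = token_embeddings + position_embeddings + transformer_blocks
print(f"~{total / 1e6:.0f}M parameters")
# ~124M parameters, close to the commonly reported size of GPT-2 small
```

The same arithmetic scaled up (more layers, wider hidden size, longer context) is where billion- and trillion-parameter counts come from.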

2. Training Data

These models are trained on massive datasets (Common Crawl, Books, Wikipedia, GitHub code).

  • The Goal: “Read” the internet and learn the statistical relationships between words.
  • The Result: The model learns grammar, facts, reasoning patterns, and even coding syntax implicitly.

3. Hallucination

Because the model is probabilistic, it can confidently state things that are false.

  • Why?: If you ask “Who was the first person on Mars?”, the model has seen many sci-fi stories. It might predict “John Boone” (from a novel) because that completion has a high probability in its training data context, even if it’s factually wrong in the real world.