RAG Fundamentals

[!IMPORTANT] Retrieval-Augmented Generation (RAG) is the architecture that bridges the gap between an LLM’s frozen training data and your dynamic, proprietary data. It is the standard for building production AI applications.

1. The Problem with LLMs

Large Language Models (LLMs) like GPT-4 are incredibly powerful, but they have two fatal flaws when used in isolation:

  1. Hallucination: They confidently make up facts when they don’t know the answer.
  2. Knowledge Cutoff: Their training data is frozen in time. They don’t know about events that happened yesterday, or about your private company data.

Imagine an LLM as a brilliant scholar who has been locked in a library for two years. They know everything in those books, but nothing about the outside world since then. RAG is like giving that scholar an internet connection and a search engine.

2. What is RAG?

RAG is a technique that retrieves relevant information from an external knowledge base and provides it to the LLM as context before asking it to generate an answer.

The RAG Triad

  1. Retriever: Finds the most relevant documents for the user’s query from a database.
  2. Augmenter: Combines the user’s query with the retrieved documents into a single prompt.
  3. Generator: The LLM takes the augmented prompt and generates a grounded response.
User Query → Retriever (search vector DB) → Context (top-k documents) → Generator (LLM + prompt) → Response
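The triad above can be sketched end-to-end in plain Python. This is a toy illustration, not a production pattern: the keyword-overlap `retrieve` stands in for a real vector database, and the canned `generate` stands in for an actual LLM call.

```python
# A toy sketch of the retrieve -> augment -> generate triad.
DOCS = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
    "Support is available 24/7 via email.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        DOCS,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query: str, context: list[str]) -> str:
    """Combine the query and retrieved documents into a single prompt."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; a real system sends the prompt to a model."""
    return "[LLM answer grounded in]\n" + prompt

answer = generate(augment("Is shipping free?", retrieve("Is shipping free?")))
```

A real retriever replaces keyword overlap with embedding similarity, which is exactly what the chromadb example later in this chapter does.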

3. Interactive: RAG Simulator

Experience how RAG works step-by-step. Enter a query to see how the system retrieves data and generates an answer.

The simulator walks through four stages:

  1. User Query – the question you enter.
  2. Retrieval (Vector DB) – the system searches the database for matching documents.
  3. Augmented Prompt – the query and retrieved documents are combined.
  4. LLM Response – the model answers using the augmented prompt.

4. Basic RAG Implementation

Here is a minimal example of a RAG system in Python. We use chromadb as our vector database (its built-in default embedding model handles the document vectors) and OpenAI for generation.

```python
import chromadb
from openai import OpenAI

# 1. Initialize Vector DB (in-memory; chromadb embeds documents
# with its built-in default embedding model)
client = chromadb.Client()
collection = client.create_collection("knowledge_base")

# 2. Add Documents (Ingestion)
documents = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
    "Support is available 24/7 via email."
]
collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"]
)

# 3. Retrieval Function
def retrieve(query, n_results=1):
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    # query() returns one list of documents per query text
    return results['documents'][0]

# 4. Generation Function
llm_client = OpenAI(api_key="sk-...")  # replace with your API key

def generate_answer(query):
    # Retrieve context
    context_docs = retrieve(query)
    context = "\n".join(context_docs)

    # Augment Prompt
    prompt = f"""
    Answer the question based ONLY on the context below.

    Context:
    {context}

    Question: {query}
    """

    # Generate
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Usage
print(generate_answer("How long do I have to return an item?"))
# Example output: "You have 30 days to return an item based on the refund policy."
```
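Under the hood, `collection.query` embeds the query text and ranks stored documents by vector similarity. The ranking idea reduces to comparing embedding vectors, most commonly with cosine similarity, which can be sketched with toy vectors (the three-dimensional "embeddings" below are made up for illustration; real models output hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values only).
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "refund policy": [0.8, 0.2, 0.1],   # points in a similar direction
    "shipping info": [0.1, 0.9, 0.2],
    "support hours": [0.0, 0.2, 0.9],
}

# The retriever returns the document whose vector is most similar to the query.
best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best)  # -> refund policy
```

The next chapter covers how vector databases make this comparison fast at scale.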

5. Why Not Just Fine-Tuning?

A common misconception is that you should fine-tune an LLM to teach it new knowledge.

| Feature | RAG | Fine-Tuning |
| --- | --- | --- |
| Goal | Connect LLM to dynamic data | Change LLM behavior/style |
| Knowledge update | Instant (add doc to DB) | Slow (re-train model) |
| Accuracy | High (grounded in retrieved docs) | Lower (can still hallucinate) |
| Cost | Low (vector DB + inference) | High (training compute) |
| Privacy | High (data stays in DB) | Low (data baked into model) |

[!TIP] Use RAG for knowledge (facts, data). Use Fine-Tuning for behavior (tone, format, specific coding style).

6. Next Steps

In the next chapter, we will dive deep into the engine that powers retrieval: Vector Databases.