Advanced RAG Architectures
> [!WARNING]
> Naive RAG (a simple retrieve-then-generate pipeline) fails in production. It struggles with complex queries, long documents, and conflicting information.
1. The Chunking Problem
The first step in RAG is splitting your documents into smaller pieces (“chunks”). How you do this drastically affects performance.
- Fixed-Size Chunking: Split every N characters. Fast but breaks sentences.
- Recursive Chunking: Split by paragraphs, then sentences. Preserves structure.
- Semantic Chunking: Split when the topic changes (using embeddings).
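Recursive chunking is easy to sketch without any library. The function below is a minimal illustration (not a real framework API): it tries the coarsest separator first and falls back to finer ones only when a piece is still too long.

```python
# Minimal sketch of recursive chunking: split on the largest separator
# first, and only recurse with finer separators when a piece is too long.
def recursive_chunk(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate          # Piece still fits: keep accumulating.
        else:
            if current:
                chunks.append(current)   # Flush the chunk built so far.
            if len(piece) > max_len:
                # Piece alone exceeds the limit: recurse with finer separators.
                chunks.extend(recursive_chunk(piece, max_len, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because paragraph boundaries are tried first, a chunk only crosses sentence or word boundaries when a single paragraph is itself longer than `max_len`.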
2. Interactive: Chunking Visualizer
In the interactive version of this lesson, you can paste text to see how each chunking strategy splits it into context windows.
3. Improving Retrieval
Once documents are chunked, retrieval must surface the chunks most relevant to the query.
1. Hybrid Search (Keyword + Vector)
Vector search misses exact keyword matches: embeddings for near-identical strings such as product part numbers “X-99” and “X-98” are almost indistinguishable.
- Solution: Run BM25 (Keyword) and Vector Search in parallel.
- Fusion: Combine results using Reciprocal Rank Fusion (RRF).
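RRF needs no score normalization across the two systems: each document is scored only by its rank in each result list, `score(d) = Σ 1 / (k + rank)`, with `k = 60` as the commonly used constant. A minimal sketch, assuming each search returns an ordered list of document IDs:

```python
# Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank).
# Ranks are 1-based; k=60 is the constant commonly used in practice.
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked well by BOTH BM25 and vector search wins.
bm25_results   = ["d1", "d2", "d3"]
vector_results = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_results, vector_results])
```

Here `d1` (ranks 1 and 2) and `d3` (ranks 3 and 1) both beat documents that appear in only one list, which is exactly the behavior you want from fusion.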
2. Query Expansion
Users ask vague questions (“Refund?”).
- Solution: Use an LLM to rewrite the query into multiple variations.
- “What is the refund policy?”
- “How do I get my money back?”
- “Return process steps.”
- Search for all variations and deduplicate results.
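The expand-search-deduplicate loop can be sketched as follows. Both callables here are stand-ins, not real APIs: `generate_variations` represents the LLM rewrite step, and `search` represents any retriever returning `(doc_id, text)` pairs.

```python
# Query expansion sketch. `generate_variations` stands in for an LLM call
# that rewrites the query; `search` stands in for any retriever that
# returns (doc_id, text) pairs. Results are deduplicated by doc_id.
def expand_and_search(query, generate_variations, search, k=5):
    seen, merged = set(), []
    for q in [query] + generate_variations(query):
        for doc_id, text in search(q, k=k):
            if doc_id not in seen:   # Keep the first hit for each document.
                seen.add(doc_id)
                merged.append((doc_id, text))
    return merged
```

Running the original query first means its hits keep their position at the top of the merged list, with the variations only adding documents the original phrasing missed.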
3. Re-ranking (The “Secret Sauce”)
Vector DB retrieval is fast but approximate: the bi-encoder embeds each document before ever seeing the query, so fine-grained relevance is lost. A cross-encoder reads the query and document together, which is far more accurate but too slow to run over the whole corpus.
- Step 1: Retrieve Top 50 documents using Vector Search (Fast).
- Step 2: Use a Cross-Encoder model (Slow but accurate) to score each document against the query.
- Step 3: Take the Top 5 from the re-ranked list to the LLM.
```python
# Re-ranking sketch. `vector_db` is a stand-in for your vector store client;
# the cross-encoder checkpoint is a real sentence-transformers model.
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# 1. Fast, approximate retrieval: over-fetch candidates.
hits = vector_db.search(query, k=50)

# 2. Re-rank: score every (query, document) pair with the cross-encoder.
pairs = [[query, doc.text] for doc in hits]
scores = cross_encoder.predict(pairs)

# 3. Sort by score and keep only the best few for the LLM's context.
ranked_hits = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)
top_5 = [doc for doc, _score in ranked_hits[:5]]
```
4. Architecture Diagram: Advanced RAG
```
              ┌─▶ Vector Search ──┐
User Query ──▶┤                   ├─▶ Fusion (RRF) ─▶ Re-ranker ─▶ Generator
              └─▶ Keyword Search ─┘                  (Cross-Encoder)  (LLM)
```
5. Modular RAG
In production, RAG is not a linear pipeline; it’s a DAG (Directed Acyclic Graph).
- Routing: “Is this query about math?” → Route to Wolfram Alpha tool. “Is it about history?” → Route to Vector DB.
- Self-RAG: The LLM generates an answer, then critiques itself. If the confidence is low, it searches again or says “I don’t know.”
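The routing step above can be sketched with a simple heuristic. In production the classifier is usually an LLM call; here a keyword lookup stands in, and all names (`ROUTES`, the tool labels) are illustrative rather than a real API.

```python
# Minimal routing sketch: pick a downstream tool per query. A production
# router would typically ask an LLM to classify the query; a keyword
# heuristic stands in here. All names are illustrative.
ROUTES = {
    "math":    ("wolfram_alpha", ("integral", "derivative", "solve", "equation")),
    "history": ("vector_db",     ("war", "empire", "century", "revolution")),
}

def route(query, default="vector_db"):
    q = query.lower()
    for _topic, (tool, keywords) in ROUTES.items():
        if any(kw in q for kw in keywords):
            return tool
    return default
```

Because routing is just a node that picks the next edge, the same pattern composes with fusion and re-ranking into the DAG described above.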
6. Next Steps
Review everything you’ve learned in the Module Review.