# Vector Databases
> [!NOTE]
> Vector databases are the long-term memory for AI. They store data not as rows and columns, but as mathematical points in a high-dimensional space.
## 1. What are Embeddings?
Before understanding vector databases, you must understand embeddings. An embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of text.
- Input: “Apple”
- Output: `[0.12, -0.45, 0.88, ...]` (e.g., 1536 dimensions for OpenAI’s `text-embedding-3-small`)
The magic is that semantically similar words end up close together in this vector space.
- “Dog” and “Puppy” → Close together.
- “Dog” and “Car” → Far apart.
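The toy vectors below are made up purely for illustration (real embeddings have hundreds or thousands of dimensions), but they show the idea: a similarity score rates “Dog” as much closer to “Puppy” than to “Car”.

```python
import math

# Hand-made 3-dimensional "embeddings" -- illustrative only.
# Real models produce hundreds or thousands of dimensions.
dog   = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
car   = [0.0, 0.1, 0.9]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(dog, puppy))  # high -> close together
print(cosine_similarity(dog, car))    # low  -> far apart
```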
## 2. Interactive: Embedding Space Visualizer
Visualize how semantic search works in a simplified 2D space. Drag the Query Point (Red) to see which concepts are considered “similar” based on their distance.
## 3. How Vector Search Works
Traditional databases (SQL) use Keyword Search (exact match or regex). Vector databases use Similarity Search.
### Distance Metrics
To find “similar” vectors, we calculate the distance between them.
- Cosine Similarity: Measures the angle between two vectors.
  - Range: -1 to 1.
  - Use case: NLP, text similarity (magnitude doesn’t matter).
  - Formula: `A · B / (||A|| * ||B||)`
- Euclidean Distance (L2): Measures the straight-line distance.
  - Use case: Image clustering.
- Dot Product: Measures magnitude and direction.
  - Use case: Recommendation systems (where magnitude = rating).
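All three metrics can be sketched in plain Python (no external libraries; real vector databases use heavily optimized implementations):

```python
import math

def dot_product(a, b):
    # Sum of element-wise products: measures magnitude and direction.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product normalized by vector lengths: angle only, magnitude ignored.
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean_distance(a, b):
    # Straight-line (L2) distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 2.0], [3.0, 4.0]
print(dot_product(a, b))         # 11.0
print(euclidean_distance(a, b))  # ~2.83
print(cosine_similarity(a, b))   # ~0.98
```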
### Approximate Nearest Neighbor (ANN)
Searching millions of vectors by comparing every single one (Brute Force / KNN) is too slow. Vector DBs use ANN algorithms like HNSW (Hierarchical Navigable Small World).
- Trade-off: Slightly less accurate (might miss the absolute #1 closest), but blazing fast (milliseconds).
- Analogy: Instead of checking every house in the city, HNSW checks neighborhoods, then streets, then houses.
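For contrast, here is the brute-force approach that ANN indexes avoid: score the query against every stored vector and sort. It is exact but O(n) per query; HNSW indexes (available in libraries such as hnswlib and FAISS) trade a little recall for far fewer comparisons. The tiny `store` below is invented for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    )

def brute_force_search(query, store, k=2):
    # Compare the query against *every* vector -- exact, but O(n) per query.
    ranked = sorted(store, key=lambda name: cosine_similarity(query, store[name]),
                    reverse=True)
    return ranked[:k]

store = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.1, 0.9],
}
print(brute_force_search([0.85, 0.15, 0.05], store))  # ['dog', 'puppy']
```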
## 4. Vector DB Landscape
| Database | Type | Open Source? |
|---|---|---|
| Pinecone | Managed Service | No |
| ChromaDB | Local / Server | Yes |
| Weaviate | Server | Yes |
| Milvus | Server | Yes |
| pgvector | Postgres Extension | Yes |
## 5. Code Example: Using ChromaDB
Here is how you ingest text and search for it using `chromadb` in Python.

```python
import chromadb

# 1. Setup
client = chromadb.Client()
collection = client.create_collection(name="demo")

# 2. Add data (embeddings are computed automatically by default!)
collection.add(
    documents=["I love python programming", "I hate snakes", "Pizza is great"],
    metadatas=[{"category": "tech"}, {"category": "animals"}, {"category": "food"}],
    ids=["id1", "id2", "id3"]
)

# 3. Query
# Searching for "coding" should match "I love python programming"
results = collection.query(
    query_texts=["coding"],
    n_results=1
)
print(results['documents'])
# Output: [['I love python programming']]
# Note: "coding" and "python programming" are semantically close!
```
## 6. Inverted Index vs Vector Index
| Feature | Inverted Index (Elasticsearch) | Vector Index (Pinecone) |
|---|---|---|
| Matches | Exact keywords (“bank” ≠ “river bank”) | Meanings (“bank” ≈ “finance”) |
| Handling Synonyms | Needs manual list | Automatic |
| Handling Typos | Fuzzy matching required | Robust to small errors |
| Best For | Specific product codes, names | Conceptual questions, recommendations |
> [!TIP]
> Hybrid Search is the best of both worlds. It combines keyword search (BM25) for precision with vector search for recall.
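One common way to combine the two is Reciprocal Rank Fusion (RRF). The sketch below assumes each retriever (BM25 and the vector index) has already produced a ranked list of document IDs; the document IDs are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ordered list of doc IDs per retriever.
    # Each doc earns 1 / (k + rank) per list it appears in;
    # k=60 is a conventional smoothing constant.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g., from BM25
vector_hits  = ["doc_b", "doc_c", "doc_a"]   # e.g., from the vector index
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc_b', 'doc_a', 'doc_c']
```

Documents ranked highly by both retrievers rise to the top, without needing to normalize the two incompatible score scales.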
## 7. Next Steps
Now that we can retrieve data, how do we structure our RAG pipeline for complex queries? Learn about Advanced RAG Architectures next.