Vector Databases

> [!NOTE]
> Vector Databases are the long-term memory for AI. They store data not as rows and columns, but as mathematical points in a multi-dimensional space.

1. What are Embeddings?

Before understanding vector databases, you must understand embeddings. An embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of text.

  • Input: “Apple”
  • Output: [0.12, -0.45, 0.88, ...] (e.g., 1536 dimensions for OpenAI’s text-embedding-3-small)

The magic is that semantically similar words end up close together in this vector space.

  • “Dog” and “Puppy” → Close together.
  • “Dog” and “Car” → Far apart.
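The idea of "close" and "far" can be made concrete with a toy sketch. The 2D vectors below are invented purely for illustration (real embeddings have hundreds or thousands of dimensions), but the distance comparison works the same way:

```python
import math

# Toy 2D "embeddings" — these coordinates are made up for illustration.
# A real embedding model would produce them from text.
vectors = {
    "dog":   (0.9, 0.8),
    "puppy": (0.85, 0.75),
    "car":   (-0.7, 0.2),
}

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(vectors["dog"], vectors["puppy"]))  # small: similar meaning
print(euclidean(vectors["dog"], vectors["car"]))    # large: unrelated meaning
```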

2. Interactive: Embedding Space Visualizer

This section's interactive demo visualizes how semantic search works in a simplified 2D space: drag the Query Point (red) to see which concepts are considered "similar" based on their distance to it.


3. How Vector Search Works

Traditional (SQL) databases use Keyword Search: exact matches, LIKE patterns, or regular expressions. Vector databases use Similarity Search, returning the stored vectors closest to the query vector.

Distance Metrics

To find “similar” vectors, we calculate the distance between them.

  1. Cosine Similarity: Measures the angle between two vectors.
    • Range: -1 to 1.
    • Use case: NLP, text similarity (magnitude doesn’t matter).
    • Formula: A · B / (||A|| * ||B||)
  2. Euclidean Distance (L2): Measures the straight-line distance.
    • Use case: Image clustering.
  3. Dot Product: Measures magnitude and direction.
    • Use case: Recommendation systems (where magnitude = rating).
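All three metrics can be sketched in a few lines of pure Python (the vectors here are made up for illustration):

```python
import math

def dot(a, b):
    """Dot product: rewards both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Vector length (L2 norm)."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """A · B / (||A|| * ||B||) — measures the angle only; magnitude cancels."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))   # ≈ 1.0: identical direction
print(euclidean_distance(a, b))  # > 0: still different points
print(dot(a, b))                 # 28.0: magnitude matters here
```

Note how `b` is "identical" to `a` under cosine similarity but not under Euclidean distance or dot product; that is why cosine is preferred for text, where document length should not dominate the score.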

Approximate Nearest Neighbor (ANN)

Searching millions of vectors by comparing every single one (Brute Force / KNN) is too slow. Vector DBs use ANN algorithms like HNSW (Hierarchical Navigable Small World).

  • Trade-off: Slightly less accurate (might miss the absolute #1 closest), but blazing fast (milliseconds).
  • Analogy: Instead of checking every house in the city, HNSW checks neighborhoods, then streets, then houses.
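To make the trade-off concrete, here is the brute-force scan that an ANN index replaces: score the query against every stored vector, sort, and take the top k. The corpus and IDs below are invented for illustration; this is O(n) per query, which is why it breaks down at millions of vectors.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_knn(query, corpus, k=1):
    """Score the query against EVERY stored vector — O(n) per query.

    This exhaustive linear scan is exactly what ANN indexes like HNSW
    avoid by navigating a graph of neighborhoods instead.
    """
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Tiny made-up corpus of 3D vectors
corpus = {
    "id1": [0.9, 0.1, 0.0],
    "id2": [0.0, 1.0, 0.1],
    "id3": [0.8, 0.2, 0.1],
}
print(brute_force_knn([1.0, 0.0, 0.0], corpus, k=2))
```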

4. Vector DB Landscape

| Database | Type               | Open Source? |
|----------|--------------------|--------------|
| Pinecone | Managed Service    | No           |
| ChromaDB | Local / Server     | Yes          |
| Weaviate | Server             | Yes          |
| Milvus   | Server             | Yes          |
| pgvector | Postgres Extension | Yes          |

5. Code Example: Using ChromaDB

Here is how you ingest text and search for it using chromadb in Python.

```python
import chromadb

# 1. Setup
client = chromadb.Client()
collection = client.create_collection(name="demo")

# 2. Add Data (Embeddings happen automatically by default!)
collection.add(
    documents=["I love python programming", "I hate snakes", "Pizza is great"],
    metadatas=[{"category": "tech"}, {"category": "animals"}, {"category": "food"}],
    ids=["id1", "id2", "id3"]
)

# 3. Query
# Searching for "coding" should match "I love python programming"
results = collection.query(
    query_texts=["coding"],
    n_results=1
)

print(results['documents'])
# Output: [['I love python programming']]
# Note: "coding" and "python programming" are semantically close!
```

6. Inverted Index vs Vector Index

| Feature           | Inverted Index (Elasticsearch)         | Vector Index (Pinecone)               |
|-------------------|----------------------------------------|---------------------------------------|
| Matches           | Exact keywords (“bank” ≠ “river bank”) | Meanings (“bank” ≈ “finance”)         |
| Handling synonyms | Needs manual list                      | Automatic                             |
| Handling typos    | Fuzzy matching required                | Robust to small errors                |
| Best for          | Specific product codes, names          | Conceptual questions, recommendations |

> [!TIP]
> Hybrid Search is the best of both worlds. It combines keyword search (BM25) for precision with vector search for recall.
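One common way to merge the two result lists is Reciprocal Rank Fusion (RRF); it is one fusion technique among several, not the only option. The sketch below (document IDs invented for illustration) combines a keyword ranking with a vector ranking:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists from different retrievers.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in; k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. BM25 order
vector_hits  = ["doc_b", "doc_d", "doc_a"]   # e.g. embedding order

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# -> ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it ranks highly in both lists, even though neither retriever put it first on its own.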

7. Next Steps

Now that we can retrieve data, how do we structure our RAG pipeline for complex queries? Learn about Advanced RAG Architectures next.