Vector Databases - ChromaDB and pgvector

Author: Venkata Sudhakar

A vector database stores high-dimensional numerical vectors (embeddings) and enables fast similarity search over them. When you convert a document chunk or a user query into an embedding using a model like OpenAI's text-embedding-3-small, the result is a vector of 1536 floating-point numbers (the model's default dimension). Finding the most semantically similar documents means finding the vectors closest to the query vector in that high-dimensional space, measured by cosine similarity or L2 (Euclidean) distance. Comparing the query against every stored vector is exact but scales linearly with collection size, and the B-tree and inverted indexes used by relational databases and keyword search engines do not help with this kind of lookup. Vector databases are purpose-built for it: they use approximate nearest neighbour (ANN) indexes such as HNSW and IVFFlat, which trade a small amount of recall for the ability to search millions of vectors in milliseconds.
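To make the two distance measures concrete, here is a minimal sketch of exact (brute-force) nearest-neighbour search in plain Python. The three-dimensional vectors and the names `cosine_similarity`, `l2_distance`, and `top_k` are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database would replace the full scan with an ANN index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, vectors, k=2):
    """Exact search: score every stored vector, keep the k most similar."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Toy "embeddings" (hypothetical values, not produced by any model)
vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

for doc_id, score in top_k(query, vectors):
    print(f"{doc_id}: cosine similarity = {score:.3f}")
```

The scan touches every stored vector, so its cost grows linearly with the collection. That is the cost an ANN index is meant to avoid.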

ChromaDB is a lightweight, open-source vector database suited to development, prototyping, and small-to-medium RAG applications. It can run in-process with no separate server, persists data to disk, and integrates with LangChain and LlamaIndex. pgvector is a PostgreSQL extension that adds vector storage and similarity search directly to a standard PostgreSQL database. If your production system already uses PostgreSQL, pgvector lets you store embeddings alongside your relational data in the same database, so there is no separate vector database to operate.

The example below shows ChromaDB with persistent storage: adding documents with metadata and querying by similarity. The stored metadata can also be used to filter queries, which matters for production RAG systems.

# pip install chromadb openai
import chromadb
from chromadb.utils import embedding_functions

# Persistent ChromaDB (survives restarts)
client = chromadb.PersistentClient(path="./chroma_storage")

# Use OpenAI embeddings (auto-embeds on add and query)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key-here",
    model_name="text-embedding-3-small"
)

collection = client.get_or_create_collection(
    name="tech_docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)

# Add documents with metadata (ChromaDB auto-generates embeddings)
collection.add(
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    documents=[
        "Debezium reads the MySQL binlog to capture inserts, updates and deletes.",
        "Flyway applies versioned SQL migration scripts in order using a schema history table.",
        "Apache Kafka stores messages in partitioned, replicated topics for 7 days by default.",
        "Blue-Green deployment switches traffic between two identical environments via a load balancer.",
        "pgvector adds vector similarity search to PostgreSQL using HNSW and IVFFlat indexes."
    ],
    metadatas=[
        {"category": "cdc", "source": "cdc_guide.pdf"},
        {"category": "migration", "source": "schema_guide.pdf"},
        {"category": "messaging", "source": "kafka_guide.pdf"},
        {"category": "deployment", "source": "deploy_guide.pdf"},
        {"category": "vector_db", "source": "pgvector_guide.pdf"}
    ]
)
print(f"Collection has {collection.count()} documents")

# Basic similarity search - top 3 most similar
results = collection.query(
    query_texts=["How does CDC work with MySQL?"],
    n_results=3
)
print("\nTop 3 similar docs:")
for i, (doc, meta, dist) in enumerate(zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0])):
    print(f"  [{i+1}] (distance={dist:.3f}) {doc[:70]}...")
    print(f"       Source: {meta['source']}")

It gives the following output:

Collection has 5 documents

Top 3 similar docs:
  [1] (distance=0.081) Debezium reads the MySQL binlog to capture inserts, updates and delete...
       Source: cdc_guide.pdf
  [2] (distance=0.342) Blue-Green deployment switches traffic between two identical environme...
       Source: deploy_guide.pdf
  [3] (distance=0.389) Flyway applies versioned SQL migration scripts in order using a schema...
       Source: schema_guide.pdf
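Metadata filtering restricts a similarity search to records whose metadata matches a condition. In ChromaDB this is expressed by passing a `where` dictionary to `collection.query`, for example `where={"category": "migration"}`. The sketch below is not ChromaDB code: it is a plain-Python illustration of what such a filtered query does conceptually, using hypothetical records, two-dimensional toy vectors, and a made-up helper `filtered_query`.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def filtered_query(records, query_vec, where, n_results=3):
    """Keep records matching every key in `where`, then rank them by distance."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda r: cosine_distance(r["embedding"], query_vec))
    return candidates[:n_results]

# Hypothetical records with toy embeddings (not produced by any model)
records = [
    {"id": "chunk_1", "metadata": {"category": "cdc"}, "embedding": [0.9, 0.1]},
    {"id": "chunk_2", "metadata": {"category": "migration"}, "embedding": [0.2, 0.8]},
    {"id": "chunk_3", "metadata": {"category": "migration"}, "embedding": [0.6, 0.4]},
]

hits = filtered_query(records, query_vec=[1.0, 0.0], where={"category": "migration"})
print([r["id"] for r in hits])

# Assumed ChromaDB equivalent, given the `collection` created earlier:
#   collection.query(query_texts=["..."], n_results=3,
#                    where={"category": "migration"})
```

Because the collection above is configured with cosine space, the `distance` values ChromaDB reports are cosine distances, so `1 - distance` recovers the cosine similarity.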

The example below shows pgvector with PostgreSQL: enabling the vector extension, creating a table with an embedding column, building an HNSW index, inserting a row, and querying by cosine similarity.

-- Step 1: Enable pgvector extension in PostgreSQL
CREATE EXTENSION IF NOT EXISTS vector;

-- Step 2: Create a table with a vector column
-- 1536 dimensions for text-embedding-3-small, 3072 for text-embedding-3-large
CREATE TABLE document_chunks (
    id          SERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    source      TEXT,
    category    TEXT,
    embedding   VECTOR(1536)
);

-- Step 3: Create an HNSW index for fast approximate nearest neighbour search
-- HNSW usually gives a better speed/recall trade-off at query time than IVFFlat,
-- but is slower to build and uses more memory
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Step 4: Insert a document with its embedding
-- In practice the embedding is generated by your app using the OpenAI API
INSERT INTO document_chunks (content, source, category, embedding)
VALUES (
    'Debezium reads the MySQL binlog to capture CDC events.',
    'cdc_guide.pdf',
    'cdc',
    '[0.023, -0.041, 0.087, ...]'  -- 1536 floats from embedding model
);

-- Step 5: Query the most similar documents
-- <=> is pgvector's cosine distance operator; 1 - distance gives cosine similarity
SELECT
    id,
    content,
    source,
    1 - (embedding <=> '[0.019, -0.038, 0.091, ...]') AS cosine_similarity
FROM document_chunks
ORDER BY embedding <=> '[0.019, -0.038, 0.091, ...]'
LIMIT 5;

With more rows inserted in the same way, the query gives output like the following:

 id |                        content                         |      source      | cosine_similarity
----+--------------------------------------------------------+------------------+-------------------
  1 | Debezium reads the MySQL binlog to capture CDC events. | cdc_guide.pdf    |            0.9823
  3 | CDC publishes change events to Apache Kafka topics.    | kafka_guide.pdf  |            0.8912
  2 | Flyway manages schema migrations with versioned SQL.   | schema_guide.pdf |            0.7234
  4 | Blue-Green deployment uses two identical environments. | deploy_guide.pdf |            0.6891
  5 | Kubernetes Deployments manage rolling updates safely.  | k8s_guide.pdf    |            0.6102
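In the SQL above the embedding appears as a text literal, while in practice the application supplies it. As a hedged sketch, the Python below serializes a list of floats into pgvector's text input format and prepares parameterized versions of the INSERT and SELECT statements. The helper names (`to_vector_literal`, `insert_params`, `search_params`) are hypothetical, and the `%s` placeholders assume a DB-API driver such as psycopg, which the article does not specify.

```python
def to_vector_literal(embedding):
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Parameterized versions of the article's statements; the ::vector cast
# converts the text parameter into a pgvector value on the server.
INSERT_SQL = """
INSERT INTO document_chunks (content, source, category, embedding)
VALUES (%s, %s, %s, %s::vector)
"""

SEARCH_SQL = """
SELECT id, content, source,
       1 - (embedding <=> %s::vector) AS cosine_similarity
FROM document_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def insert_params(content, source, category, embedding):
    """Parameter tuple for INSERT_SQL."""
    return (content, source, category, to_vector_literal(embedding))

def search_params(query_embedding, limit=5):
    """Parameter tuple for SEARCH_SQL (the literal is used twice)."""
    literal = to_vector_literal(query_embedding)
    return (literal, literal, limit)

# Toy 3-dimensional vector for illustration; the table expects 1536 values.
print(to_vector_literal([0.023, -0.041, 0.087]))

# Assumed usage with an open cursor `cur` from a DB-API driver:
#   cur.execute(INSERT_SQL, insert_params(text, "cdc_guide.pdf", "cdc", embedding))
#   cur.execute(SEARCH_SQL, search_params(query_embedding))
```

Passing the vector as a bound parameter rather than formatting it into the SQL string keeps the statement reusable and avoids quoting problems.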

ChromaDB vs pgvector - when to choose each:

Choose ChromaDB for local development, prototyping, and applications where the vector store is the only database. It requires zero infrastructure setup and has an excellent LangChain integration. Choose pgvector when your application already uses PostgreSQL and you want to avoid operating a separate database. pgvector lets you join vector search results with relational data in a single SQL query, which is a significant operational advantage in enterprise environments where all data lives in Postgres.
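As an illustration of combining relational predicates with vector search in one statement, the sketch below builds a parameterized query against the `document_chunks` table that filters on the `category` column and orders by cosine distance. The function `build_filtered_search` is a hypothetical helper, and the `%s` placeholder style assumes a DB-API driver such as psycopg. The same pattern extends to JOINs against other tables.

```python
def build_filtered_search(query_literal, category=None, limit=5):
    """Build (sql, params) mixing a relational filter with vector ordering.

    query_literal is the query embedding in pgvector text form,
    e.g. '[0.1,0.2,0.3]'.
    """
    sql = ["SELECT id, content, source FROM document_chunks"]
    params = []
    if category is not None:
        sql.append("WHERE category = %s")  # ordinary relational predicate
        params.append(category)
    sql.append("ORDER BY embedding <=> %s::vector")  # pgvector cosine distance
    params.append(query_literal)
    sql.append("LIMIT %s")
    params.append(limit)
    return "\n".join(sql), tuple(params)

# Toy literal for illustration; a real query embedding has 1536 values here.
sql, params = build_filtered_search("[0.1,0.2,0.3]", category="cdc", limit=3)
print(sql)
print(params)
```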
