Vector Databases - ChromaDB and pgvector
Author: Venkata Sudhakar
A vector database stores high-dimensional numerical vectors (embeddings) and enables fast similarity search over them. When you convert a document chunk or a user query into an embedding using a model like text-embedding-3-small, the result is a vector of 1536 floating-point numbers (that model's default dimensionality). Finding the most semantically similar documents means finding the vectors closest to the query vector in that high-dimensional space, measured by cosine similarity or L2 distance. Traditional relational databases and standard keyword search indexes are not designed for this: without a vector index, every query has to be compared against every stored vector. Vector databases are purpose-built for it, using approximate nearest neighbour (ANN) indexes such as HNSW and IVFFlat to search millions of vectors in milliseconds.

ChromaDB is a lightweight, open-source vector database suited to development, prototyping, and small-to-medium RAG applications. It can run in-process with no separate server, persists data to disk, and integrates with LangChain and LlamaIndex.

pgvector is a PostgreSQL extension that adds vector storage and similarity search directly to a standard PostgreSQL database. If your production system already uses PostgreSQL, pgvector lets you store embeddings alongside your relational data in the same database, so there is no separate vector database to operate.

The example below shows ChromaDB with persistent storage: adding documents with metadata, querying by similarity, and filtering by metadata. All three are critical for production RAG systems.
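A minimal sketch of those three operations, assuming the `chromadb` package is installed and using Chroma's default embedding function (which downloads a small model on first use). The storage path, the collection name `tech_docs`, the query text, and the helper names `format_hits` and `demo` are illustrative assumptions; the five documents reuse the sentences and source names that appear in the pgvector table later in this article.

```python
# Sketch only: assumes `pip install chromadb`. Path, collection name and
# query text are illustrative assumptions, not the article's original listing.

DOCS = [
    ("doc1", "Debezium reads the MySQL binlog to capture CDC events.", "cdc_guide.pdf"),
    ("doc2", "Flyway manages schema migrations with versioned SQL.", "schema_guide.pdf"),
    ("doc3", "CDC publishes change events to Apache Kafka topics.", "kafka_guide.pdf"),
    ("doc4", "Blue-Green deployment uses two identical environments.", "deploy_guide.pdf"),
    ("doc5", "Kubernetes Deployments manage rolling updates safely.", "k8s_guide.pdf"),
]


def format_hits(result):
    """Turn a Chroma query result (lists nested per query) into printable lines."""
    lines = []
    rows = zip(result["documents"][0], result["metadatas"][0], result["distances"][0])
    for rank, (text, meta, dist) in enumerate(rows, start=1):
        lines.append(f"[{rank}] (distance={dist:.3f}) {text}")
        lines.append(f"    Source: {meta['source']}")
    return lines


def demo(path="./chroma_db"):
    import chromadb  # imported here so the helper above works without it

    # PersistentClient stores the collection on disk under `path`.
    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name="tech_docs")

    # 1. Add documents with metadata. upsert keeps the script safe to
    #    re-run against the same on-disk collection.
    collection.upsert(
        ids=[d[0] for d in DOCS],
        documents=[d[1] for d in DOCS],
        metadatas=[{"source": d[2]} for d in DOCS],
    )
    print(f"Collection has {collection.count()} documents")

    # 2. Query by similarity: the query text is embedded and compared
    #    against the stored document embeddings.
    question = "How does change data capture work?"
    result = collection.query(query_texts=[question], n_results=3)
    print("Top 3 similar docs:")
    print("\n".join(format_hits(result)))

    # 3. Filter by metadata: only documents from one source are considered.
    filtered = collection.query(
        query_texts=[question],
        n_results=1,
        where={"source": "kafka_guide.pdf"},
    )
    print("Filtered to kafka_guide.pdf:")
    print("\n".join(format_hits(filtered)))
```

Calling `demo()` prints the collection size followed by the ranked matches. The documents, query, and embedding model in this sketch are not those of the original run, so the texts and distance values it prints will differ from the output shown next.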
It produces the following output:
Collection has 5 documents
Top 3 similar docs:
[1] (distance=0.081) Debezium reads the MySQL binlog to capture inserts, updates and deletes...
Source: cdc_guide.pdf
[2] (distance=0.342) Blue-Green deployment switches traffic between two identical environment...
Source: deploy_guide.pdf
[3] (distance=0.389) Flyway applies versioned SQL migration scripts in order using a schema...
Source: schema_guide.pdf
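In output like this, a smaller distance means a closer match. Which metric the number represents depends on how the collection is configured, so as a reference, here is a plain-Python sketch of the two measures named earlier, cosine similarity and L2 distance, together with the exact brute-force search that an ANN index approximates. The three-dimensional vectors and their labels are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form of cosine similarity: 0.0 means same direction."""
    return 1.0 - cosine_similarity(a, b)


def l2_distance(a, b):
    """Euclidean (straight-line) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=3, distance=cosine_distance):
    """Exact top-k search: score every stored vector, then sort ascending."""
    scored = [(distance(query, vec), key) for key, vec in vectors.items()]
    return sorted(scored)[:k]


# Toy data, invented for this sketch.
toy_vectors = {
    "cdc": [0.9, 0.1, 0.0],
    "deploy": [0.1, 0.9, 0.2],
    "schema": [0.3, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

for dist, key in nearest(query, toy_vectors):
    print(f"{key}: distance={dist:.3f}")
```

The brute-force loop touches every vector on every query, which is why it does not scale; an index such as HNSW trades a small amount of recall for visiting only a fraction of the vectors.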
The example below shows pgvector with PostgreSQL: creating the vector extension, creating a table with an embedding column, and querying by cosine similarity.
It produces the following output:
 id | content                                                | source           | cosine_similarity
----+--------------------------------------------------------+------------------+-------------------
  1 | Debezium reads the MySQL binlog to capture CDC events. | cdc_guide.pdf    | 0.9823
  3 | CDC publishes change events to Apache Kafka topics.    | kafka_guide.pdf  | 0.8912
  2 | Flyway manages schema migrations with versioned SQL.   | schema_guide.pdf | 0.7234
  4 | Blue-Green deployment uses two identical environments. | deploy_guide.pdf | 0.6891
  5 | Kubernetes Deployments manage rolling updates safely.  | k8s_guide.pdf    | 0.6102
ChromaDB vs pgvector - when to choose each: Choose ChromaDB for local development, prototyping, and applications where the vector store is the only database. It needs no separate infrastructure and integrates well with LangChain. Choose pgvector when your application already uses PostgreSQL and you want to avoid operating a separate database. pgvector lets you join vector search results with relational data in a single SQL query, which is a significant operational advantage in enterprise environments where the rest of the data already lives in Postgres.