RAG Incremental Indexing - Adding New Documents Without Full Rebuild

Author: Venkata Sudhakar

RAG incremental indexing allows ShopMax India to add new products and update existing documents without rebuilding the entire vector index from scratch. Full rebuilds are expensive: every document in a 100,000-product catalog has to be re-embedded and re-indexed, which takes minutes. When a new laptop model launches or a price changes, incremental indexing updates only the affected documents in seconds, keeping the RAG system current without downtime or the cost of a full rebuild.

Incremental indexing requires tracking which documents are already in the index. The standard approach stores a document ID together with a hash of the document's content. When a document arrives, compute its hash and compare it with the stored hash for that ID. If the hashes match, the content is unchanged and the document is skipped. If the hash differs or the ID is new, upsert the document (update if it exists, insert if it does not). ChromaDB supports upsert natively via its collection.upsert() method. For deleted products, maintain a deletion log and periodically purge the deleted IDs from the index.
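
The deletion-log purge described above could be sketched as follows. This is a minimal illustration under stated assumptions, not ShopMax India's actual implementation: FakeCollection is a hypothetical in-memory stand-in that mirrors only the delete(ids=...) and count() method names of a ChromaDB collection, and purge_deleted, deletion_log and batch_size are names invented for this example.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a vector collection (illustration only)."""

    def __init__(self, ids):
        self.ids = set(ids)

    def delete(self, ids):
        # Same keyword argument name as ChromaDB's collection.delete(ids=[...]).
        self.ids.difference_update(ids)

    def count(self):
        return len(self.ids)


def purge_deleted(collection, hash_store, deletion_log, batch_size=100):
    """Remove logged IDs from the index and the hash store; return how many were purged."""
    pending = sorted(set(deletion_log))  # de-duplicate repeated log entries
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        collection.delete(ids=batch)
        for doc_id in batch:
            # Forget the hash too, so a re-listed product is indexed again
            # instead of being skipped as "unchanged".
            hash_store.pop(doc_id, None)
    deletion_log.clear()
    return len(pending)


index = FakeCollection(["P001", "P002", "P003"])
hashes = {"P001": "a1", "P002": "b2", "P003": "c3"}  # placeholder hash values
log = ["P002", "P002"]  # the same product deactivated twice
purged = purge_deleted(index, hashes, log)
print(purged, index.count(), sorted(hashes))
```

Clearing the hash entry together with the index entry matters: if only the index were purged, a product that later returns with identical text would match its old hash and never be re-embedded.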

The following example demonstrates incremental indexing for ShopMax India product documents using ChromaDB. The indexer tracks document hashes and only re-embeds documents that have changed since the last index update.

import chromadb
import hashlib
import json
from sentence_transformers import SentenceTransformer

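# In-memory ChromaDB client; index contents are lost when the process exits.
client_db = chromadb.Client()
collection = client_db.get_or_create_collection("shopmax_products")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Maps document ID -> hash of the text that was last indexed for that ID.
hash_store = {}

def doc_hash(text):
    # MD5 is used only for change detection here, not for security.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def incremental_index(products):
    added = 0
    updated = 0
    skipped = 0
    for product in products:
        doc_id = product["id"]
        # The hash covers exactly the text that gets embedded.
        text = f"{product['name']}: {product['specs']} Price Rs {product['price']}. City: {product['city']}"
        h = doc_hash(text)
        # Unchanged content: skip the embedding call entirely.
        if hash_store.get(doc_id) == h:
            skipped += 1
            continue
        embedding = embedder.encode([text])[0].tolist()
        # upsert inserts a new ID or overwrites the existing entry.
        collection.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"name": product["name"], "price": product["price"]}]
        )
        # Record the hash only after the upsert has succeeded.
        is_update = doc_id in hash_store
        hash_store[doc_id] = h
        if is_update:
            updated += 1
        else:
            added += 1
    return {"added": added, "updated": updated, "skipped": skipped}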

initial_products = [
    {"id": "P001", "name": "Sony WH-1000XM5", "specs": "30hr battery, ANC", "price": 29990, "city": "Mumbai"},
    {"id": "P002", "name": "Samsung Galaxy S24", "specs": "200MP camera, 12GB RAM", "price": 134999, "city": "Delhi"}
]
result = incremental_index(initial_products)
print("Initial index:", json.dumps(result))

updated_products = [
    {"id": "P001", "name": "Sony WH-1000XM5", "specs": "30hr battery, ANC", "price": 27990, "city": "Mumbai"},
    {"id": "P002", "name": "Samsung Galaxy S24", "specs": "200MP camera, 12GB RAM", "price": 134999, "city": "Delhi"},
    {"id": "P003", "name": "Dell XPS 15", "specs": "32GB RAM, 1TB SSD", "price": 135000, "city": "Bangalore"}
]
result2 = incremental_index(updated_products)
print("After update:", json.dumps(result2))
print("Total docs in index:", collection.count())

The script prints the following output:

Initial index: {"added": 2, "updated": 0, "skipped": 0}
After update: {"added": 1, "updated": 1, "skipped": 1}
Total docs in index: 3

For ShopMax India at production scale, trigger incremental indexing from your product management system's webhook whenever a product is created, updated, or deactivated. Persist the hash store in Redis or a database table rather than in memory so that it survives service restarts. Schedule a full consistency check weekly to catch any missed updates: compare the hashes of all indexed documents against the source database and re-index any discrepancies. This keeps your RAG index within minutes of the live catalog without the cost of continuous full rebuilds.
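
One way to persist the hash store and run the consistency check is sketched below. It rests on several assumptions: SQLite from the Python standard library stands in for the "database table" option, the doc_hashes schema and the SqliteHashStore and find_discrepancies names are hypothetical, and the hash store is assumed to reflect what is actually in the vector index.

```python
import hashlib
import sqlite3


def doc_hash(text):
    # Same change-detection hash as in the indexing example.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class SqliteHashStore:
    """Persists doc_id -> content hash in a SQLite table (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to survive restarts.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS doc_hashes "
            "(doc_id TEXT PRIMARY KEY, hash TEXT NOT NULL)"
        )

    def get(self, doc_id):
        row = self.conn.execute(
            "SELECT hash FROM doc_hashes WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        return row[0] if row else None

    def set(self, doc_id, content_hash):
        self.conn.execute(
            "INSERT OR REPLACE INTO doc_hashes (doc_id, hash) VALUES (?, ?)",
            (doc_id, content_hash),
        )
        self.conn.commit()

    def all_ids(self):
        return {row[0] for row in self.conn.execute("SELECT doc_id FROM doc_hashes")}


def find_discrepancies(source_texts, store):
    """Return (stale_ids, orphaned_ids) for a dict of doc_id -> current source text."""
    # Stale: missing from the store, or stored hash differs from the source text.
    stale = sorted(
        doc_id for doc_id, text in source_texts.items()
        if store.get(doc_id) != doc_hash(text)
    )
    # Orphaned: still tracked in the store but gone from the source database.
    orphaned = sorted(store.all_ids() - set(source_texts))
    return stale, orphaned


store = SqliteHashStore()
store.set("P001", doc_hash("Sony WH-1000XM5: Price Rs 29990"))
store.set("P002", doc_hash("Samsung Galaxy S24: Price Rs 134999"))
source = {
    "P001": "Sony WH-1000XM5: Price Rs 27990",  # price changed, update was missed
    "P003": "Dell XPS 15: Price Rs 135000",     # never indexed
}
stale, orphaned = find_discrepancies(source, store)
print("re-index:", stale, "purge:", orphaned)
```

Stale IDs would be passed back through the incremental indexer, and orphaned IDs would be deleted from the index. Because get and set mirror the dictionary operations used earlier, the indexer would need only small changes to use such a store.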
