RAG Incremental Indexing - Adding New Documents Without Full Rebuild
Author: Venkata Sudhakar
RAG incremental indexing allows ShopMax India to add new products and update existing documents without rebuilding the entire vector index from scratch. Full rebuilds are expensive - a catalog of 100,000 products takes minutes to re-embed and re-index. When a new laptop model launches or a price changes, incremental indexing updates only the affected documents in seconds, keeping the RAG system current without downtime or full rebuild cost.
Incremental indexing requires tracking which documents are already in the index. The standard approach stores a content hash for each document ID: when a document arrives, compute its hash and compare it to the hash stored for that ID. If the hashes match, skip the document. If the hash differs or the ID is new, upsert the document (update if it exists, insert if it does not). ChromaDB supports upsert natively via its collection.upsert() method. For deleted products, maintain a deletion log and periodically purge the logged IDs from the index.
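The deletion step can be sketched as follows. This is a minimal illustration, not a definitive implementation: purge_deleted, FakeCollection, and the product IDs are hypothetical names introduced here. The stand-in only mimics the delete(ids=...) call that a ChromaDB collection provides, so the sketch runs without ChromaDB installed.

```python
class FakeCollection:
    """Hypothetical stand-in that mimics collection.delete(ids=...) from ChromaDB."""

    def __init__(self, ids):
        self.ids = set(ids)

    def delete(self, ids):
        self.ids.difference_update(ids)


def purge_deleted(collection, hash_store, deletion_log):
    """Remove every logged ID from the index and the hash store, then clear the log."""
    pending = sorted(set(deletion_log))  # de-duplicate repeated log entries
    if pending:
        collection.delete(ids=pending)
        for doc_id in pending:
            # Dropping the hash lets a re-listed product be indexed as new later.
            hash_store.pop(doc_id, None)
    deletion_log.clear()
    return pending


# Assumed sample state: three indexed products, one of them deactivated twice in the log.
collection = FakeCollection(["LAP-001", "PHN-001", "TV-001"])
hash_store = {"LAP-001": "h1", "PHN-001": "h2", "TV-001": "h3"}
deletion_log = ["TV-001", "TV-001"]

purged = purge_deleted(collection, hash_store, deletion_log)
print("Purged:", purged)
```

Clearing the hash together with the index entry matters: if the hash stayed behind, a product that is later re-listed with identical text would be skipped as unchanged.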
The following example demonstrates incremental indexing for ShopMax India product documents using ChromaDB. The indexer tracks document hashes and only re-embeds documents that have changed since the last index update.
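A minimal sketch of such an indexer is given here under stated assumptions. The class names, product IDs, and product texts are made up for illustration. InMemoryCollection is a hypothetical stand-in that mimics the upsert() and count() calls of a ChromaDB collection so the sketch runs without ChromaDB installed; with ChromaDB available, a collection obtained from chromadb.Client().get_or_create_collection("shopmax_products") could be passed in instead.

```python
import hashlib
import json


class InMemoryCollection:
    """Hypothetical stand-in mimicking upsert() and count() of a ChromaDB collection."""

    def __init__(self):
        self._docs = {}

    def upsert(self, ids, documents, metadatas=None):
        metadatas = metadatas or [None] * len(ids)
        for doc_id, text, meta in zip(ids, documents, metadatas):
            self._docs[doc_id] = (text, meta)  # insert or overwrite by ID

    def count(self):
        return len(self._docs)


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IncrementalIndexer:
    def __init__(self, collection):
        self.collection = collection
        self.hashes = {}  # doc_id -> content hash (in memory, for the demo only)

    def index(self, docs):
        stats = {"added": 0, "updated": 0, "skipped": 0}
        ids, texts, metas, pending = [], [], [], {}
        for doc in docs:
            digest = content_hash(doc["text"])
            previous = self.hashes.get(doc["id"])
            if previous == digest:
                stats["skipped"] += 1  # unchanged, so no re-embedding is needed
                continue
            stats["added" if previous is None else "updated"] += 1
            ids.append(doc["id"])
            texts.append(doc["text"])
            metas.append({"content_hash": digest})
            pending[doc["id"]] = digest
        if ids:
            # Only new or changed documents reach the vector store.
            self.collection.upsert(ids=ids, documents=texts, metadatas=metas)
            self.hashes.update(pending)  # record hashes only after the upsert succeeds
        return stats


collection = InMemoryCollection()
indexer = IncrementalIndexer(collection)

initial = indexer.index([
    {"id": "LAP-001", "text": "Laptop A, 16 GB RAM, Rs 74,990"},
    {"id": "PHN-001", "text": "Phone B, 128 GB storage, Rs 24,999"},
])
print("Initial index:", json.dumps(initial))

after = indexer.index([
    {"id": "LAP-001", "text": "Laptop A, 16 GB RAM, Rs 69,990"},      # price changed
    {"id": "PHN-001", "text": "Phone B, 128 GB storage, Rs 24,999"},  # unchanged
    {"id": "LAP-002", "text": "Laptop C, 32 GB RAM, Rs 1,19,990"},    # new model
])
print("After update:", json.dumps(after))
print("Total docs in index:", collection.count())
```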
Running the example gives the following output:
Initial index: {"added": 2, "updated": 0, "skipped": 0}
After update: {"added": 1, "updated": 1, "skipped": 1}
Total docs in index: 3
For ShopMax India at production scale, trigger incremental indexing from your product management system's webhook whenever a product is created, updated, or deactivated. Keep the hash store in Redis or a database table rather than in memory so that it persists across service restarts. Schedule a weekly full consistency check to catch any missed updates: compare the stored hash of every indexed document against the current content in the source database and re-index any discrepancies. This keeps your RAG index within minutes of the live catalog without the cost of continuous full rebuilds.
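The weekly consistency check might look like the sketch below. It is an illustration under assumptions rather than a prescribed design: find_discrepancies and the sample data are hypothetical, the source database is represented by a plain dict of active product texts, and the hash store is a dict of the hashes recorded at index time.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_discrepancies(source_docs, indexed_hashes):
    """Compare the live catalog with the hash store.

    source_docs:    doc_id -> current text of each active product
    indexed_hashes: doc_id -> hash recorded when the document was indexed
    """
    missing = sorted(i for i in source_docs if i not in indexed_hashes)
    stale = sorted(
        i for i, text in source_docs.items()
        if i in indexed_hashes and indexed_hashes[i] != content_hash(text)
    )
    orphaned = sorted(i for i in indexed_hashes if i not in source_docs)
    return {"missing": missing, "stale": stale, "orphaned": orphaned}


# Assumed sample state: one missed price update, one missed insert, one missed deletion.
source_docs = {
    "LAP-001": "Laptop A, 16 GB RAM, Rs 69,990",
    "PHN-001": "Phone B, 128 GB storage, Rs 24,999",
    "LAP-002": "Laptop C, 32 GB RAM, Rs 1,19,990",
}
indexed_hashes = {
    "LAP-001": content_hash("Laptop A, 16 GB RAM, Rs 74,990"),
    "PHN-001": content_hash("Phone B, 128 GB storage, Rs 24,999"),
    "TV-001": content_hash("Television D, 55 inch"),
}

report = find_discrepancies(source_docs, indexed_hashes)
print(report)
```

Documents listed as missing or stale would be passed back through the incremental indexer, while orphaned IDs would be added to the deletion log.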