RAG Context Window Optimization - Fitting More Docs in Fewer Tokens
Author: Venkata Sudhakar
RAG context window optimization ensures ShopMax India's retrieval pipeline fits the maximum useful information within the LLM's token budget. Claude and GPT-4 support large context windows (100k-200k tokens), but sending too many retrieved chunks increases cost and can cause the model to miss key information buried in the middle - a phenomenon called 'lost in the middle'. The goal is to pack the most relevant content in the first and last positions of the context, and trim aggressively in between.
Four practical techniques address context window constraints in RAG: (1) Token counting before submission - count tokens per chunk and enforce a budget; (2) Lost-in-the-middle mitigation - place the most relevant chunk first and the second-most relevant last; (3) Deduplication - remove near-duplicate chunks that add tokens without new information; (4) Summary compression - replace chunks that exceed the budget with a one-sentence summary. These techniques are composable. Whatever order they run in, do the reordering last, so that the first and last positions are filled by chunks that survive deduplication and trimming.
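The four techniques can be sketched offline, without any API call. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `Chunk` type, the function names, the 4-characters-per-token estimate, the Jaccard threshold of 0.8, and the first-sentence truncation that stands in for a real one-sentence summary are all hypothetical choices made for this sketch. It deduplicates and trims to the budget before reordering, so that the edge positions are filled from the chunks that remain.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # retriever relevance score; higher means more relevant


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Replace with a real
    # tokenizer or the provider's token counting call for exact budgets.
    return max(1, len(text) // 4)


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / max(1, len(a | b))


def dedupe(chunks: list[Chunk], threshold: float = 0.8) -> list[Chunk]:
    """Drop a chunk if its word set is near-identical to a higher-scoring kept chunk."""
    kept: list[Chunk] = []
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        if all(_jaccard(_words(c.text), _words(k.text)) < threshold for k in kept):
            kept.append(c)
    return kept


def first_sentence(text: str) -> str:
    # Placeholder for summary compression: a real pipeline would ask an LLM
    # for a one-sentence summary rather than truncate.
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def fit_budget(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Keep the highest-scoring chunks that fit; compress a chunk that does not."""
    selected: list[Chunk] = []
    used = 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(c.text)
        if used + cost > budget:
            c = Chunk(first_sentence(c.text), c.score)
            cost = estimate_tokens(c.text)
        if used + cost <= budget:
            selected.append(c)
            used += cost
    return selected


def reorder_for_edges(chunks: list[Chunk]) -> list[Chunk]:
    """Most relevant first, second-most relevant last, weakest in the middle."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front = ranked[0::2]  # ranks 1, 3, 5, ...
    back = ranked[1::2]   # ranks 2, 4, 6, ...
    return front + back[::-1]


def build_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    return reorder_for_edges(fit_budget(dedupe(chunks), budget))
```

Calling `build_context(retrieved_chunks, budget=800)` returns the chunks in the order they would be concatenated into the prompt. Alternating ranks between the front and the back extends the "first and last" rule to longer lists, so relevance falls off toward the middle from both ends.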
The following example implements token-budget-aware RAG for ShopMax India, using the Anthropic token counting API to enforce a context limit and applying lost-in-the-middle mitigation by reordering chunks before submission.
It gives the following output:
Q: Which noise-cancelling headphones are available under Rs 25000?
Docs used: 5
A: The following noise-cancelling headphones are available under Rs 25,000:
- Bose QC45 at Rs 24,990 (Delhi, Chennai)
- Jabra Elite 85h at Rs 19,990 (Hyderabad, Mumbai)
- JBL Tour One M2 at Rs 22,990 (Chennai, Mumbai)
- Anker Soundcore Q45 at Rs 5,990 (pan-India)
- Sony WH-CH720N at Rs 9,990 (pan-India)
For ShopMax India, set your token budget based on the model tier being used - Haiku allows tighter budgets (800 tokens) for cost efficiency, while Opus can handle larger contexts (3000 tokens) for complex comparison queries. Cache token counts per document at index time so the budget enforcement step does not require an API call. Monitor the average number of documents that pass the budget gate per query category - if comparison queries consistently drop to 2 documents, increase the budget or switch to contextual compression to fit more content.
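The caching and monitoring advice can be sketched as follows. This is an illustrative outline rather than a reference implementation: `index_documents`, `budget_gate`, `GateMonitor`, and the `BUDGETS` keys are hypothetical names, and the `count_tokens` argument stands in for whatever counter is used at index time (for example, a call to the provider's token counting endpoint). The budget values are the ones quoted above.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable

# Per-tier budgets taken from the text; the dictionary keys are illustrative.
BUDGETS = {"haiku": 800, "opus": 3000}


def index_documents(docs: dict[str, str],
                    count_tokens: Callable[[str], int]) -> dict[str, int]:
    """Count tokens once per document at index time; returns doc_id -> token count."""
    return {doc_id: count_tokens(text) for doc_id, text in docs.items()}


def budget_gate(ranked_ids: list[str], token_cache: dict[str, int],
                budget: int) -> list[str]:
    """Walk documents in relevance order and keep those that still fit the budget.

    Only cached counts are read, so no API call happens at query time.
    """
    kept: list[str] = []
    used = 0
    for doc_id in ranked_ids:
        cost = token_cache[doc_id]
        if used + cost <= budget:
            kept.append(doc_id)
            used += cost
    return kept


class GateMonitor:
    """Tracks the average number of documents passing the gate per query category."""

    def __init__(self) -> None:
        self._docs: dict[str, int] = defaultdict(int)
        self._queries: dict[str, int] = defaultdict(int)

    def record(self, category: str, docs_kept: int) -> None:
        self._docs[category] += docs_kept
        self._queries[category] += 1

    def average(self, category: str) -> float:
        n = self._queries.get(category, 0)
        return self._docs.get(category, 0) / n if n else 0.0
```

After each query, `monitor.record("comparison", len(kept))` feeds the running average. A category whose average settles near 2 is the signal described above to raise the budget or switch to contextual compression.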