RAG Context Window Optimization - Fitting More Docs in Less Tokens

Author: Venkata Sudhakar

RAG context window optimization ensures that ShopMax India's retrieval pipeline fits as much useful information as possible within the LLM's token budget. Models such as Claude and GPT-4 support large context windows (100k-200k tokens), but sending too many retrieved chunks increases cost and can cause the model to miss key information buried in the middle of the prompt, a phenomenon known as 'lost in the middle'. The goal is to place the most relevant content at the start and end of the context and to trim aggressively in between.

Four practical techniques address context window constraints in RAG: (1) Token counting before submission: count the tokens in each chunk and enforce a budget; (2) Lost-in-the-middle mitigation: place the most relevant chunk first and the second-most relevant chunk last; (3) Deduplication: remove near-duplicate chunks that add tokens without adding new information; (4) Summary compression: replace chunks that exceed the budget with a one-sentence summary. These techniques are composable and should be applied in this order.
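Technique (3) can be sketched as follows. This is a minimal illustration, not part of the ShopMax India pipeline: it assumes chunks arrive sorted by relevance, measures overlap with Jaccard similarity over word sets, and uses a hypothetical threshold of 0.8. The helper names `word_set`, `jaccard` and `dedupe_chunks` are invented for this sketch.

```python
import re

def word_set(text):
    """Return the lowercased alphanumeric tokens of a chunk as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe_chunks(chunks, threshold=0.8):
    """Keep the first occurrence of each chunk and drop later near-duplicates.

    `chunks` is assumed to be ordered best first, so the copy that
    survives is always the higher-ranked one.
    """
    kept, kept_sets = [], []
    for chunk in chunks:
        tokens = word_set(chunk)
        # Skip this chunk if it overlaps heavily with one already kept.
        if any(jaccard(tokens, seen) >= threshold for seen in kept_sets):
            continue
        kept.append(chunk)
        kept_sets.append(tokens)
    return kept

chunks = [
    "Sony WH-1000XM5: 30-hour battery, noise-cancelling, USB-C, Rs 29990.",
    "Sony WH-1000XM5: 30-hour battery, noise-cancelling, USB-C, Rs 29990, in stock.",
    "Bose QC45: 24-hour battery, noise-cancelling, Rs 24990.",
]
unique_chunks = dedupe_chunks(chunks)
print(len(unique_chunks))  # 2: the second Sony chunk is dropped
```

The threshold is a tuning choice. Product listings that share a template (battery, ANC, price) overlap on many words, so a value that is too low may merge distinct products.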

The following example implements token-budget-aware RAG for ShopMax India. It retrieves chunks with BM25, uses the Anthropic token counting API to enforce a context limit, and applies lost-in-the-middle mitigation by reordering the chunks before submission.

import anthropic
from rank_bm25 import BM25Okapi

client = anthropic.Anthropic(api_key="sk-ant-...")
MAX_CONTEXT_TOKENS = 1500

product_docs = [
    "Sony WH-1000XM5: 30-hour battery, noise-cancelling, USB-C, Rs 29990, available in Mumbai and Bangalore.",
    "Bose QC45: 24-hour battery, noise-cancelling, Rs 24990, foldable design, available in Delhi and Chennai.",
    "Apple AirPods Max: 20-hour battery, H1 chip, Rs 59900, aluminium ear cups, available pan-India.",
    "Jabra Elite 85h: 36-hour battery, ANC, Rs 19990, rain resistant, available in Hyderabad and Mumbai.",
    "Sennheiser Momentum 4: 60-hour battery, ANC, Rs 32990, foldable, available in Bangalore and Delhi.",
    "JBL Tour One M2: 30-hour battery, ANC, Rs 22990, quick charge, available in Chennai and Mumbai.",
    "Anker Soundcore Q45: 40-hour battery, ANC, Rs 5990, budget pick, available pan-India.",
    "Sony WH-CH720N: 35-hour battery, ANC, Rs 9990, lightweight 192g, available pan-India."
]
tokenized = [doc.lower().split() for doc in product_docs]
bm25 = BM25Okapi(tokenized)

def count_tokens(text):
    # Ask the API how many input tokens this text uses as a user message.
    resp = client.messages.count_tokens(
        model="claude-opus-4-7",
        messages=[{"role": "user", "content": text}]
    )
    return resp.input_tokens

def optimized_rag(query, top_k=8):
    # Rank all documents by BM25 score and keep the top_k indices.
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Greedily add chunks in rank order while they fit the remaining budget.
    selected = []
    token_budget = MAX_CONTEXT_TOKENS
    for idx in ranked:
        chunk = product_docs[idx]
        tokens = count_tokens(chunk)
        if tokens <= token_budget:
            selected.append((scores[idx], chunk))
            token_budget -= tokens
        # Stop once too little budget remains for another chunk.
        if token_budget < 50:
            break
    # Lost-in-the-middle mitigation: best chunk first, second-best chunk last.
    selected.sort(key=lambda x: x[0], reverse=True)
    if len(selected) >= 2:
        reordered = [selected[0][1]] + [s[1] for s in selected[2:]] + [selected[1][1]]
    else:
        reordered = [s[1] for s in selected]
    context = "\n".join(reordered)
    # Send the reordered context and the question to the model.
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        system="You are ShopMax India assistant. Answer using only the provided context.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}]
    )
    # Return the answer text and the number of chunks that were sent.
    return msg.content[0].text, len(reordered)

query = "Which noise-cancelling headphones are available under Rs 25000?"
answer, n_docs = optimized_rag(query)
print(f"Q: {query}")
print(f"Docs used: {n_docs}")
print(f"A: {answer}")

It gives the following output:

Q: Which noise-cancelling headphones are available under Rs 25000?
Docs used: 8
A: The following noise-cancelling headphones are available under Rs 25,000:
- Bose QC45 at Rs 24,990 (Delhi, Chennai)
- Jabra Elite 85h at Rs 19,990 (Hyderabad, Mumbai)
- JBL Tour One M2 at Rs 22,990 (Chennai, Mumbai)
- Anker Soundcore Q45 at Rs 5,990 (pan-India)
- Sony WH-CH720N at Rs 9,990 (pan-India)
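The reordering inside optimized_rag moves only the second-best chunk to the end and leaves ranks three onward in descending order through the middle. One possible variant, shown here as a sketch and not as the method used above, alternates chunks between the front and the back so that relevance falls off towards the centre from both sides. The function name `reorder_for_edges` is hypothetical.

```python
def reorder_for_edges(ranked_chunks):
    """Place high-relevance chunks at both ends and low-relevance ones in the middle.

    `ranked_chunks` must be sorted best first. Ranks 1, 3, 5, ... fill the
    front in order; ranks 2, 4, 6, ... fill the back in reverse order.
    """
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ordered = reorder_for_edges(["rank1", "rank2", "rank3", "rank4", "rank5"])
print(ordered)  # ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

With two or three chunks, this produces the same order as the original reordering. The two approaches differ only from the fourth chunk onward.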

For ShopMax India, set the token budget according to the model tier in use: Haiku suits tighter budgets (800 tokens) for cost efficiency, while Opus can handle larger contexts (3000 tokens) for complex comparison queries. Cache token counts per document at index time so that budget enforcement does not require an API call for every chunk at query time. Monitor the average number of documents that pass the budget gate per query category; if comparison queries consistently drop to 2 documents, increase the budget or switch to contextual compression to fit more content.
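The caching advice could look roughly like the sketch below. It assumes token counts are computed once when documents are indexed and stored alongside them. To keep the sketch runnable offline, `estimate_tokens` is a rough stand-in (about four characters per token, an assumption) for the real token counting call. All function names are hypothetical.

```python
def estimate_tokens(text):
    """Rough stand-in for a tokenizer: about 4 characters per token.

    Assumption for illustration only; at index time this would be
    replaced by counts from the token counting API.
    """
    return max(1, len(text) // 4)

def build_token_index(docs, counter=estimate_tokens):
    """Count tokens once per document at index time."""
    return [counter(doc) for doc in docs]

def select_within_budget(ranked_ids, token_counts, budget):
    """Greedily keep ranked documents whose cached counts fit the budget.

    Returns the selected document ids and the total tokens they use.
    """
    selected, remaining = [], budget
    for doc_id in ranked_ids:
        cost = token_counts[doc_id]
        if cost <= remaining:
            selected.append(doc_id)
            remaining -= cost
    return selected, budget - remaining

docs = ["a" * 40, "b" * 400, "c" * 80]
token_counts = build_token_index(docs)  # computed once, reused per query
picked, used = select_within_budget([1, 0, 2], token_counts, budget=50)
print(picked, used)  # [0, 2] 30
```

Logging `len(picked)` per query category gives the budget-gate metric described above without any extra API traffic.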
