RAG Pipeline Caching with Redis for Low-Latency Responses

Author: Venkata Sudhakar

RAG pipeline caching with Redis eliminates repeated retrieval and LLM calls for identical or near-identical queries, cutting both latency and API costs for ShopMax India. Product Q&A systems see high query repetition: thousands of customers ask 'what is the price of Samsung Galaxy S24' every day. Without caching, each query hits the vector store, re-ranks documents, and calls the LLM. With Redis caching, only the first query pays the full cost; subsequent identical queries are served from cache in a few milliseconds (about 2ms in the example below, against roughly 800ms for an uncached call).
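
To make the latency argument concrete, the average response time can be estimated as a weighted mix of cache hits and misses. The sketch below is only a back-of-the-envelope model; the function name `expected_latency_ms` and the figures (2 ms per hit, 800 ms per miss) are illustrative assumptions, not measurements from a production system.

```python
def expected_latency_ms(hit_rate, hit_ms, miss_ms):
    """Average latency when a fraction hit_rate of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

# Illustrative figures only: ~2 ms for a cache hit, ~800 ms for retrieval + LLM call.
for rate in (0.0, 0.2, 0.5, 0.8):
    avg = expected_latency_ms(rate, hit_ms=2.0, miss_ms=800.0)
    print(f"hit rate {rate:.0%}: {avg:.1f} ms average")
```

Under these assumed numbers, the miss path dominates the average, so most of the benefit comes from raising the hit rate rather than from making cache lookups faster.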

Two caching strategies suit RAG pipelines. Exact-match caching uses the (normalized) query string as the Redis key and works well for frequently asked identical questions. Semantic caching uses query embeddings to find cached answers for semantically similar queries, even when the phrasing differs. Exact-match caching is simpler and carries effectively no false-positive risk, because an answer is returned only when the query matches exactly; semantic caching has higher coverage but requires a similarity threshold to avoid returning irrelevant cached answers. Most production systems start with exact-match caching, then add semantic caching for high-traffic categories.
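
The semantic variant can be sketched as follows. This is a minimal in-memory illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the `SemanticCache` class and the 0.8 threshold are assumptions made for this example, and a real deployment would keep the vectors in Redis or a vector index instead of a Python list.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (an assumption for this sketch).
    A real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, answer) pairs

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

    def get(self, query):
        """Return the answer of the most similar cached query, or None."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

sem_cache = SemanticCache(threshold=0.8)
sem_cache.put("What is the price of Samsung Galaxy S24 Ultra?", "Rs 134999")

# Reworded query: served from the cache.
print(sem_cache.get("what is the samsung galaxy s24 ultra price"))
# Unrelated query: falls through to the full pipeline.
print(sem_cache.get("How much RAM does the OnePlus 12 have?"))
```

The false-positive risk is easy to reproduce with this toy embedding: "What is the price of Samsung Galaxy S24?" (without "Ultra") also scores above 0.8 against the cached query and would receive the Ultra's price. This is why the threshold needs tuning per category, and why exact-match caching is the safer starting point.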

The following example implements exact-match Redis caching for ShopMax India's RAG pipeline. The cache stores each answer under a hash of the normalized query with a TTL of 1 hour, and cache hits bypass both retrieval and LLM calls entirely.

import anthropic
import redis
import hashlib
import time
from rank_bm25 import BM25Okapi

# With no api_key argument, the client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL = 3600

product_docs = [
    "Samsung Galaxy S24 Ultra: 200MP camera, 12GB RAM, Rs 134999, available pan-India.",
    "iPhone 15 Pro: 48MP camera, 8GB RAM, Rs 134900, available pan-India.",
    "OnePlus 12: 50MP camera, 16GB RAM, Rs 64999, available in Mumbai and Bangalore.",
    "Google Pixel 8 Pro: 50MP camera, 12GB RAM, Rs 106999, available in Delhi and Hyderabad."
]
tokenized = [doc.lower().split() for doc in product_docs]
bm25 = BM25Okapi(tokenized)

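def cache_key(query):
    # Normalize case and surrounding whitespace so trivially different inputs share a key.
    # MD5 is used only to get a fixed-length key, not for security.
    return "rag:" + hashlib.md5(query.strip().lower().encode()).hexdigest()

def cached_rag(query, top_k=2):
    key = cache_key(query)
    cached = cache.get(key)
    # Compare against None so that an empty cached answer still counts as a hit.
    if cached is not None:
        return cached, True
    # Cache miss: retrieve the top_k documents by BM25 score.
    scores = bm25.get_scores(query.lower().split())
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    context = "\n".join([product_docs[i] for i in idx])
    # Generate the answer from the retrieved context only.
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=150,
        system="You are ShopMax India assistant. Answer using only the provided context.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}]
    )
    answer = msg.content[0].text
    # Store the answer with a TTL so stale entries expire on their own.
    cache.setex(key, CACHE_TTL, answer)
    return answer, False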

queries = [
    "What is the price of Samsung Galaxy S24 Ultra?",
    "How much RAM does the OnePlus 12 have?",
    "What is the price of Samsung Galaxy S24 Ultra?"
]

for q in queries:
    start = time.perf_counter()  # monotonic clock, better suited to short intervals
    answer, from_cache = cached_rag(q)
    elapsed = (time.perf_counter() - start) * 1000
    source = "CACHE" if from_cache else "LLM"
    print(f"[{source}] {elapsed:.1f}ms | Q: {q}")
    print(f"A: {answer}")
    print()

It produces output similar to the following (timings and exact wording will vary between runs):

[LLM] 842.3ms | Q: What is the price of Samsung Galaxy S24 Ultra?
A: The Samsung Galaxy S24 Ultra is priced at Rs 1,34,999 and is available pan-India.

[LLM] 763.1ms | Q: How much RAM does the OnePlus 12 have?
A: The OnePlus 12 has 16GB RAM.

[CACHE] 2.1ms | Q: What is the price of Samsung Galaxy S24 Ultra?
A: The Samsung Galaxy S24 Ultra is priced at Rs 1,34,999 and is available pan-India.

For ShopMax India in production, set the cache TTL according to how often the underlying product data changes: 15-30 minutes for pricing and stock answers, and up to 24 hours for spec-based answers (RAM, battery life). Add a cache invalidation hook to your product update pipeline so that when a product's price changes in the database, the corresponding cache keys are deleted immediately. Because the keys are query hashes, this requires recording which cache keys depend on which product. Monitor the cache hit rate per query category: a hit rate below 20% for your top-10 query types suggests that the TTL is too short or that queries are too diverse for exact-match caching to be effective.
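
These three recommendations can be sketched together as follows. Everything here is an assumption made for illustration: the keyword-based `categorize` function, the `rag:product:<id>` index keys, the product ID `s24-ultra`, and the `HitRateTracker` class are hypothetical. The `store` argument is expected to be a `redis.Redis` client; the `InMemoryStore` class only mimics the four redis-py methods used (`setex`, `get`, `sadd`, `smembers`, `delete`) so that the sketch runs without a Redis server.

```python
import hashlib
from collections import defaultdict

# Hypothetical per-category TTLs: 20 minutes for pricing/stock, 24 hours for specs.
TTL_SECONDS = {"pricing": 20 * 60, "specs": 24 * 60 * 60}
PRICING_WORDS = ("price", "cost", "stock", "available")

def categorize(query):
    """Naive keyword classifier (an assumption; a real system may use a model)."""
    q = query.lower()
    return "pricing" if any(word in q for word in PRICING_WORDS) else "specs"

def cache_key(query):
    return "rag:" + hashlib.md5(query.strip().lower().encode()).hexdigest()

def store_answer(store, query, answer, product_ids):
    """Cache the answer with a category TTL and index it under each product it uses."""
    key = cache_key(query)
    store.setex(key, TTL_SECONDS[categorize(query)], answer)
    for pid in product_ids:
        store.sadd(f"rag:product:{pid}", key)  # product -> dependent cache keys
    return key

def invalidate_product(store, product_id):
    """Invalidation hook: call this from the product update pipeline."""
    index_key = f"rag:product:{product_id}"
    keys = store.smembers(index_key)
    if keys:
        store.delete(*keys)
    store.delete(index_key)
    return len(keys)

class HitRateTracker:
    """Counts cache hits and lookups per query category."""
    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, category, hit):
        self.total[category] += 1
        if hit:
            self.hits[category] += 1

    def rate(self, category):
        total = self.total[category]
        return self.hits[category] / total if total else 0.0

class InMemoryStore:
    """Stand-in for redis.Redis in this demo. TTLs are recorded but not enforced."""
    def __init__(self):
        self.values, self.ttls, self.sets = {}, {}, defaultdict(set)

    def setex(self, name, time, value):
        self.values[name], self.ttls[name] = value, time

    def get(self, name):
        return self.values.get(name)

    def sadd(self, name, *members):
        self.sets[name].update(members)

    def smembers(self, name):
        return set(self.sets.get(name, ()))

    def delete(self, *names):
        for n in names:
            self.values.pop(n, None)
            self.ttls.pop(n, None)
            self.sets.pop(n, None)

store = InMemoryStore()
tracker = HitRateTracker()
question = "What is the price of Samsung Galaxy S24 Ultra?"
key = store_answer(store, question, "Rs 134999", ["s24-ultra"])
ttl_used = store.ttls[key]
tracker.record(categorize(question), hit=False)  # first lookup was a miss
tracker.record(categorize(question), hit=True)   # repeat lookup was a hit
removed = invalidate_product(store, "s24-ultra")  # price changed in the database
print(ttl_used, removed, store.get(key), tracker.rate("pricing"))
```

One caveat with this design: the product index sets have no TTL of their own, so they can keep references to cache keys that have already expired. Deleting an expired key is harmless, but a long-running system would want to expire or periodically trim the index sets as well.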
