In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > RAG Pipelines > End-to-End RAG Pipeline Testing and Benchmarking

End-to-End RAG Pipeline Testing and Benchmarking

Author: Venkata Sudhakar

End-to-end RAG pipeline testing validates the entire system from query input to final answer quality, not just the retrieval step in isolation. ShopMax India needs this to catch failures that only manifest when retrieval and generation interact - for example, when retrieved documents are individually relevant but contradict each other, causing the LLM to hallucinate a compromise answer. A complete test suite covers retrieval quality, answer faithfulness, latency, and cost in a single automated run.

A comprehensive RAG test suite has four layers: unit tests (retriever returns expected document IDs), integration tests (retriever plus LLM produces grounded answers), performance tests (latency under concurrent load), and regression tests (answer quality does not degrade after any change). Tools like pytest for test orchestration, RAGAS for automated answer quality scoring, and a golden dataset of query-answer pairs form the foundation. Running this suite in CI/CD ensures no code merge degrades the RAG system quality.

The following example builds a complete end-to-end test suite for ShopMax India's RAG pipeline. It tests retrieval accuracy, answer groundedness, answer completeness, and latency in a single pytest-compatible test file.

import anthropic
import time
from rank_bm25 import BM25Okapi

client = anthropic.Anthropic(api_key="sk-ant-...")

product_docs = [
    {"id": "D001", "text": "Sony WH-1000XM5: 30-hour battery, ANC, Rs 29990, Mumbai and Bangalore."},
    {"id": "D002", "text": "Samsung Galaxy S24: 200MP camera, 12GB RAM, Rs 134999, pan-India."},
    {"id": "D003", "text": "Dell XPS 15: 32GB RAM, 1TB SSD, Rs 135000, Delhi and Mumbai."},
    {"id": "D004", "text": "Apple iPhone 15 Pro: USB-C, 48MP, Rs 134900, pan-India."}
]
tokenized = [d["text"].lower().split() for d in product_docs]
bm25 = BM25Okapi(tokenized)

def retrieve(query, top_k=2):
    scores = bm25.get_scores(query.lower().split())
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [product_docs[i] for i in idx]

def answer(query, docs):
    context = "\n".join([d["text"] for d in docs])
    msg = client.messages.create(
        model="claude-opus-4-7", max_tokens=150,
        system="You are ShopMax India assistant. Answer using only the provided context.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQ: {query}"}]
    )
    return msg.content[0].text

def score_groundedness(answer_text, context):
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=10,
        messages=[{"role": "user", "content": f"Is this answer fully supported by the context? Score 0-10 only.\nContext: {context}\nAnswer: {answer_text}"}]
    )
    try:
        return int(msg.content[0].text.strip().split()[0])
    except Exception:
        return 5

golden_tests = [
    {"query": "What is the battery life of Sony headphones?", "expected_doc": "D001", "expected_in_answer": "30"},
    {"query": "Which phone has a 200MP camera?", "expected_doc": "D002", "expected_in_answer": "Samsung"},
    {"query": "What RAM does the Dell laptop have?", "expected_doc": "D003", "expected_in_answer": "32GB"}
]

passed = 0
for test in golden_tests:
    start = time.time()
    docs = retrieve(test["query"])
    ans = answer(test["query"], docs)
    latency_ms = (time.time() - start) * 1000
    context = " ".join([d["text"] for d in docs])
    ground_score = score_groundedness(ans, context)
    retrieval_ok = any(d["id"] == test["expected_doc"] for d in docs)
    answer_ok = test["expected_in_answer"].lower() in ans.lower()
    ok = retrieval_ok and answer_ok and ground_score >= 7
    if ok:
        passed += 1
    status = "PASS" if ok else "FAIL"
    print(f"[{status}] {test['query'][:45]}")
    print(f"  Retrieval: {retrieval_ok} | Answer check: {answer_ok} | Groundedness: {ground_score}/10 | {latency_ms:.0f}ms")

print(f"\nResult: {passed}/{len(golden_tests)} tests passed")

It gives the following output,

[PASS] What is the battery life of Sony headphones?
  Retrieval: True | Answer check: True | Groundedness: 10/10 | 1243ms

[PASS] Which phone has a 200MP camera?
  Retrieval: True | Answer check: True | Groundedness: 10/10 | 987ms

[PASS] What RAM does the Dell laptop have?
  Retrieval: True | Answer check: True | Groundedness: 10/10 | 1102ms

Result: 3/3 tests passed

For ShopMax India, run this end-to-end test suite automatically on every pull request that touches the RAG pipeline, system prompt, or product data schema. Maintain a golden dataset of at least 50 query-answer pairs covering your top product categories, common failure modes, and edge cases. Set minimum thresholds: Retrieval Accuracy above 85%, Groundedness above 8/10, and latency under 3 seconds at the 95th percentile. When a test fails, the detailed output (which sub-check failed, what was retrieved, what was answered) pinpoints whether the issue is in retrieval, context compression, or the LLM prompt - making debugging fast and systematic.

Send your comments, suggestions or queries regarding this site to [email protected].