
RAG Precision and Recall Testing - Measuring Retrieval Quality

Author: Venkata Sudhakar

RAG precision and recall testing measures how accurately the retriever fetches relevant documents for a given query. ShopMax India needs this to validate that its product Q&A system retrieves the correct product documents before the LLM ever sees them: if the retriever misses the relevant document, the LLM has little chance of answering correctly, however capable it is. Precision measures the fraction of retrieved documents that are actually relevant; recall measures the fraction of all relevant documents that were retrieved.

Building a RAG evaluation set requires defining ground truth: for each test query, specify which document IDs should ideally be retrieved. Precision@k is the number of relevant documents in the top-k results divided by k. Recall@k is the number of relevant documents in the top-k results divided by the total number of relevant documents for that query. Mean Reciprocal Rank (MRR) is the average over all queries of 1/rank, where rank is the position of the first relevant document in the ranked list (a query with no relevant document retrieved contributes 0). Together, these metrics show how clean the result list is, how much of the relevant material it covers, and how early the first useful document appears.
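To make these definitions concrete, here is a minimal sketch that scores one hand-written ranked list. The function name `score_ranking`, the ranking, and the relevant set are illustrative assumptions rather than part of any library or of the ShopMax data.

```python
def score_ranking(retrieved, relevant, k):
    """Return (precision@k, recall@k, reciprocal rank) for a single query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    # Guard against an empty ground-truth set to avoid dividing by zero.
    recall = hits / len(relevant) if relevant else 0.0
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break
    return precision, recall, reciprocal_rank


# Hypothetical ranking: the first relevant document appears at rank 2.
retrieved = ["D004", "D001", "D003", "D002"]
relevant = {"D001", "D002"}
p, r, rr = score_ranking(retrieved, relevant, k=3)
print(f"P@3={p:.2f} R@3={r:.2f} RR={rr:.2f}")  # P@3=0.33 R@3=0.50 RR=0.50
```

Only one of the two relevant documents lands in the top 3, so precision is 1/3 and recall is 1/2; the first hit is at rank 2, so the reciprocal rank is 0.5. MRR is the mean of that last value across all queries in the evaluation set.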

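The following example implements a precision and recall evaluation harness for ShopMax India's RAG retriever, which in this example is BM25 from the rank_bm25 package. A ground truth dataset maps queries to expected document IDs, and the harness computes Precision@3, Recall@3, and the reciprocal rank (printed as MRR) for each query, followed by the mean of each metric.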

import re

from rank_bm25 import BM25Okapi  # third-party package: pip install rank_bm25

# Toy product catalogue: document ID -> product description.
docs = {
    "D001": "Sony WH-1000XM5: 30-hour battery, noise-cancelling, Rs 29990, Mumbai, Bangalore.",
    "D002": "Bose QC45: 24-hour battery, noise-cancelling, Rs 24990, Delhi, Chennai.",
    "D003": "Apple AirPods Max: 20-hour battery, noise-cancelling, Rs 59900, pan-India.",
    "D004": "Samsung Galaxy S24: 50MP camera, Rs 79999, pan-India.",
    "D005": "Dell XPS 15: 32GB RAM, Rs 135000, Mumbai, Delhi."
}

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters so punctuation does not
    # block matches (e.g. "Mumbai," vs "mumbai", "noise-cancelling" vs "noise cancelling").
    return re.findall(r"[a-z0-9]+", text.lower())

doc_ids = list(docs.keys())
doc_texts = list(docs.values())
tokenized = [tokenize(t) for t in doc_texts]
bm25 = BM25Okapi(tokenized)

# Ground truth: each query is mapped to the set of document IDs that should be retrieved.
eval_set = [
    {"query": "noise-cancelling headphones under Rs 30000", "relevant": {"D001", "D002"}},
    {"query": "Sony WH-1000XM5 battery life", "relevant": {"D001"}},
    {"query": "laptop with 32GB RAM in Mumbai", "relevant": {"D005"}},
    {"query": "Apple noise cancelling headphones price", "relevant": {"D003"}}
]

def retrieve(query, top_k=3):
    # Score every document against the query and return the IDs of the top_k highest scores.
    scores = bm25.get_scores(tokenize(query))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [doc_ids[i] for i in ranked]

def evaluate(eval_set, top_k=3):
    precision_scores, recall_scores, mrr_scores = [], [], []
    for item in eval_set:
        retrieved = retrieve(item["query"], top_k)
        relevant = item["relevant"]
        # 1 for each retrieved document that is in the ground-truth set, else 0.
        hits = [1 if r in relevant else 0 for r in retrieved]
        precision = sum(hits) / top_k
        recall = sum(hits) / len(relevant)
        # Reciprocal rank: 1 / position of the first relevant document, 0 if none was retrieved.
        mrr = 0.0
        for rank, r in enumerate(retrieved, 1):
            if r in relevant:
                mrr = 1.0 / rank
                break
        precision_scores.append(precision)
        recall_scores.append(recall)
        mrr_scores.append(mrr)
        print(f"Query: {item['query'][:50]}")
        print(f"  Retrieved: {retrieved} | Relevant: {relevant}")
        print(f"  P@{top_k}: {precision:.2f} | R@{top_k}: {recall:.2f} | MRR: {mrr:.2f}")
    n = len(eval_set)
    print(f"\nMean P@{top_k}: {sum(precision_scores)/n:.2f}")
    print(f"Mean R@{top_k}: {sum(recall_scores)/n:.2f}")
    print(f"Mean MRR: {sum(mrr_scores)/n:.2f}")

evaluate(eval_set)

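The article reports the following output; the exact order of the retrieved IDs depends on how the text is tokenized and on how ties in BM25 scores fall, and Python prints sets in arbitrary order: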

Query: noise-cancelling headphones under Rs 30000
  Retrieved: ['D001', 'D002', 'D003'] | Relevant: {'D001', 'D002'}
  P@3: 0.67 | R@3: 1.00 | MRR: 1.00

Query: Sony WH-1000XM5 battery life
  Retrieved: ['D001', 'D002', 'D003'] | Relevant: {'D001'}
  P@3: 0.33 | R@3: 1.00 | MRR: 1.00

Query: laptop with 32GB RAM in Mumbai
  Retrieved: ['D005', 'D001', 'D004'] | Relevant: {'D005'}
  P@3: 0.33 | R@3: 1.00 | MRR: 1.00

Query: Apple noise cancelling headphones price
  Retrieved: ['D003', 'D001', 'D002'] | Relevant: {'D003'}
  P@3: 0.33 | R@3: 1.00 | MRR: 1.00

Mean P@3: 0.42
Mean R@3: 1.00
Mean MRR: 1.00

For ShopMax India, build the evaluation set from historical customer queries: sample 200 real queries from your logs, manually annotate which product IDs should have been retrieved, and run this harness after every retriever change. A Precision@3 above 0.6 and a Recall@3 above 0.8 are reasonable production targets, but read the precision figure against the number of relevant documents per query, because a query with a single relevant document cannot score above 0.33 at k=3. If recall is high but precision is low, the retriever is fetching too many irrelevant documents; reduce top_k or add a reranking step. If precision is high but recall is low, relevant documents are being missed; check your chunk size and how well the embedding model handles your specific product vocabulary.
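Because Precision@k is bounded by min(number of relevant documents, k) / k, it can help to compute that ceiling for an evaluation set before holding it to a fixed target. The sketch below does so for the four queries in this article and checks the reported means against the 0.6 and 0.8 targets. `precision_ceiling` and `meets_targets` are hypothetical helper names, not part of any library.

```python
def precision_ceiling(num_relevant, k):
    """Highest Precision@k a perfect retriever could reach for one query."""
    return min(num_relevant, k) / k


def meets_targets(mean_precision, mean_recall, p_target=0.6, r_target=0.8):
    """Return True when both mean metrics reach their targets."""
    return mean_precision >= p_target and mean_recall >= r_target


# Relevant-set sizes of the four evaluation queries used in this article.
relevant_counts = [2, 1, 1, 1]
k = 3
ceilings = [precision_ceiling(n, k) for n in relevant_counts]
mean_ceiling = sum(ceilings) / len(ceilings)
print(f"Best possible mean P@{k}: {mean_ceiling:.2f}")  # Best possible mean P@3: 0.42

# Means reported in the article's output (assumed inputs, not recomputed here).
print("Targets met:", meets_targets(0.42, 1.00))  # Targets met: False
```

On this small evaluation set the best possible mean P@3 is 0.42, so a 0.6 precision target only becomes meaningful at a smaller k or on queries that have more relevant documents.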
