RAG Precision and Recall Testing - Measuring Retrieval Quality
Author: Venkata Sudhakar
RAG precision and recall testing measures how accurately the retriever fetches relevant documents for a given query. ShopMax India needs this to validate that its product Q&A system retrieves the correct product documents before the LLM even sees them, because a retrieval failure almost always produces a wrong answer regardless of how good the LLM is. Precision measures the fraction of retrieved documents that are actually relevant; recall measures the fraction of all relevant documents that were retrieved.
Building a RAG evaluation set requires defining ground truth: for each test query, specify which document IDs should ideally be retrieved. Precision@k is the number of relevant documents in the top-k results divided by k. Recall@k is the number of relevant documents in the top-k results divided by the total number of relevant documents for that query. Mean Reciprocal Rank (MRR) measures how high the first relevant document appears in the ranked list: each query scores 1/rank of its first relevant result (0 if no relevant document is retrieved), and MRR is the mean of these scores over all queries. Together, these metrics show how clean the results are (precision), how complete they are (recall), and how well they are ordered (MRR).
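The three definitions above can be sketched as small Python functions. This is a minimal illustration, not a library API: the function names and the document IDs in the worked example are hypothetical, and returning 0.0 for a query with no relevant documents is an assumed convention.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents among the top-k results, divided by k."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents among the top-k results, divided by all relevant documents."""
    if not relevant:
        return 0.0  # assumed convention for queries with no ground truth
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Worked example with hypothetical IDs: one of two relevant documents
# is retrieved, and it appears at rank 2.
retrieved = ["D004", "D001", "D007"]
relevant = {"D001", "D002"}
print(f"P@3: {precision_at_k(retrieved, relevant, 3):.2f}")  # P@3: 0.33
print(f"R@3: {recall_at_k(retrieved, relevant, 3):.2f}")     # R@3: 0.50
print(f"RR:  {reciprocal_rank(retrieved, relevant):.2f}")    # RR:  0.50
```

MRR for an evaluation set is then the mean of `reciprocal_rank` over all queries.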
The following example implements a precision and recall evaluation harness for ShopMax India's RAG retriever. A ground truth dataset maps queries to expected document IDs, and the harness computes Precision@3, Recall@3, and the reciprocal rank for each query (labelled MRR in the output), then reports the mean of each metric across all queries.
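A minimal sketch of such a harness is given here. The queries, document IDs, and relevance labels are the ones that appear in this article's sample output. The `retrieve` function is a hypothetical stand-in that returns canned rankings, since the real ShopMax India retriever (vector store, embedding model) is not specified; in practice you would replace it with a call to your own retriever.

```python
from statistics import mean
from typing import Callable, Dict, List, Set

# Ground truth: query -> document IDs that should be retrieved.
GROUND_TRUTH: Dict[str, Set[str]] = {
    "noise-cancelling headphones under Rs 30000": {"D001", "D002"},
    "Sony WH-1000XM5 battery life": {"D001"},
    "laptop with 32GB RAM in Mumbai": {"D005"},
    "Apple noise cancelling headphones price": {"D003"},
}

# Hypothetical stand-in for the real retriever: canned ranked results per query.
CANNED_RESULTS: Dict[str, List[str]] = {
    "noise-cancelling headphones under Rs 30000": ["D001", "D002", "D003"],
    "Sony WH-1000XM5 battery life": ["D001", "D002", "D003"],
    "laptop with 32GB RAM in Mumbai": ["D005", "D001", "D004"],
    "Apple noise cancelling headphones price": ["D003", "D001", "D002"],
}


def retrieve(query: str, k: int) -> List[str]:
    """Assumed retriever interface: return the top-k document IDs for a query."""
    return CANNED_RESULTS[query][:k]


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(retriever: Callable[[str, int], List[str]],
             ground_truth: Dict[str, Set[str]], k: int = 3) -> List[str]:
    """Score every query and return the report as a list of lines."""
    lines: List[str] = []
    p_scores, r_scores, rr_scores = [], [], []
    for query, relevant in ground_truth.items():
        retrieved = retriever(query, k)
        p = precision_at_k(retrieved, relevant, k)
        r = recall_at_k(retrieved, relevant, k)
        rr = reciprocal_rank(retrieved, relevant)
        p_scores.append(p)
        r_scores.append(r)
        rr_scores.append(rr)
        # Sort the relevant IDs so the printed set is stable across runs.
        shown = "{" + ", ".join(repr(d) for d in sorted(relevant)) + "}"
        lines.append(f"Query: {query}")
        lines.append(f"Retrieved: {retrieved} | Relevant: {shown}")
        lines.append(f"P@{k}: {p:.2f} | R@{k}: {r:.2f} | MRR: {rr:.2f}")
    lines.append(f"Mean P@{k}: {mean(p_scores):.2f}")
    lines.append(f"Mean R@{k}: {mean(r_scores):.2f}")
    lines.append(f"Mean MRR: {mean(rr_scores):.2f}")
    return lines


report = evaluate(retrieve, GROUND_TRUTH, k=3)
print("\n".join(report))
```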
It gives the following output:
Query: noise-cancelling headphones under Rs 30000
Retrieved: ['D001', 'D002', 'D003'] | Relevant: {'D001', 'D002'}
P@3: 0.67 | R@3: 1.00 | MRR: 1.00
Query: Sony WH-1000XM5 battery life
Retrieved: ['D001', 'D002', 'D003'] | Relevant: {'D001'}
P@3: 0.33 | R@3: 1.00 | MRR: 1.00
Query: laptop with 32GB RAM in Mumbai
Retrieved: ['D005', 'D001', 'D004'] | Relevant: {'D005'}
P@3: 0.33 | R@3: 1.00 | MRR: 1.00
Query: Apple noise cancelling headphones price
Retrieved: ['D003', 'D001', 'D002'] | Relevant: {'D003'}
P@3: 0.33 | R@3: 1.00 | MRR: 1.00
Mean P@3: 0.42
Mean R@3: 1.00
Mean MRR: 1.00
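For ShopMax India, build your evaluation set from historical customer queries: sample 200 real queries from your logs, manually annotate which product IDs should have been retrieved, and run this harness on each retriever change. A Precision@3 above 0.6 and Recall@3 above 0.8 are reasonable production targets. Keep in mind that Precision@k is capped by the number of relevant documents: a query with a single relevant document cannot score above 0.33 on Precision@3, which is why the sample run above reaches a mean of only 0.42 despite perfect recall and MRR. Judge precision against the maximum your ground truth allows. If recall is high but precision is low, your retriever is fetching too many irrelevant documents, so reduce top_k or add a reranking step. If precision is high but recall is low, relevant documents are being missed, so check your chunk size and how well your embedding model handles your specific product vocabulary.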
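The targets and remedies above can be turned into a simple check that runs after each evaluation. This is only a sketch: the `diagnose` function and its messages are hypothetical, and the default thresholds are the 0.6 and 0.8 targets suggested above, which you would tune for your own evaluation set.

```python
from typing import List


def diagnose(mean_precision: float, mean_recall: float,
             min_precision: float = 0.6, min_recall: float = 0.8) -> List[str]:
    """Return a suggested remedy for each metric that is not above its target."""
    findings: List[str] = []
    if mean_precision <= min_precision:
        findings.append(
            "precision not above target: reduce top_k or add a reranking step")
    if mean_recall <= min_recall:
        findings.append(
            "recall not above target: check chunk size and embedding model quality")
    return findings


# Using the means from the sample run: precision misses, recall passes.
for finding in diagnose(mean_precision=0.42, mean_recall=1.00):
    print(finding)
```

A check like this fits naturally into a CI step, so that a retriever change that lowers either metric is flagged before it ships.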