Building an LLM Evaluation Pipeline with RAGAS
Author: Venkata Sudhakar
ShopMax India's product Q&A system retrieves answers from a knowledge base of product manuals, warranty documents, and FAQs. Evaluating whether those answers are accurate and grounded in retrieved context requires an automated evaluation pipeline. RAGAS provides ready-made metrics for faithfulness, answer relevancy, context precision, and context recall to measure RAG pipeline quality objectively.
RAGAS evaluates RAG pipelines using four core metrics, each scored from 0 to 1. Faithfulness checks whether every claim in the answer is grounded in the provided context. Answer Relevancy checks whether the answer actually addresses the question. Context Precision measures how much of the retrieved context is relevant to the question, rewarding retrievals that rank relevant chunks first. Context Recall checks whether the retrieved context covers everything needed to reproduce the ground truth answer. Scores are computed using an LLM-as-judge pattern and returned per row along with dataset-level aggregates.
The example below runs a RAGAS evaluation for ShopMax India's warranty Q&A system using a test dataset with questions, ground truth answers, generated answers, and retrieved context chunks.
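Below is a minimal sketch of that script. It assumes the ragas 0.1.x API and dataset schema, plus an OPENAI_API_KEY in the environment for the default judge and embedding models; the two warranty rows are illustrative stand-ins for a real test set, so your printed scores will differ from the figures shown after it.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# ragas 0.1.x schema: "contexts" holds the retrieved chunks per question,
# "ground_truth" a human-written reference answer. Rows are illustrative.
data = {
    "question": [
        "What is the warranty period for the ShopMax AquaPure water purifier?",
        "Does the mixer grinder warranty cover motor burnout?",
    ],
    "answer": [
        "The AquaPure purifier carries a 1-year comprehensive warranty, "
        "extendable by 6 months on registration.",
        "Yes, motor burnout is covered for 2 years from the date of purchase.",
    ],
    "contexts": [
        ["AquaPure water purifier: 1-year comprehensive warranty. "
         "Registering within 30 days extends coverage by 6 months."],
        ["Mixer grinder warranty: 2 years on the motor, including burnout. "
         "Jars and blades carry 6 months."],
    ],
    "ground_truth": [
        "1-year comprehensive warranty, extendable by 6 months on registration.",
        "Yes, the motor, including burnout, is covered for 2 years.",
    ],
}

dataset = Dataset.from_dict(data)

# Each metric scores every row via the LLM judge; the result object
# exposes the per-metric aggregates by key.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print("RAGAS Evaluation - ShopMax India Warranty Q and A")
print("=" * 50)
for name in ("faithfulness", "answer_relevancy",
             "context_precision", "context_recall"):
    print(f"{name:<18}: {result[name]:.3f}")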
Running it gives the following output:
RAGAS Evaluation - ShopMax India Warranty Q and A
==================================================
faithfulness : 0.921
answer_relevancy : 0.887
context_precision : 0.893
context_recall : 0.856
In production, build the evaluation dataset from real production queries sampled weekly. Automate RAGAS runs in CI/CD to catch regressions before deploying updated prompts or retrieval configs. Set threshold alerts: if faithfulness drops below 0.80 or answer relevancy below 0.75, fail the deployment. Store historical scores in a time-series database to track quality trends across model versions and knowledge base updates.
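As a sketch of such a CI gate, the snippet below reads the aggregate scores from a JSON artifact and exits non-zero so the pipeline job fails when a threshold is breached. The ragas_scores.json path and its keys are assumptions about how the evaluation step exports its results, not part of the RAGAS API.

import json
import sys

# Deployment quality gates from the thresholds above.
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
}

def check_gates(scores: dict) -> list:
    """Return a message for every metric that falls below its threshold."""
    return [
        f"{metric} = {scores.get(metric, 0.0):.3f} < {minimum:.2f}"
        for metric, minimum in THRESHOLDS.items()
        if scores.get(metric, 0.0) < minimum
    ]

if __name__ == "__main__":
    # Hypothetical artifact written by the evaluation step.
    with open("ragas_scores.json") as f:
        scores = json.load(f)
    failures = check_gates(scores)
    if failures:
        for message in failures:
            print(f"QUALITY GATE FAILED: {message}")
        sys.exit(1)  # non-zero exit aborts the deployment pipeline
    print("All quality gates passed.")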