A/B Testing LLM Responses with Evidently AI
Author: Venkata Sudhakar
ShopMax India frequently updates its chatbot prompts to improve response quality. Without a structured testing process, there is no reliable way to tell whether a new prompt performs better than the current one. Evidently AI is an open-source framework for evaluating and comparing ML model outputs. Applied to LLMs, it lets ShopMax India run side-by-side comparisons of two prompt versions on the same real user queries and measure quality differences objectively.
The A/B test generates responses from two prompt variants on the same set of input queries. Evidently AI computes text quality metrics - sentiment, length distribution, and semantic similarity - for both variants. Results appear in an HTML report or can be logged to a monitoring dashboard. ShopMax India uses this approach before rolling out any prompt change to production.
The example below compares two system prompts for the ShopMax India product recommendation chatbot using GPT-4o and Evidently AI.
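A sketch of such a comparison is shown below. It assumes the evidently 0.4-style Report API (`TextEvals` preset with `TextLength` and `Sentiment` descriptors; newer releases may differ) and the official `openai` Python client. The prompt texts, the helper names (`generate_responses`, `avg_length`, `build_report`, `run_ab_test`), and the report path are illustrative, not ShopMax India's actual code.

```python
import pandas as pd

# Hypothetical prompt variants -- the real ShopMax India prompts are not shown here.
PROMPT_A = ("You are a product assistant for ShopMax India. "
            "Answer in one or two short sentences.")
PROMPT_B = ("You are a product assistant for ShopMax India. "
            "Recommend products with key specs, a price range, and one alternative.")


def generate_responses(client, system_prompt, queries, model="gpt-4o"):
    """Generate one chatbot response per query under the given system prompt."""
    responses = []
    for q in queries:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": q}],
        )
        responses.append(r.choices[0].message.content)
    return responses


def avg_length(responses):
    """Average response length in characters, rounded to the nearest int."""
    return round(sum(len(r) for r in responses) / len(responses))


def build_report(df_a, df_b, path="ab_test_report.html"):
    """Compare variant A (reference) against variant B (current) with Evidently.

    Uses the evidently 0.4-style Report API; adjust imports for newer releases.
    """
    from evidently.report import Report
    from evidently.metric_preset import TextEvals
    from evidently.descriptors import TextLength, Sentiment

    report = Report(metrics=[
        TextEvals(column_name="response",
                  descriptors=[TextLength(), Sentiment()]),
    ])
    report.run(reference_data=df_a, current_data=df_b)
    report.save_html(path)
    return path


def run_ab_test(queries):
    """Generate responses for both variants, build the report, print a summary."""
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    df_a = pd.DataFrame({"query": queries,
                         "response": generate_responses(client, PROMPT_A, queries)})
    df_b = pd.DataFrame({"query": queries,
                         "response": generate_responses(client, PROMPT_B, queries)})
    path = build_report(df_a, df_b)
    print(f"Report saved to {path}")
    print(f"Variant A avg length: {avg_length(df_a['response'])} chars")
    print(f"Variant B avg length: {avg_length(df_b['response'])} chars")
```

A driver would call `run_ab_test(["Suggest a budget phone under 15,000 rupees", ...])` with the real query set; the HTML report then shows the side-by-side length and sentiment distributions for both variants.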
It gives the following output:
Report saved to ab_test_report.html
Variant A avg length: 298 chars
Variant B avg length: 591 chars
Variant B shows higher information density - recommend for product pages.
Run A/B tests on at least 20 queries so the comparison is statistically meaningful. Add an LLM-as-judge column: ask GPT-4o to score each response on helpfulness from 1 to 5, and include the scores in the Evidently report. Save test results to a database so you can track prompt quality improvements over time. Never deploy a new prompt to production without a passing A/B test against the current baseline.
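The LLM-as-judge column can be sketched as follows. The judge prompt wording and the `parse_score`/`judge_response` helpers are assumptions for illustration; the scores would be added as an extra DataFrame column before running the Evidently report.

```python
import re

# Hypothetical judge prompt -- wording is an assumption, tune it for your use case.
JUDGE_PROMPT = (
    "Rate the following chatbot response for helpfulness on a scale of 1 to 5. "
    "Reply with the number only.\n\n"
    "User query: {query}\n\nResponse: {response}"
)


def parse_score(raw, lo=1, hi=5):
    """Extract the first integer in the judge's reply and clamp it to [lo, hi]."""
    m = re.search(r"\d+", raw)
    if m is None:
        raise ValueError(f"no score found in judge output: {raw!r}")
    return max(lo, min(hi, int(m.group())))


def judge_response(client, query, response, model="gpt-4o"):
    """Ask GPT-4o to score one response for helpfulness; returns an int in 1..5."""
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(query=query, response=response)}],
    )
    return parse_score(r.choices[0].message.content)
```

With scores collected per row (e.g. `df["judge_score"] = [...]`), the column appears alongside the text metrics in the report, giving a single view of length, sentiment, and judged helpfulness per variant.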