A/B Testing LLM Responses with Evidently AI
Author: Venkata Sudhakar
ShopMax India frequently updates its chatbot prompts to improve response quality. Without a structured testing process, there is no reliable way to tell whether a new prompt performs better than the current one. Evidently AI is an open-source framework for evaluating and comparing ML model outputs. Applied to LLMs, it lets ShopMax India run side-by-side comparisons of two prompt versions on the same real user queries and measure quality differences objectively.
The A/B test generates responses from two prompt variants on the same set of input queries. Evidently AI computes text quality metrics - sentiment, length distribution, and semantic similarity - for both variants. Results appear in an HTML report or can be logged to a monitoring dashboard. ShopMax India uses this approach before rolling out any prompt change to production.
The example below compares two system prompts for the ShopMax India product recommendation chatbot using GPT-4o and Evidently AI.
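A sketch of such a comparison is shown below. It assumes the evidently 0.4-style Report API (`TextEvals` preset with `TextLength` and `Sentiment` descriptors; newer releases may differ) and the official `openai` Python client. The prompt texts, the helper names (`generate_responses`, `avg_length`, `build_report`, `run_ab_test`), and the report path are illustrative, not ShopMax India's actual code.

```python
import pandas as pd

# Hypothetical prompt variants -- the real ShopMax India prompts are not shown here.
PROMPT_A = ("You are a product assistant for ShopMax India. "
            "Answer in one or two short sentences.")
PROMPT_B = ("You are a product assistant for ShopMax India. "
            "Recommend products with key specs, a price range, and one alternative.")


def generate_responses(client, system_prompt, queries, model="gpt-4o"):
    """Generate one chatbot response per query under the given system prompt."""
    responses = []
    for q in queries:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": q}],
        )
        responses.append(r.choices[0].message.content)
    return responses


def avg_length(responses):
    """Average response length in characters, rounded to the nearest int."""
    return round(sum(len(r) for r in responses) / len(responses))


def build_report(df_a, df_b, path="ab_test_report.html"):
    """Compare variant A (reference) against variant B (current) with Evidently.

    Uses the evidently 0.4-style Report API; adjust imports for newer releases.
    """
    from evidently.report import Report
    from evidently.metric_preset import TextEvals
    from evidently.descriptors import TextLength, Sentiment

    report = Report(metrics=[
        TextEvals(column_name="response",
                  descriptors=[TextLength(), Sentiment()]),
    ])
    report.run(reference_data=df_a, current_data=df_b)
    report.save_html(path)
    return path


def run_ab_test(queries):
    """Generate responses for both variants, build the report, print a summary."""
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    df_a = pd.DataFrame({"query": queries,
                         "response": generate_responses(client, PROMPT_A, queries)})
    df_b = pd.DataFrame({"query": queries,
                         "response": generate_responses(client, PROMPT_B, queries)})
    path = build_report(df_a, df_b)
    print(f"Report saved to {path}")
    print(f"Variant A avg length: {avg_length(df_a['response'])} chars")
    print(f"Variant B avg length: {avg_length(df_b['response'])} chars")
```

A driver would call `run_ab_test(["Suggest a budget phone under 15,000 rupees", ...])` with the real query set; the HTML report then shows the side-by-side length and sentiment distributions for both variants.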
It gives the following output:
Report saved to ab_test_report.html
Variant A avg length: 298 chars
Variant B avg length: 591 chars
Variant B shows higher information density - recommend for product pages.
Run A/B tests on at least 20 queries so the comparison is statistically meaningful. Add an LLM-as-judge column: ask GPT-4o to score each response on helpfulness from 1 to 5, and include the scores in the Evidently report. Save test results to a database so you can track prompt quality improvements over time. Never deploy a new prompt to production without a passing A/B test against the current baseline.
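The LLM-as-judge column can be sketched as follows. The judge prompt wording and the `parse_score`/`judge_response` helpers are assumptions for illustration; the scores would be added as an extra DataFrame column before running the Evidently report.

```python
import re

# Hypothetical judge prompt -- wording is an assumption, tune it for your use case.
JUDGE_PROMPT = (
    "Rate the following chatbot response for helpfulness on a scale of 1 to 5. "
    "Reply with the number only.\n\n"
    "User query: {query}\n\nResponse: {response}"
)


def parse_score(raw, lo=1, hi=5):
    """Extract the first integer in the judge's reply and clamp it to [lo, hi]."""
    m = re.search(r"\d+", raw)
    if m is None:
        raise ValueError(f"no score found in judge output: {raw!r}")
    return max(lo, min(hi, int(m.group())))


def judge_response(client, query, response, model="gpt-4o"):
    """Ask GPT-4o to score one response for helpfulness; returns an int in 1..5."""
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(query=query, response=response)}],
    )
    return parse_score(r.choices[0].message.content)
```

With scores collected per row (e.g. `df["judge_score"] = [...]`), the column appears alongside the text metrics in the report, giving a single view of length, sentiment, and judged helpfulness per variant.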