
Prompt Versioning and A/B Testing for LLM Applications

Author: Venkata Sudhakar

Prompt versioning and A/B testing let ShopMax India measure whether a new prompt actually performs better before it is rolled out to all users. When the marketing team wants to rewrite the product description generator prompt, an A/B test splits live traffic between the old and new prompts, measures output quality on each, and supplies the data needed to make the switch with confidence rather than by guesswork.

The implementation stores prompt versions in a dictionary or database keyed by version ID, uses a hash of the request ID to split traffic deterministically (the same request always hits the same variant), logs each call with its variant and a quality signal, and then aggregates the results. Quality signals can be explicit (a user rating) or proxy metrics such as response length, keyword presence, or downstream conversion rate.
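
As a rough sketch of the hash-based split described above, the function below maps a request ID to a stable bucket from 0 to 99 and sends a configurable share of those buckets to v2. The function name `assign_variant`, the `v2_percent` parameter, and the `salt` string are illustrative assumptions, not part of the ShopMax India code; SHA-256 is used here only for stable bucketing, not for security.

```python
import hashlib


def assign_variant(request_id: str, v2_percent: int = 50,
                   salt: str = "desc-prompt-test") -> str:
    """Deterministically map a request ID to "v1" or "v2".

    v2_percent is the share of traffic (0-100) sent to v2.
    salt is a hypothetical experiment name, so that separate
    experiments do not bucket the same requests identically.
    """
    if not 0 <= v2_percent <= 100:
        raise ValueError("v2_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "v2" if bucket < v2_percent else "v1"


# The same ID always lands in the same variant
print(assign_variant("REQ-42") == assign_variant("REQ-42"))

# Measure the observed v2 share over many hypothetical request IDs
ids = [f"REQ-{i}" for i in range(10_000)]
share = sum(assign_variant(r, v2_percent=50) == "v2" for r in ids) / len(ids)
print("v2 share at 50%:", round(share, 3))
```

Because a request's bucket never changes, raising `v2_percent` only moves requests from v1 to v2 and never the other way, which keeps each request's experience consistent if the split is adjusted during a test.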

The example below shows ShopMax India A/B testing two product description prompts, v1 (feature-focused) and v2 (benefit-focused), logging results and computing a comparison summary. For simplicity it uses the product ID as the request ID and word count as the proxy metric.

import anthropic
import hashlib

client = anthropic.Anthropic()

PROMPT_VERSIONS = {
    "v1": "Write a product description listing the key technical features. Be precise and factual. Under 60 words.",
    "v2": "Write a product description focusing on customer benefits and lifestyle fit for Indian buyers. Be warm and persuasive. Under 60 words."
}

def get_variant(request_id):
    # Hash the ID so the same request always maps to the same variant.
    # MD5 is used only for bucketing here, not for security.
    h = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
    return "v1" if h % 2 == 0 else "v2"

def generate_description(product, request_id):
    # Pick the prompt version assigned to this request
    variant = get_variant(request_id)
    prompt = PROMPT_VERSIONS[variant]
    response = client.messages.create(
        model="claude-haiku-4-5", max_tokens=128,
        messages=[{"role": "user", "content": prompt + "\n\nProduct: " + product}]
    )
    description = response.content[0].text
    # Word count is the proxy quality metric recorded for each call
    word_count = len(description.split())
    return variant, description, word_count

products = [
    ("PROD-001", "Samsung 55-inch 4K QLED TV with HDR10+ and 120Hz"),
    ("PROD-002", "LG 1.5 Ton 5 Star Inverter AC with Wi-Fi"),
    ("PROD-003", "Whirlpool 7kg Front Load Washing Machine"),
    ("PROD-004", "Sony 65-inch OLED TV with Acoustic Surface Audio"),
]

# Collected metrics, grouped by variant
results = {"v1": [], "v2": []}

# The product ID stands in for the request ID, so each product always gets the same variant
for prod_id, product in products:
    variant, desc, words = generate_description(product, prod_id)
    results[variant].append({"product": product[:30], "words": words})
    print("Variant:", variant, "| Product:", product[:40])
    print("Description:", desc[:80], "...")
    print()

# Summarize call count and average word count per variant
for v in ["v1", "v2"]:
    data = results[v]
    if data:
        avg_words = sum(r["words"] for r in data) / len(data)
        print("Variant", v, "- calls:", len(data), "| avg words:", round(avg_words, 1))

A sample run produces output like the following; the generated descriptions and word counts vary between runs.

Variant: v1 | Product: Samsung 55-inch 4K QLED TV with HDR10+ a
Description: Samsung 55-inch 4K QLED TV delivers stunning visuals with HDR10+ ...

Variant: v2 | Product: LG 1.5 Ton 5 Star Inverter AC with Wi-Fi
Description: Beat the heat effortlessly with LG's smart inverter AC - perfect ...

Variant: v1 | Product: Whirlpool 7kg Front Load Washing Machine
Description: Whirlpool 7kg front load washer with 1200 RPM spin, 12 wash progra ...

Variant: v2 | Product: Sony 65-inch OLED TV with Acoustic Surfa
Description: Transform your living room into a cinema with Sony OLED - feel the ...

Variant v1 - calls: 2 | avg words: 42.5
Variant v2 - calls: 2 | avg words: 48.0

At ShopMax India, store all A/B test results in a database table with columns for request_id, variant, timestamp, product_id, and your quality metric. Run each test for at least 1000 impressions before drawing conclusions, because small samples produce noisy results; the four-call run above is far too small to compare the two prompts. Test for statistical significance (chi-square for conversion rates, t-test for continuous metrics such as word count) before declaring a winner. Keep v1 active until v2 is confirmed better, then shift traffic to v2 gradually over a few days until it reaches 100%, rather than switching instantly.
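
The storage and significance-testing advice above could be sketched as follows, assuming SQLite as the database and a binary conversion flag as the quality metric. The table name `ab_results`, the `converted` column, and all function names are hypothetical, and the data in the usage section is synthetic, for illustration only. The chi-square test is written out with the standard library (Pearson's test on a 2x2 table, without continuity correction) rather than taken from a statistics package.

```python
import math
import sqlite3
from datetime import datetime, timezone


def create_table(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ab_results (
               request_id TEXT PRIMARY KEY,
               variant    TEXT NOT NULL,
               timestamp  TEXT NOT NULL,
               product_id TEXT NOT NULL,
               converted  INTEGER NOT NULL  -- assumed metric: 1 = converted, 0 = not
           )"""
    )


def log_result(conn, request_id, variant, product_id, converted):
    conn.execute(
        "INSERT INTO ab_results VALUES (?, ?, ?, ?, ?)",
        (request_id, variant, datetime.now(timezone.utc).isoformat(),
         product_id, int(converted)),
    )


def conversion_counts(conn):
    """Return {variant: (conversions, impressions)}."""
    rows = conn.execute(
        "SELECT variant, SUM(converted), COUNT(*) FROM ab_results GROUP BY variant"
    ).fetchall()
    return {variant: (conv, total) for variant, conv, total in rows}


def chi_square_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-square test on a 2x2 table; returns (statistic, p_value)."""
    observed = [[conv_a, total_a - conv_a], [conv_b, total_b - conv_b]]
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, (total_a - conv_a) + (total_b - conv_b)]
    if min(row_totals + col_totals) == 0:
        raise ValueError("each row and column needs at least one observation")
    grand = total_a + total_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2))
    return stat, math.erfc(math.sqrt(stat / 2))


# Usage with synthetic data: 1000 impressions per variant
conn = sqlite3.connect(":memory:")
create_table(conn)
for i in range(1000):
    log_result(conn, f"A-{i}", "v1", "PROD-001", i < 100)  # 100 conversions
    log_result(conn, f"B-{i}", "v2", "PROD-001", i < 130)  # 130 conversions
counts = conversion_counts(conn)
stat, p = chi_square_2x2(*counts["v1"], *counts["v2"])
print(counts, "| chi-square:", round(stat, 2), "| p-value:", round(p, 4))
```

A common convention is to treat a p-value below 0.05 as significant, but the threshold should be chosen before the test starts rather than after looking at the results.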
