A/B Testing ADK Agent Prompts in Production
Author: Venkata Sudhakar
ShopMax India's team regularly experiments with prompt changes - a more concise system prompt, a different tone for escalation handling, or additional context about return policies. Instead of guessing which version performs better, A/B testing splits live traffic between two variants and uses statistical hypothesis testing to determine a winner with confidence. This replaces opinion-driven prompt decisions with data-driven ones.
The A/B test framework assigns each user session to a variant (A or B) using a deterministic hash of the session ID, so the same session always maps to the same variant. Outcomes are recorded as binary success signals - did the customer resolve their query without escalating? After enough sessions, a two-proportion z-test compares the conversion rates of the two variants. An absolute z-score above 1.96 means the difference is statistically significant at the 95% confidence level. Below that threshold, collect more data before drawing conclusions.
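As a concrete illustration, the assignment and significance check can be sketched in a few lines of plain Python. The helper names (assign_variant, z_score) and the MD5-based hashing are assumptions made for this sketch, not part of any ADK API.

import hashlib
import math

def assign_variant(session_id: str) -> str:
    """Deterministically map a session ID to variant A or B via a stable hash."""
    digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def z_score(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-proportion z-test on conversion rates, using a pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return abs(p_b - p_a) / se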
The example shows ShopMax India simulating 200 sessions split between two prompt variants. Variant B has a higher success rate and the z-test confirms the difference is statistically significant, making B the clear winner.
Running the example produces the following output:
Variant A - sessions: 103 | success rate: 0.748
Variant B - sessions: 97 | success rate: 0.856
Z-score: 2.143
Statistically significant: True
Recommendation: Deploy B
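A minimal simulation along those lines could be wired up as follows, reusing the assign_variant and z_score helpers from the sketch above. The underlying success probabilities (0.75 for variant A, 0.85 for variant B) are illustrative assumptions, so the exact figures will differ between runs.

import random

# Tally sessions and successes per variant; assignment is by session-ID hash.
stats = {"A": {"n": 0, "wins": 0}, "B": {"n": 0, "wins": 0}}
for i in range(200):
    variant = assign_variant(f"session-{i}")
    # Assumed underlying resolution rates for the two prompt variants.
    success = random.random() < (0.75 if variant == "A" else 0.85)
    stats[variant]["n"] += 1
    stats[variant]["wins"] += int(success)

rate_a = stats["A"]["wins"] / stats["A"]["n"]
rate_b = stats["B"]["wins"] / stats["B"]["n"]
print(f"Variant A - sessions: {stats['A']['n']} | success rate: {rate_a:.3f}")
print(f"Variant B - sessions: {stats['B']['n']} | success rate: {rate_b:.3f}")

z = z_score(stats["A"]["wins"], stats["A"]["n"], stats["B"]["wins"], stats["B"]["n"])
print(f"Z-score: {z:.3f}")
print(f"Statistically significant: {z > 1.96}")
if z > 1.96:
    print(f"Recommendation: Deploy {'B' if rate_b > rate_a else 'A'}")
else:
    print("Recommendation: Collect more data")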
Run A/B tests for a minimum of 200 sessions per variant before reading results - smaller samples produce unreliable z-scores. Use a deterministic hash for variant assignment so users get a consistent experience across sessions. Define the success metric before starting the test - resolving without escalation, CSAT rating above 4, or task completion - not after seeing the results. After declaring a winner, update the canonical prompt, retire the losing variant, and commit the winning prompt to version control. Run one A/B test at a time to avoid confounding effects between simultaneous experiments.
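One way to enforce the minimum-sample guideline is a guard that refuses to report a result until both variants have enough sessions. The threshold constant and function name below are assumptions that simply mirror the guideline above.

MIN_SESSIONS_PER_VARIANT = 200  # assumed threshold matching the guideline above

def ready_to_evaluate(n_a: int, n_b: int) -> bool:
    """Only read the z-test once both variants reach the minimum sample size."""
    return min(n_a, n_b) >= MIN_SESSIONS_PER_VARIANT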