A/B Testing ADK Agent Prompts in Production
Author: Venkata Sudhakar
ShopMax India's team regularly experiments with prompt changes - a more concise system prompt, a different tone for escalation handling, or additional context about return policies. Instead of guessing which version performs better, A/B testing splits live traffic between two variants and uses statistical hypothesis testing to determine a winner with confidence. This replaces opinion-driven prompt decisions with data-driven ones.
The A/B test framework assigns each user session to a variant (A or B) using a deterministic hash of the session ID, so the same session always maps to the same variant. Outcomes are recorded as binary success signals - did the customer resolve their query without escalating? After enough sessions, a two-proportion z-test compares the conversion rates of the two variants. An absolute z-score above 1.96 means the difference is statistically significant at the 95% confidence level. Below that threshold, collect more data before drawing conclusions.
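As a concrete illustration, the assignment and significance check can be sketched in a few lines of plain Python. The helper names (assign_variant, z_score) and the MD5-based hashing are assumptions made for this sketch, not part of any ADK API.

import hashlib
import math

def assign_variant(session_id: str) -> str:
    """Deterministically map a session ID to variant A or B via a stable hash."""
    digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def z_score(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-proportion z-test on conversion rates, using a pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return abs(p_b - p_a) / se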
The example shows ShopMax India simulating 200 sessions split between two prompt variants. Variant B has a higher success rate and the z-test confirms the difference is statistically significant, making B the clear winner.
Running the example produces the following output:
Variant A - sessions: 103 | success rate: 0.748
Variant B - sessions: 97 | success rate: 0.856
Z-score: 2.143
Statistically significant: True
Recommendation: Deploy B
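A minimal simulation along those lines could be wired up as follows, reusing the assign_variant and z_score helpers from the sketch above. The underlying success probabilities (0.75 for variant A, 0.85 for variant B) are illustrative assumptions, so the exact figures will differ between runs.

import random

# Tally sessions and successes per variant; assignment is by session-ID hash.
stats = {"A": {"n": 0, "wins": 0}, "B": {"n": 0, "wins": 0}}
for i in range(200):
    variant = assign_variant(f"session-{i}")
    # Assumed underlying resolution rates for the two prompt variants.
    success = random.random() < (0.75 if variant == "A" else 0.85)
    stats[variant]["n"] += 1
    stats[variant]["wins"] += int(success)

rate_a = stats["A"]["wins"] / stats["A"]["n"]
rate_b = stats["B"]["wins"] / stats["B"]["n"]
print(f"Variant A - sessions: {stats['A']['n']} | success rate: {rate_a:.3f}")
print(f"Variant B - sessions: {stats['B']['n']} | success rate: {rate_b:.3f}")

z = z_score(stats["A"]["wins"], stats["A"]["n"], stats["B"]["wins"], stats["B"]["n"])
print(f"Z-score: {z:.3f}")
print(f"Statistically significant: {z > 1.96}")
if z > 1.96:
    print(f"Recommendation: Deploy {'B' if rate_b > rate_a else 'A'}")
else:
    print("Recommendation: Collect more data")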
Run A/B tests for a minimum of 200 sessions per variant before reading results - smaller samples produce unreliable z-scores. Use a deterministic hash for variant assignment so users get a consistent experience across sessions. Define the success metric before starting the test - resolving without escalation, CSAT rating above 4, or task completion - not after seeing the results. After declaring a winner, update the canonical prompt, retire the losing variant, and commit the winning prompt to version control. Run one A/B test at a time to avoid confounding effects between simultaneous experiments.
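One way to enforce the minimum-sample guideline is a guard that refuses to report a result until both variants have enough sessions. The threshold constant and function name below are assumptions that simply mirror the guideline above.

MIN_SESSIONS_PER_VARIANT = 200  # assumed threshold matching the guideline above

def ready_to_evaluate(n_a: int, n_b: int) -> bool:
    """Only read the z-test once both variants reach the minimum sample size."""
    return min(n_a, n_b) >= MIN_SESSIONS_PER_VARIANT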