Pairwise Preference Evaluation for ADK Agent Responses
Author: Venkata Sudhakar
Pairwise preference evaluation determines which of two ADK agent responses a judge prefers for a given input, providing a relative quality signal that is often more reliable than absolute scoring. ShopMax India uses pairwise evaluation when comparing a new agent version against the current production version: the new version must win or tie at least 60% of head-to-head comparisons on a golden dataset before it is approved for rollout to customers in Delhi and Bangalore.
A preference evaluator receives (query, response_A, response_B) and returns 'A', 'B', or 'tie'. Responses are shuffled before evaluation to avoid position bias. Win rate is computed as (wins + 0.5 * ties) / total_pairs. Tests run the evaluator against a fixture dataset with known expected preferences and assert the win rate exceeds the promotion threshold. In unit tests, the evaluator is replaced with a rule-based mock that prefers responses containing key facts.
The sketch below defines a rule-based preference evaluator, runs five pairwise comparisons from a ShopMax India fixture dataset, and asserts that the candidate response set achieves the required win rate. The queries, responses, and key facts in the fixture are illustrative, not drawn from a real dataset.
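```python
import random

# Hypothetical fixture of five (query, candidate, baseline) pairs from a
# ShopMax India golden dataset. All queries, responses, and key facts
# below are invented for illustration.
FIXTURE_PAIRS = [
    {
        "query": "What is the standard delivery time to Delhi?",
        "candidate": "Standard delivery to Delhi takes 2-3 business days.",
        "baseline": "Delivery times vary by location.",
        "key_facts": ["Delhi", "2-3 business days"],
    },
    {
        "query": "How do I return a defective phone?",
        "candidate": "Raise a return request within 10 days; pickup is free for defective items.",
        "baseline": "You can return items from the orders page.",
        "key_facts": ["10 days", "free"],
    },
    {
        "query": "Is cash on delivery available in Bangalore?",
        "candidate": "Yes, cash on delivery is available in Bangalore for orders under Rs 50,000.",
        "baseline": "Yes, cash on delivery is supported in Bangalore.",
        "key_facts": ["cash on delivery", "Bangalore"],
    },
    {
        "query": "Which payment methods are accepted?",
        "candidate": "We accept UPI, credit and debit cards, and net banking.",
        "baseline": "We accept cards.",
        "key_facts": ["UPI", "net banking"],
    },
    {
        "query": "How do I track my order?",
        "candidate": "Open My Orders and tap Track Shipment for live courier updates.",
        "baseline": "Check your email for updates.",
        "key_facts": ["My Orders", "Track Shipment"],
    },
]

PROMOTION_THRESHOLD = 0.60  # candidate must win or tie at least 60% of comparisons


def rule_based_judge(query: str, response_a: str, response_b: str,
                     key_facts: list[str]) -> str:
    """Rule-based mock judge: prefer the response containing more key facts."""
    score_a = sum(fact.lower() in response_a.lower() for fact in key_facts)
    score_b = sum(fact.lower() in response_b.lower() for fact in key_facts)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def evaluate_pairs(pairs, seed=0):
    """Judge each pair with shuffled response order; return win-rate stats."""
    rng = random.Random(seed)
    wins = ties = losses = 0
    for pair in pairs:
        # Randomize which response sits in position A to neutralize position bias.
        candidate_is_a = rng.random() < 0.5
        first, second = (
            (pair["candidate"], pair["baseline"]) if candidate_is_a
            else (pair["baseline"], pair["candidate"])
        )
        verdict = rule_based_judge(pair["query"], first, second, pair["key_facts"])
        if verdict == "tie":
            ties += 1
        elif (verdict == "A") == candidate_is_a:
            wins += 1
        else:
            losses += 1
    # Win rate counts a tie as half a win: (wins + 0.5 * ties) / total_pairs.
    win_rate = (wins + 0.5 * ties) / len(pairs)
    return win_rate, wins, ties, losses


def test_candidate_meets_promotion_threshold():
    win_rate, wins, ties, losses = evaluate_pairs(FIXTURE_PAIRS)
    print(f"Win rate: {win_rate:.2f} (wins={wins}, ties={ties}, losses={losses})")
    assert win_rate >= PROMOTION_THRESHOLD
```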
Running the test with pytest (using -s so the print is not captured) gives the following output:
Win rate: 0.90 (wins=4, ties=1, losses=0)
1 passed in 0.04s
For production pairwise evaluation, shuffle response order randomly before each judgment to neutralize position bias (judges tend to prefer the first response). Run evaluations in batches of at least 50 pairs so win-rate estimates are statistically meaningful, and use bootstrap confidence intervals to report the margin of error. When a new model version is only marginally better (e.g., a win rate of 0.52), require additional human review before promoting it rather than relying solely on the automated judge.
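As a minimal sketch of the bootstrap step, the helper below (the function name and the sample batch of 50 outcomes are invented for illustration) resamples per-pair outcomes with replacement and reports a 95% percentile interval for the win rate:

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the pairwise win rate.

    outcomes: per-pair scores, 1.0 = candidate win, 0.5 = tie, 0.0 = loss,
    so the mean of a sample is exactly (wins + 0.5 * ties) / total_pairs.
    """
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choice(outcomes) for _ in outcomes) / len(outcomes)
        for _ in range(n_resamples)
    )
    low = rates[int(n_resamples * (alpha / 2))]
    high = rates[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high


# Illustrative batch of 50 judged pairs: 28 wins, 6 ties, 16 losses.
outcomes = [1.0] * 28 + [0.5] * 6 + [0.0] * 16
low, high = bootstrap_win_rate_ci(outcomes)
print(f"win rate {sum(outcomes) / len(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Scoring a win as 1.0 and a tie as 0.5 means each resampled mean is the same win-rate statistic computed above, so the reported interval is directly comparable to the promotion threshold.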