Pairwise Preference Evaluation for ADK Agent Responses
Author: Venkata Sudhakar
Pairwise preference evaluation determines which of two ADK agent responses a judge prefers for a given input, providing a relative quality signal that is often more reliable than absolute scoring. ShopMax India uses pairwise evaluation when comparing a new agent version against the current production version: the new version must win or tie at least 60% of head-to-head comparisons on a golden dataset before it is approved for rollout to customers in Delhi and Bangalore.
A preference evaluator receives (query, response_A, response_B) and returns 'A', 'B', or 'tie'. Responses are shuffled before evaluation to avoid position bias. Win rate is computed as (wins + 0.5 * ties) / total_pairs. Tests run the evaluator against a fixture dataset with known expected preferences and assert the win rate exceeds the promotion threshold. In unit tests, the evaluator is replaced with a rule-based mock that prefers responses containing key facts.
The sketch below defines a rule-based preference evaluator, runs five pairwise comparisons from a ShopMax India fixture dataset, and asserts that the candidate response set achieves the required win rate. The queries, responses, and key facts in the fixture are illustrative, not drawn from a real dataset.
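```python
import random

# Hypothetical fixture of five (query, candidate, baseline) pairs from a
# ShopMax India golden dataset. All queries, responses, and key facts
# below are invented for illustration.
FIXTURE_PAIRS = [
    {
        "query": "What is the standard delivery time to Delhi?",
        "candidate": "Standard delivery to Delhi takes 2-3 business days.",
        "baseline": "Delivery times vary by location.",
        "key_facts": ["Delhi", "2-3 business days"],
    },
    {
        "query": "How do I return a defective phone?",
        "candidate": "Raise a return request within 10 days; pickup is free for defective items.",
        "baseline": "You can return items from the orders page.",
        "key_facts": ["10 days", "free"],
    },
    {
        "query": "Is cash on delivery available in Bangalore?",
        "candidate": "Yes, cash on delivery is available in Bangalore for orders under Rs 50,000.",
        "baseline": "Yes, cash on delivery is supported in Bangalore.",
        "key_facts": ["cash on delivery", "Bangalore"],
    },
    {
        "query": "Which payment methods are accepted?",
        "candidate": "We accept UPI, credit and debit cards, and net banking.",
        "baseline": "We accept cards.",
        "key_facts": ["UPI", "net banking"],
    },
    {
        "query": "How do I track my order?",
        "candidate": "Open My Orders and tap Track Shipment for live courier updates.",
        "baseline": "Check your email for updates.",
        "key_facts": ["My Orders", "Track Shipment"],
    },
]

PROMOTION_THRESHOLD = 0.60  # candidate must win or tie at least 60% of comparisons


def rule_based_judge(query: str, response_a: str, response_b: str,
                     key_facts: list[str]) -> str:
    """Rule-based mock judge: prefer the response containing more key facts."""
    score_a = sum(fact.lower() in response_a.lower() for fact in key_facts)
    score_b = sum(fact.lower() in response_b.lower() for fact in key_facts)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def evaluate_pairs(pairs, seed=0):
    """Judge each pair with shuffled response order; return win-rate stats."""
    rng = random.Random(seed)
    wins = ties = losses = 0
    for pair in pairs:
        # Randomize which response sits in position A to neutralize position bias.
        candidate_is_a = rng.random() < 0.5
        first, second = (
            (pair["candidate"], pair["baseline"]) if candidate_is_a
            else (pair["baseline"], pair["candidate"])
        )
        verdict = rule_based_judge(pair["query"], first, second, pair["key_facts"])
        if verdict == "tie":
            ties += 1
        elif (verdict == "A") == candidate_is_a:
            wins += 1
        else:
            losses += 1
    # Win rate counts a tie as half a win: (wins + 0.5 * ties) / total_pairs.
    win_rate = (wins + 0.5 * ties) / len(pairs)
    return win_rate, wins, ties, losses


def test_candidate_meets_promotion_threshold():
    win_rate, wins, ties, losses = evaluate_pairs(FIXTURE_PAIRS)
    print(f"Win rate: {win_rate:.2f} (wins={wins}, ties={ties}, losses={losses})")
    assert win_rate >= PROMOTION_THRESHOLD
```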
Running the test with pytest (using -s so the print is not captured) gives the following output:
Win rate: 0.90 (wins=4, ties=1, losses=0)
1 passed in 0.04s
For production pairwise evaluation, shuffle response order randomly before each judgment to neutralize position bias (judges tend to prefer the first response). Run evaluations in batches of at least 50 pairs so win-rate estimates are statistically meaningful, and use bootstrap confidence intervals to report the margin of error. When a new model version is only marginally better (e.g., a win rate of 0.52), require additional human review before promoting it rather than relying solely on the automated judge.
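As a minimal sketch of the bootstrap step, the helper below (the function name and the sample batch of 50 outcomes are invented for illustration) resamples per-pair outcomes with replacement and reports a 95% percentile interval for the win rate:

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the pairwise win rate.

    outcomes: per-pair scores, 1.0 = candidate win, 0.5 = tie, 0.0 = loss,
    so the mean of a sample is exactly (wins + 0.5 * ties) / total_pairs.
    """
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choice(outcomes) for _ in outcomes) / len(outcomes)
        for _ in range(n_resamples)
    )
    low = rates[int(n_resamples * (alpha / 2))]
    high = rates[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high


# Illustrative batch of 50 judged pairs: 28 wins, 6 ties, 16 losses.
outcomes = [1.0] * 28 + [0.5] * 6 + [0.0] * 16
low, high = bootstrap_win_rate_ci(outcomes)
print(f"win rate {sum(outcomes) / len(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Scoring a win as 1.0 and a tie as 0.5 means each resampled mean is the same win-rate statistic computed above, so the reported interval is directly comparable to the promotion threshold.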