Multi-Judge Evaluation for ADK Agent Response Quality
Author: Venkata Sudhakar
Multi-judge evaluation uses an ensemble of LLM judges to score ADK agent responses, reducing the bias and variance of any single judge. ShopMax India uses three independent judges to evaluate its customer service agent's responses for helpfulness, accuracy, and tone: a response that passes two of three judges is considered acceptable, giving a more robust quality signal than the single-judge pass/fail check used in its Mumbai and Bangalore deployments.
Each judge receives the same (input, response) pair and returns a score from 1 to 5 along with a brief rationale. The ensemble aggregates scores by majority vote or mean, and the final verdict is compared to a minimum threshold. In tests, the judges are mocked with canned scores so the aggregation logic is validated deterministically without LLM calls. The real judges are invoked only in a dedicated evaluation pipeline that runs nightly against a golden dataset.
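As a minimal sketch of the two aggregation rules, the helpers below show majority vote and mean aggregation; the function names and the default 4.0 thresholds are illustrative assumptions, not part of the ADK API.

```python
# Minimal aggregation helpers for judge scores. The function names and the
# default thresholds are illustrative assumptions, not ADK API.
from statistics import mean


def majority_vote(scores: list[float], per_judge_pass: float = 4.0) -> bool:
    """Pass if a strict majority of judges (e.g. 2 of 3) score the response
    at or above the per-judge passing score."""
    passing = sum(1 for s in scores if s >= per_judge_pass)
    return passing > len(scores) / 2


def mean_verdict(scores: list[float], threshold: float = 4.0) -> bool:
    """Pass if the mean judge score meets the minimum acceptance threshold."""
    return mean(scores) >= threshold
```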
The example below defines three mock judges with different scoring tendencies, runs them against a ShopMax India agent response, aggregates by mean score, and asserts the ensemble verdict exceeds the acceptance threshold.
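A self-contained sketch of that test follows. The judge class, sample inputs, and the 4.0 acceptance threshold are illustrative assumptions; only the standard library and pytest are used, and no ADK or LLM calls are made.

```python
# Ensemble evaluation test: three mock judges with canned scores, aggregated
# by mean and compared to an acceptance threshold. Names and values are
# illustrative assumptions, not ADK API.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Verdict:
    score: float       # 1.0 (poor) to 5.0 (excellent)
    rationale: str


class MockJudge:
    """Canned judge: returns a fixed score regardless of input, so the
    aggregation logic is exercised deterministically without LLM calls."""

    def __init__(self, name: str, canned_score: float, rationale: str):
        self.name = name
        self.canned_score = canned_score
        self.rationale = rationale

    def evaluate(self, user_input: str, agent_response: str) -> Verdict:
        return Verdict(score=self.canned_score, rationale=self.rationale)


def ensemble_mean(judges, user_input, agent_response, threshold=4.0):
    """Run every judge on the same (input, response) pair, aggregate by mean,
    and compare the ensemble score to the acceptance threshold."""
    verdicts = [j.evaluate(user_input, agent_response) for j in judges]
    avg = mean(v.score for v in verdicts)
    return avg, avg >= threshold


# Hypothetical ShopMax India interaction used as the (input, response) pair.
USER_INPUT = "Where is my order #12345?"
GOOD_RESPONSE = "Your order shipped yesterday and should arrive by Friday."
BAD_RESPONSE = "Cannot help."


def test_good_response_passes_ensemble():
    judges = [
        MockJudge("lenient", 5.0, "Clear, polite, and answers the question."),
        MockJudge("moderate", 4.5, "Accurate and helpful, slightly terse."),
        MockJudge("strict", 4.0, "Correct but could confirm the delivery date."),
    ]
    avg, passed = ensemble_mean(judges, USER_INPUT, GOOD_RESPONSE)
    print(f"Ensemble mean score: {avg:.2f}, passed: {passed}")
    assert passed


def test_bad_response_is_rejected():
    judges = [
        MockJudge("lenient", 3.0, "Too brief to be useful."),
        MockJudge("moderate", 2.0, "Does not address the question."),
        MockJudge("strict", 2.0, "Unhelpful and dismissive in tone."),
    ]
    avg, passed = ensemble_mean(judges, USER_INPUT, BAD_RESPONSE)
    print(f"Correctly rejected: score={avg:.2f}")
    assert not passed
```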
Running the tests with output capture disabled (for example, `pytest -s`) gives the following output:
Ensemble mean score: 4.50, passed: True
Correctly rejected: score=2.33
2 passed in 0.05s
In production, use three different model families as judges (e.g. Gemini, Claude, GPT-4) to avoid correlated errors where all judges share the same bias. Store all verdicts and rationales in a structured log so that disagreements between judges can be reviewed by humans to improve the scoring rubric over time. Tune acceptance thresholds separately for each quality dimension rather than using a single combined score.
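As a rough sketch of the last two recommendations, the snippet below shows per-dimension thresholds and one structured log record per judge verdict; the dimension names, threshold values, and log schema are assumptions, not a prescribed ADK format.

```python
# Illustrative per-dimension thresholds and structured verdict logging.
# Dimension names, thresholds, and the log schema are assumptions.
import json
import logging

DIMENSION_THRESHOLDS = {
    "helpfulness": 4.0,   # tuned independently per quality dimension
    "accuracy": 4.5,      # stricter: factual errors are costlier
    "tone": 3.5,
}

logger = logging.getLogger("judge_verdicts")


def log_verdict(judge_model: str, dimension: str, score: float, rationale: str) -> None:
    """Emit one structured record per judge verdict so disagreements between
    judges can be reviewed later and the scoring rubric refined."""
    logger.info(json.dumps({
        "judge_model": judge_model,     # e.g. a Gemini, Claude, or GPT judge
        "dimension": dimension,
        "score": score,
        "passed": score >= DIMENSION_THRESHOLDS[dimension],
        "rationale": rationale,
    }))
```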