Multi-Judge Evaluation for ADK Agent Response Quality
Author: Venkata Sudhakar
Multi-judge evaluation uses an ensemble of LLM judges to score ADK agent responses, reducing the bias and variance of any single judge. ShopMax India uses three independent judges to evaluate its customer service agent's responses for helpfulness, accuracy, and tone: a response that passes two of three judges is considered acceptable, giving a more robust quality signal than the single-judge pass/fail check used in its Mumbai and Bangalore deployments.
Each judge receives the same (input, response) pair and returns a score from 1 to 5 along with a brief rationale. The ensemble aggregates scores by majority vote or mean, and the final verdict is compared to a minimum threshold. In tests, the judges are mocked with canned scores so the aggregation logic is validated deterministically without LLM calls. The real judges are invoked only in a dedicated evaluation pipeline that runs nightly against a golden dataset.
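As a minimal sketch of the two aggregation rules, the helpers below show majority vote and mean aggregation; the function names and the default 4.0 thresholds are illustrative assumptions, not part of the ADK API.

```python
# Minimal aggregation helpers for judge scores. The function names and the
# default thresholds are illustrative assumptions, not ADK API.
from statistics import mean


def majority_vote(scores: list[float], per_judge_pass: float = 4.0) -> bool:
    """Pass if a strict majority of judges (e.g. 2 of 3) score the response
    at or above the per-judge passing score."""
    passing = sum(1 for s in scores if s >= per_judge_pass)
    return passing > len(scores) / 2


def mean_verdict(scores: list[float], threshold: float = 4.0) -> bool:
    """Pass if the mean judge score meets the minimum acceptance threshold."""
    return mean(scores) >= threshold
```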
The example below defines three mock judges with different scoring tendencies, runs them against a ShopMax India agent response, aggregates by mean score, and asserts the ensemble verdict exceeds the acceptance threshold.
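A self-contained sketch of that test follows. The judge class, sample inputs, and the 4.0 acceptance threshold are illustrative assumptions; only the standard library and pytest are used, and no ADK or LLM calls are made.

```python
# Ensemble evaluation test: three mock judges with canned scores, aggregated
# by mean and compared to an acceptance threshold. Names and values are
# illustrative assumptions, not ADK API.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Verdict:
    score: float       # 1.0 (poor) to 5.0 (excellent)
    rationale: str


class MockJudge:
    """Canned judge: returns a fixed score regardless of input, so the
    aggregation logic is exercised deterministically without LLM calls."""

    def __init__(self, name: str, canned_score: float, rationale: str):
        self.name = name
        self.canned_score = canned_score
        self.rationale = rationale

    def evaluate(self, user_input: str, agent_response: str) -> Verdict:
        return Verdict(score=self.canned_score, rationale=self.rationale)


def ensemble_mean(judges, user_input, agent_response, threshold=4.0):
    """Run every judge on the same (input, response) pair, aggregate by mean,
    and compare the ensemble score to the acceptance threshold."""
    verdicts = [j.evaluate(user_input, agent_response) for j in judges]
    avg = mean(v.score for v in verdicts)
    return avg, avg >= threshold


# Hypothetical ShopMax India interaction used as the (input, response) pair.
USER_INPUT = "Where is my order #12345?"
GOOD_RESPONSE = "Your order shipped yesterday and should arrive by Friday."
BAD_RESPONSE = "Cannot help."


def test_good_response_passes_ensemble():
    judges = [
        MockJudge("lenient", 5.0, "Clear, polite, and answers the question."),
        MockJudge("moderate", 4.5, "Accurate and helpful, slightly terse."),
        MockJudge("strict", 4.0, "Correct but could confirm the delivery date."),
    ]
    avg, passed = ensemble_mean(judges, USER_INPUT, GOOD_RESPONSE)
    print(f"Ensemble mean score: {avg:.2f}, passed: {passed}")
    assert passed


def test_bad_response_is_rejected():
    judges = [
        MockJudge("lenient", 3.0, "Too brief to be useful."),
        MockJudge("moderate", 2.0, "Does not address the question."),
        MockJudge("strict", 2.0, "Unhelpful and dismissive in tone."),
    ]
    avg, passed = ensemble_mean(judges, USER_INPUT, BAD_RESPONSE)
    print(f"Correctly rejected: score={avg:.2f}")
    assert not passed
```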
Running the tests with output capture disabled (for example, `pytest -s`) gives the following output:
Ensemble mean score: 4.50, passed: True
Correctly rejected: score=2.33
2 passed in 0.05s
In production, use three different model families as judges (e.g. Gemini, Claude, GPT-4) to avoid correlated errors where all judges share the same bias. Store all verdicts and rationales in a structured log so that disagreements between judges can be reviewed by humans to improve the scoring rubric over time. Tune acceptance thresholds separately for each quality dimension rather than using a single combined score.
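As a rough sketch of the last two recommendations, the snippet below shows per-dimension thresholds and one structured log record per judge verdict; the dimension names, threshold values, and log schema are assumptions, not a prescribed ADK format.

```python
# Illustrative per-dimension thresholds and structured verdict logging.
# Dimension names, thresholds, and the log schema are assumptions.
import json
import logging

DIMENSION_THRESHOLDS = {
    "helpfulness": 4.0,   # tuned independently per quality dimension
    "accuracy": 4.5,      # stricter: factual errors are costlier
    "tone": 3.5,
}

logger = logging.getLogger("judge_verdicts")


def log_verdict(judge_model: str, dimension: str, score: float, rationale: str) -> None:
    """Emit one structured record per judge verdict so disagreements between
    judges can be reviewed later and the scoring rubric refined."""
    logger.info(json.dumps({
        "judge_model": judge_model,     # e.g. a Gemini, Claude, or GPT judge
        "dimension": dimension,
        "score": score,
        "passed": score >= DIMENSION_THRESHOLDS[dimension],
        "rationale": rationale,
    }))
```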