LLM-as-Judge - Automated Response Validation for ADK Agents
Author: Venkata Sudhakar
ShopMax India's ADK agents handle hundreds of customer queries daily. Manually reviewing every agent response for correctness, completeness, and tone is not scalable. The LLM-as-Judge pattern solves this by using a second Gemini call to automatically score and validate agent responses against defined criteria, creating a fully automated quality gate.
The pattern works by sending the original user query, the agent response, and a scoring rubric to an evaluator LLM. The evaluator returns structured scores (0 to 1) for each criterion - correctness, relevance, completeness - plus an aggregate overall score. These scores can be used in CI/CD pipelines to fail builds when response quality drops below thresholds. The key is a well-designed rubric that captures what "good" looks like for your domain.
The example below shows ShopMax India using Gemini Flash as judge to evaluate their order tracking agent responses. The judge receives the query and response, then returns JSON scores for four criteria.
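A minimal sketch of such a judge (the rubric text and function names are illustrative, not part of the ADK; the model call is injected so the scoring logic can be tested without network access):

```python
import json

# Illustrative rubric prompt; tune the criteria descriptions for your domain.
JUDGE_PROMPT = """You are a strict evaluator for an order-tracking agent.
Score the agent response against the user query on four criteria,
each a number from 0.0 to 1.0: correctness, relevance, completeness, overall.
Return ONLY a JSON object with exactly those four keys.

User query: {query}
Agent response: {response}
"""

def parse_scores(raw: str) -> dict:
    """Parse the judge's JSON, tolerating markdown code fences the model may add."""
    text = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    data = json.loads(text)
    return {k: float(data[k]) for k in ("correctness", "relevance", "completeness", "overall")}

def judge(query: str, response: str, call_model) -> dict:
    """call_model(prompt) -> raw text; injected so the judge model stays swappable."""
    return parse_scores(call_model(JUDGE_PROMPT.format(query=query, response=response)))
```

With the `google-genai` SDK (assuming a configured API key), `call_model` could be wired as `lambda p: genai.Client().models.generate_content(model="gemini-2.0-flash", contents=p).text`.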
Running the judge on a sample order-tracking exchange produces output like:
LLM-as-Judge scores: {'correctness': 0.95, 'relevance': 1.0, 'completeness': 0.9, 'overall': 0.95}
In production, run the LLM judge on a sample of 10-20% of live responses rather than all traffic to control cost. Use gemini-2.0-flash as the judge model to keep latency and cost low. Store judge scores in a time-series database to detect gradual quality drift. Set hard thresholds in CI: fail the build if mean overall score drops below 0.75 on your golden test set. Avoid using the same model as both agent and judge - different model families catch different failure modes.
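The sampling and CI-gate policies above can be sketched as follows (function names and default rates are illustrative):

```python
import random

def should_judge(sample_rate: float = 0.15) -> bool:
    """Gate for live traffic: judge roughly sample_rate (10-20%) of responses."""
    return random.random() < sample_rate

def ci_gate(overall_scores: list[float], threshold: float = 0.75) -> bool:
    """CI quality gate: pass only if the mean overall score on the
    golden test set meets or exceeds the threshold."""
    return sum(overall_scores) / len(overall_scores) >= threshold
```

In CI, collect the judge's `overall` score for each golden-set query and fail the build when `ci_gate` returns `False`.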