LLM-as-Judge - Automated Response Validation for ADK Agents
Author: Venkata Sudhakar
ShopMax India's ADK agents handle hundreds of customer queries daily. Manually reviewing every agent response for correctness, completeness, and tone is not scalable. The LLM-as-Judge pattern solves this by using a second Gemini call to automatically score and validate agent responses against defined criteria, creating a fully automated quality gate.
The pattern works by sending the original user query, the agent response, and a scoring rubric to an evaluator LLM. The evaluator returns structured scores (0 to 1) for each criterion - correctness, relevance, completeness - plus an aggregate overall score. These scores can be used in CI/CD pipelines to fail builds when response quality drops below thresholds. The key is a well-designed rubric that captures what "good" looks like for your domain.
The example below shows ShopMax India using Gemini Flash as judge to evaluate their order tracking agent responses. The judge receives the query and response, then returns JSON scores for four criteria.
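A minimal sketch of such a judge (the rubric text and function names are illustrative, not part of the ADK; the model call is injected so the scoring logic can be tested without network access):

```python
import json

# Illustrative rubric prompt; tune the criteria descriptions for your domain.
JUDGE_PROMPT = """You are a strict evaluator for an order-tracking agent.
Score the agent response against the user query on four criteria,
each a number from 0.0 to 1.0: correctness, relevance, completeness, overall.
Return ONLY a JSON object with exactly those four keys.

User query: {query}
Agent response: {response}
"""

def parse_scores(raw: str) -> dict:
    """Parse the judge's JSON, tolerating markdown code fences the model may add."""
    text = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    data = json.loads(text)
    return {k: float(data[k]) for k in ("correctness", "relevance", "completeness", "overall")}

def judge(query: str, response: str, call_model) -> dict:
    """call_model(prompt) -> raw text; injected so the judge model stays swappable."""
    return parse_scores(call_model(JUDGE_PROMPT.format(query=query, response=response)))
```

With the `google-genai` SDK (assuming a configured API key), `call_model` could be wired as `lambda p: genai.Client().models.generate_content(model="gemini-2.0-flash", contents=p).text`.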
Running the judge on a sample order-tracking exchange produces output like:
LLM-as-Judge scores: {'correctness': 0.95, 'relevance': 1.0, 'completeness': 0.9, 'overall': 0.95}
In production, run the LLM judge on a sample of 10-20% of live responses rather than all traffic to control cost. Use gemini-2.0-flash as the judge model to keep latency and cost low. Store judge scores in a time-series database to detect gradual quality drift. Set hard thresholds in CI: fail the build if mean overall score drops below 0.75 on your golden test set. Avoid using the same model as both agent and judge - different model families catch different failure modes.
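The sampling and CI-gate policies above can be sketched as follows (function names and default rates are illustrative):

```python
import random

def should_judge(sample_rate: float = 0.15) -> bool:
    """Gate for live traffic: judge roughly sample_rate (10-20%) of responses."""
    return random.random() < sample_rate

def ci_gate(overall_scores: list[float], threshold: float = 0.75) -> bool:
    """CI quality gate: pass only if the mean overall score on the
    golden test set meets or exceeds the threshold."""
    return sum(overall_scores) / len(overall_scores) >= threshold
```

In CI, collect the judge's `overall` score for each golden-set query and fail the build when `ci_gate` returns `False`.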