
Testing ADK Agents with Vertex AI Evaluation Service

Author: Venkata Sudhakar

ShopMax India uses the Vertex AI Evaluation Service to score ADK agent responses at scale: hundreds of order queries are run through the agent, and a judge model rates each response for correctness, fluency, and groundedness. This replaces manual QA review before major releases and gives the team a quantitative quality trend to track across prompt versions and model upgrades, without relying on subjective human review.

The Vertex AI Evaluation Service accepts a dataset of input-output pairs, runs a set of metrics (BLEU, ROUGE, coherence, groundedness, fluency), and returns per-sample and aggregate scores. In tests, mock the evaluation client to return a fixed score dict so CI runs without Vertex API credentials. In staging, run the real evaluation on a representative dataset of 100 order queries and assert that aggregate scores meet the deployment threshold before promoting to production.

The example below defines a mock Vertex AI evaluation client for ShopMax India. It evaluates a small dataset of order responses and asserts on aggregate scores for correctness and fluency. A threshold check blocks deployment if scores drop below acceptable levels.


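A minimal sketch of such a mock follows. The class and helper names (MockEvalClient, aggregate, deployment_blocked), the per-sample scores, and the sample order queries are all illustrative assumptions for this tutorial, not the real Vertex AI SDK surface.

```python
import statistics

# Mock client standing in for the Vertex AI Evaluation Service so the
# tests run in CI without Vertex API credentials. All names here are
# illustrative, not the real Vertex AI SDK API.
class MockEvalClient:
    def evaluate(self, dataset):
        # Fixed per-sample scores; the third response is a wrong answer.
        fixed_scores = [
            {"correctness": 1.0, "fluency": 0.90},
            {"correctness": 1.0, "fluency": 0.95},
            {"correctness": 0.0, "fluency": 0.91},
        ]
        return fixed_scores[: len(dataset)]

def aggregate(results, metric):
    """Mean score for one metric across all evaluated samples."""
    return statistics.mean(r[metric] for r in results)

def deployment_blocked(results, metric="correctness", threshold=0.8):
    """Block deployment when the aggregate metric falls below threshold.

    Returns (blocked, lowest per-sample score) so the gate can log the
    worst offender alongside the decision.
    """
    agg = aggregate(results, metric)
    lowest = min(r[metric] for r in results)
    return agg < threshold, lowest

# A small ShopMax India order-query dataset (illustrative samples).
dataset = [
    {"prompt": "Where is order 1001?", "response": "Shipped, arriving Friday."},
    {"prompt": "Track order 1002.", "response": "Out for delivery today."},
    {"prompt": "Status of order 1003?", "response": "Your refund is processed."},
]

results = MockEvalClient().evaluate(dataset)
correctness = aggregate(results, "correctness")
fluency = aggregate(results, "fluency")
print(f"Correctness: {correctness:.2f} Fluency: {fluency:.2f}")

blocked, lowest = deployment_blocked(results)
print(f"Deployment blocked: {blocked} score={lowest}")

# Three pytest-style checks, matching the "3 passed" run shown below.
def test_correctness_aggregate():
    assert aggregate(results, "correctness") > 0.6

def test_fluency_aggregate():
    assert aggregate(results, "fluency") >= 0.9

def test_threshold_blocks_deployment():
    is_blocked, _ = deployment_blocked(results)
    assert is_blocked
```

In staging, the same assertions would run against the real evaluation client instead of the mock, so the deployment gate exercises identical logic in both environments.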
Running the example under pytest gives the following output:

Correctness: 0.67 Fluency: 0.92
Deployment blocked: True score=0.0
... (3 passed in 0.01s)

In production, ShopMax India should run the Vertex AI Evaluation Service on a stratified dataset covering all major intent types: order tracking, returns, EMI queries, product search, and escalations. Store evaluation results in BigQuery so quality trends are visible in a dashboard over time. Set separate thresholds per metric and per intent type (return handling may require higher correctness than general product queries), and alert the AI team when any category drops below its threshold across two consecutive deployments.
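The per-intent gating described above can be sketched as follows. The intent names, threshold values, and helper functions are assumptions for illustration; a real pipeline would read the aggregate scores from BigQuery rather than from literals.

```python
# Illustrative per-intent, per-metric thresholds; real values would be
# tuned by the ShopMax team, not taken from this sketch.
THRESHOLDS = {
    "order_tracking": {"correctness": 0.85, "fluency": 0.80},
    "returns":        {"correctness": 0.95, "fluency": 0.80},  # stricter
    "emi_queries":    {"correctness": 0.90, "fluency": 0.80},
    "product_search": {"correctness": 0.80, "fluency": 0.85},
    "escalations":    {"correctness": 0.90, "fluency": 0.85},
}

def failing_intents(scores_by_intent):
    """Return intents whose aggregate scores fall below their thresholds.

    `scores_by_intent` maps intent -> {metric: aggregate score}, e.g. as
    computed from evaluation results stored in BigQuery.
    """
    failing = []
    for intent, scores in scores_by_intent.items():
        limits = THRESHOLDS.get(intent, {})
        if any(scores.get(m, 0.0) < limit for m, limit in limits.items()):
            failing.append(intent)
    return failing

def intents_to_alert(previous_failing, current_failing):
    # Alert only when an intent fails two consecutive deployments.
    return sorted(set(previous_failing) & set(current_failing))

# Example: the returns intent misses its correctness bar twice in a row,
# while order tracking stays above threshold both times.
prev = failing_intents({
    "returns": {"correctness": 0.93, "fluency": 0.88},
    "order_tracking": {"correctness": 0.90, "fluency": 0.85},
})
curr = failing_intents({
    "returns": {"correctness": 0.94, "fluency": 0.90},
    "order_tracking": {"correctness": 0.88, "fluency": 0.84},
})
print(intents_to_alert(prev, curr))  # ['returns']
```

Keeping the thresholds in one dictionary makes each release gate a single lookup, and the two-deployment rule avoids paging the team on a one-off dip in a single evaluation run.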
