
Testing ADK Agents with Vertex AI Evaluation Service

Author: Venkata Sudhakar

ShopMax India uses the Vertex AI Evaluation Service to score ADK agent responses at scale: hundreds of order queries are run through the agent, and a judge model rates each response for correctness, fluency, and groundedness. This replaces manual QA review before major releases and gives the team a quantitative quality trend to track across prompt versions and model upgrades, without relying on subjective human review.

The Vertex AI Evaluation Service accepts a dataset of input-output pairs, runs a set of metrics (BLEU, ROUGE, coherence, groundedness, fluency), and returns per-sample and aggregate scores. In tests, mock the evaluation client to return a fixed score dict so CI runs without Vertex API credentials. In staging, run the real evaluation on a representative dataset of 100 order queries and assert that aggregate scores meet the deployment threshold before promoting to production.

The example below defines a mock Vertex AI evaluation client for ShopMax India. It evaluates a small dataset of order responses and asserts on aggregate scores for correctness and fluency. A threshold check blocks deployment if scores drop below acceptable levels.


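A minimal sketch of such a mock follows. The class and helper names (MockEvalClient, aggregate, deployment_blocked), the per-sample scores, and the sample order queries are all illustrative assumptions for this tutorial, not the real Vertex AI SDK surface.

```python
import statistics

# Mock client standing in for the Vertex AI Evaluation Service so the
# tests run in CI without Vertex API credentials. All names here are
# illustrative, not the real Vertex AI SDK API.
class MockEvalClient:
    def evaluate(self, dataset):
        # Fixed per-sample scores; the third response is a wrong answer.
        fixed_scores = [
            {"correctness": 1.0, "fluency": 0.90},
            {"correctness": 1.0, "fluency": 0.95},
            {"correctness": 0.0, "fluency": 0.91},
        ]
        return fixed_scores[: len(dataset)]

def aggregate(results, metric):
    """Mean score for one metric across all evaluated samples."""
    return statistics.mean(r[metric] for r in results)

def deployment_blocked(results, metric="correctness", threshold=0.8):
    """Block deployment when the aggregate metric falls below threshold.

    Returns (blocked, lowest per-sample score) so the gate can log the
    worst offender alongside the decision.
    """
    agg = aggregate(results, metric)
    lowest = min(r[metric] for r in results)
    return agg < threshold, lowest

# A small ShopMax India order-query dataset (illustrative samples).
dataset = [
    {"prompt": "Where is order 1001?", "response": "Shipped, arriving Friday."},
    {"prompt": "Track order 1002.", "response": "Out for delivery today."},
    {"prompt": "Status of order 1003?", "response": "Your refund is processed."},
]

results = MockEvalClient().evaluate(dataset)
correctness = aggregate(results, "correctness")
fluency = aggregate(results, "fluency")
print(f"Correctness: {correctness:.2f} Fluency: {fluency:.2f}")

blocked, lowest = deployment_blocked(results)
print(f"Deployment blocked: {blocked} score={lowest}")

# Three pytest-style checks, matching the "3 passed" run shown below.
def test_correctness_aggregate():
    assert aggregate(results, "correctness") > 0.6

def test_fluency_aggregate():
    assert aggregate(results, "fluency") >= 0.9

def test_threshold_blocks_deployment():
    is_blocked, _ = deployment_blocked(results)
    assert is_blocked
```

In staging, the same assertions would run against the real evaluation client instead of the mock, so the deployment gate exercises identical logic in both environments.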
Running the example under pytest gives the following output:

Correctness: 0.67 Fluency: 0.92
Deployment blocked: True score=0.0
... (3 passed in 0.01s)

In production, ShopMax India should run the Vertex AI Evaluation Service on a stratified dataset covering all major intent types: order tracking, returns, EMI queries, product search, and escalations. Store evaluation results in BigQuery so quality trends are visible in a dashboard over time. Set separate thresholds per metric and per intent type (return handling may require higher correctness than general product queries), and alert the AI team when any category drops below its threshold across two consecutive deployments.
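The per-intent gating described above can be sketched as follows. The intent names, threshold values, and helper functions are assumptions for illustration; a real pipeline would read the aggregate scores from BigQuery rather than from literals.

```python
# Illustrative per-intent, per-metric thresholds; real values would be
# tuned by the ShopMax team, not taken from this sketch.
THRESHOLDS = {
    "order_tracking": {"correctness": 0.85, "fluency": 0.80},
    "returns":        {"correctness": 0.95, "fluency": 0.80},  # stricter
    "emi_queries":    {"correctness": 0.90, "fluency": 0.80},
    "product_search": {"correctness": 0.80, "fluency": 0.85},
    "escalations":    {"correctness": 0.90, "fluency": 0.85},
}

def failing_intents(scores_by_intent):
    """Return intents whose aggregate scores fall below their thresholds.

    `scores_by_intent` maps intent -> {metric: aggregate score}, e.g. as
    computed from evaluation results stored in BigQuery.
    """
    failing = []
    for intent, scores in scores_by_intent.items():
        limits = THRESHOLDS.get(intent, {})
        if any(scores.get(m, 0.0) < limit for m, limit in limits.items()):
            failing.append(intent)
    return failing

def intents_to_alert(previous_failing, current_failing):
    # Alert only when an intent fails two consecutive deployments.
    return sorted(set(previous_failing) & set(current_failing))

# Example: the returns intent misses its correctness bar twice in a row,
# while order tracking stays above threshold both times.
prev = failing_intents({
    "returns": {"correctness": 0.93, "fluency": 0.88},
    "order_tracking": {"correctness": 0.90, "fluency": 0.85},
})
curr = failing_intents({
    "returns": {"correctness": 0.94, "fluency": 0.90},
    "order_tracking": {"correctness": 0.88, "fluency": 0.84},
})
print(intents_to_alert(prev, curr))  # ['returns']
```

Keeping the thresholds in one dictionary makes each release gate a single lookup, and the two-deployment rule avoids paging the team on a one-off dip in a single evaluation run.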
