
Evaluating ADK Agent Response Quality with Metrics

Author: Venkata Sudhakar

Beyond pass/fail assertions, ShopMax India needs to measure how good their ADK agent responses actually are. A pricing agent might technically mention the product name but miss the price, or a support agent might give technically correct information in a tone that alienates customers. Metric-based evaluation scores each response on keyword coverage and token overlap against a reference answer, catching quality regressions that boolean assertions miss.

Two lightweight metrics work well without external model dependencies. Token overlap score measures what fraction of the reference answer's words appear in the agent response, catching completeness issues. Keyword coverage checks that every term in a predefined list of critical fields - order ID, price, city - appears in the reply. Both metrics run in milliseconds, return a float between 0 and 1, and plug directly into pytest assertions with a configurable threshold per test case.
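A minimal sketch of the two metrics (the function and helper names here are illustrative, not part of any ADK API):

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercase and keep alphanumeric runs so punctuation does not block matches
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(response: str, reference: str) -> float:
    """Fraction of reference-answer tokens that appear in the response."""
    ref = tokenize(reference)
    return len(ref & tokenize(response)) / len(ref) if ref else 1.0


def keyword_coverage(response: str, keywords: list[str]) -> float:
    """Fraction of critical keywords present in the response."""
    hits = sum(kw.lower() in response.lower() for kw in keywords)
    return hits / len(keywords) if keywords else 1.0


print(token_overlap("Order ORD-4521 ships Friday",
                    "Order ORD-4521 arrives Friday"))               # → 0.8
print(keyword_coverage("Order ORD-4521 ships Friday",
                       ["ORD-4521", "Friday"]))                     # → 1.0
```

Regex tokenization matters here: splitting on whitespace alone would leave punctuation attached to words ("Hyderabad," vs "Hyderabad") and silently depress the overlap score.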

The example below defines both metrics and applies them to two ShopMax India evaluation cases - an order tracking reply and a pricing reply - with per-case thresholds and clear failure messages.
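One way such a suite could look - the case data, thresholds, and helper names are illustrative, and in a real suite the responses would come from the ADK runner rather than being hardcoded:

```python
# tests/test_eval.py
import re

import pytest


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(response: str, reference: str) -> float:
    ref = tokenize(reference)
    return len(ref & tokenize(response)) / len(ref) if ref else 1.0


def keyword_coverage(response: str, keywords: list[str]) -> float:
    hits = sum(kw.lower() in response.lower() for kw in keywords)
    return hits / len(keywords) if keywords else 1.0


EVAL_CASES = [
    pytest.param(
        # agent response (captured from the agent in a real suite)
        "Your order ORD-4521 shipped from Hyderabad and arrives on Friday.",
        # reference answer written by the product team
        "Order ORD-4521 shipped from Hyderabad, arriving Friday.",
        ["ORD-4521", "Hyderabad", "Friday"],
        0.6,  # min token overlap
        1.0,  # min keyword coverage: every critical term required
        id="order_tracking",
    ),
    pytest.param(
        "The Galaxy M34 is priced at Rs. 18,999 with free delivery across India.",
        "Galaxy M34 costs Rs. 18,999 with free delivery.",
        ["Galaxy M34", "18,999"],
        0.6,
        1.0,
        id="product_pricing",
    ),
]

PARAMS = "response, reference, keywords, min_overlap, min_coverage"


@pytest.mark.parametrize(PARAMS, EVAL_CASES)
def test_token_overlap(response, reference, keywords, min_overlap, min_coverage):
    score = token_overlap(response, reference)
    assert score >= min_overlap, (
        f"Token overlap {score:.2f} below threshold {min_overlap}"
    )


@pytest.mark.parametrize(PARAMS, EVAL_CASES)
def test_keyword_coverage(response, reference, keywords, min_overlap, min_coverage):
    score = keyword_coverage(response, keywords)
    missing = [kw for kw in keywords if kw.lower() not in response.lower()]
    assert score >= min_coverage, (
        f"Coverage {score:.2f} below {min_coverage}; missing keywords: {missing}"
    )
```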


It gives the following output:

tests/test_eval.py::test_token_overlap[order_tracking] PASSED
tests/test_eval.py::test_token_overlap[product_pricing] PASSED
tests/test_eval.py::test_keyword_coverage[order_tracking] PASSED
tests/test_eval.py::test_keyword_coverage[product_pricing] PASSED

4 passed in 0.05s

In production, collect real agent responses from a staging run rather than hardcoding them in the eval set - this tests the actual live model output against reference answers written by the ShopMax product team. Set min_coverage to 1.0 only for safety-critical fields like price and order ID; use 0.7-0.8 for softer fields like city names. Log scores to a time-series database so you can track quality trends across model upgrades and prompt iterations over time.
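A sketch of score logging - here a JSONL file stands in for the time-series database, and the field names and model version string are illustrative:

```python
import json
import time


def log_score(path: str, case_id: str, metric: str,
              score: float, model_version: str) -> None:
    """Append one eval score as a JSON line (one record per case per run)."""
    record = {
        "ts": time.time(),           # run timestamp for trend charts
        "case_id": case_id,          # e.g. "order_tracking"
        "metric": metric,            # "token_overlap" or "keyword_coverage"
        "score": round(score, 4),
        "model_version": model_version,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


log_score("eval_scores.jsonl", "order_tracking",
          "token_overlap", 0.875, "model-v1")
```

With one record per case per run, a prompt change that drops order-tracking coverage from 1.0 to 0.67 shows up as a step in the trend line rather than a silent regression.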
