Evaluating ADK Agent Response Quality with Metrics
Author: Venkata Sudhakar
Beyond pass/fail assertions, ShopMax India needs to measure how good their ADK agent responses actually are. A pricing agent might technically mention the product name but miss the price, or a support agent might give technically correct information in a tone that alienates customers. Metric-based evaluation scores each response on keyword coverage and token overlap against a reference answer, catching quality regressions that boolean assertions miss.
Two lightweight metrics work well without external model dependencies. Token overlap score measures what fraction of reference answer words appear in the agent response, catching completeness issues. Keyword coverage checks whether a predefined list of critical terms - order ID, price, city - all appear in the reply. Both metrics run in milliseconds, return a float between 0 and 1, and plug directly into pytest assertions with a configurable threshold per test case.
The example below defines both metrics and applies them to two ShopMax India evaluation cases - an order tracking reply and a pricing reply - with per-case thresholds and clear failure messages.
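A minimal sketch of such an eval file (here `tests/test_eval.py`); the agent responses, reference answers, keyword lists, and thresholds are illustrative placeholders, not actual ShopMax data:

```python
import re

import pytest


def token_overlap(response: str, reference: str) -> float:
    """Fraction of reference-answer tokens that appear in the response."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9-]+", text.lower()))
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        return 1.0
    return len(ref_tokens & tokenize(response)) / len(ref_tokens)


def keyword_coverage(response: str, keywords: list[str]) -> float:
    """Fraction of critical keywords present in the response."""
    if not keywords:
        return 1.0
    text = response.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


# Each case: (agent response, reference answer, critical keywords,
#             min token overlap, min keyword coverage)
CASES = [
    pytest.param(
        "Your order ORD-4521 shipped from Mumbai and arrives by Friday.",
        "Order ORD-4521 shipped from Mumbai, arriving Friday.",
        ["ORD-4521", "Mumbai"],
        0.6, 1.0,
        id="order_tracking",
    ),
    pytest.param(
        "The Aura X2 headphones cost Rs 2,499 with free delivery.",
        "Aura X2 headphones are priced at Rs 2,499, free delivery included.",
        ["Aura X2", "2,499"],
        0.6, 1.0,
        id="product_pricing",
    ),
]


@pytest.mark.parametrize(
    "response,reference,keywords,min_overlap,min_coverage", CASES
)
def test_token_overlap(response, reference, keywords, min_overlap, min_coverage):
    score = token_overlap(response, reference)
    assert score >= min_overlap, (
        f"token overlap {score:.2f} below threshold {min_overlap}"
    )


@pytest.mark.parametrize(
    "response,reference,keywords,min_overlap,min_coverage", CASES
)
def test_keyword_coverage(response, reference, keywords, min_overlap, min_coverage):
    score = keyword_coverage(response, keywords)
    assert score >= min_coverage, (
        f"keyword coverage {score:.2f} below threshold {min_coverage}"
    )
```

Note the tokenizer strips punctuation before comparing, so "Mumbai," in the reference still matches "Mumbai" in the response; keyword coverage uses substring matching so multi-word keywords like "Aura X2" work unchanged.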
Running the suite with pytest produces the following output:
tests/test_eval.py::test_token_overlap[order_tracking] PASSED
tests/test_eval.py::test_token_overlap[product_pricing] PASSED
tests/test_eval.py::test_keyword_coverage[order_tracking] PASSED
tests/test_eval.py::test_keyword_coverage[product_pricing] PASSED
4 passed in 0.05s
In production, collect real agent responses from a staging run rather than hardcoding them in the eval set; this tests actual live model output against reference answers written by the ShopMax product team. Set min_coverage to 1.0 only for safety-critical fields like price and order ID; use 0.7-0.8 for softer fields like city names. Log scores to a time-series database so you can track quality trends across model upgrades and prompt iterations.
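As a sketch of that logging step, the record below is one shape a metrics point could take; the field names, threshold values, and JSON-lines sink are assumptions, not a prescribed schema:

```python
import json
import time

# Assumed per-field coverage thresholds: strict for safety-critical
# fields, looser for softer ones (values are illustrative).
FIELD_MIN_COVERAGE = {
    "price": 1.0,
    "order_id": 1.0,
    "city": 0.8,
}


def score_record(case_id: str, token_overlap: float, keyword_coverage: float) -> dict:
    """Build one metrics point suitable for a time-series database."""
    return {
        "ts": time.time(),
        "case_id": case_id,
        "token_overlap": token_overlap,
        "keyword_coverage": keyword_coverage,
    }


# In practice you would ship this to your metrics pipeline; printing
# one JSON line per record is the simplest placeholder sink.
print(json.dumps(score_record("order_tracking", 0.86, 1.0)))
```

Keeping the record flat (one row per test case per run) makes it easy to chart each metric against deployment dates later.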