
Evaluating ADK Agent Response Quality with Metrics

Author: Venkata Sudhakar

Beyond pass/fail assertions, ShopMax India needs to measure how good their ADK agent responses actually are. A pricing agent might technically mention the product name but miss the price, or a support agent might give technically correct information in a tone that alienates customers. Metric-based evaluation scores each response on keyword coverage and token overlap against a reference answer, catching quality regressions that boolean assertions miss.

Two lightweight metrics work well without external model dependencies. Token overlap score measures what fraction of the reference answer's words appear in the agent response, catching completeness issues. Keyword coverage checks that every term in a predefined list of critical fields - order ID, price, city - appears in the reply. Both metrics run in milliseconds, return a float between 0 and 1, and plug directly into pytest assertions with a configurable threshold per test case.
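A minimal sketch of the two metrics (the function and helper names here are illustrative, not part of any ADK API):

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercase and keep alphanumeric runs so punctuation does not block matches
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(response: str, reference: str) -> float:
    """Fraction of reference-answer tokens that appear in the response."""
    ref = tokenize(reference)
    return len(ref & tokenize(response)) / len(ref) if ref else 1.0


def keyword_coverage(response: str, keywords: list[str]) -> float:
    """Fraction of critical keywords present in the response."""
    hits = sum(kw.lower() in response.lower() for kw in keywords)
    return hits / len(keywords) if keywords else 1.0


print(token_overlap("Order ORD-4521 ships Friday",
                    "Order ORD-4521 arrives Friday"))               # → 0.8
print(keyword_coverage("Order ORD-4521 ships Friday",
                       ["ORD-4521", "Friday"]))                     # → 1.0
```

Regex tokenization matters here: splitting on whitespace alone would leave punctuation attached to words ("Hyderabad," vs "Hyderabad") and silently depress the overlap score.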

The example below defines both metrics and applies them to two ShopMax India evaluation cases - an order tracking reply and a pricing reply - with per-case thresholds and clear failure messages.
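One way such a suite could look - the case data, thresholds, and helper names are illustrative, and in a real suite the responses would come from the ADK runner rather than being hardcoded:

```python
# tests/test_eval.py
import re

import pytest


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(response: str, reference: str) -> float:
    ref = tokenize(reference)
    return len(ref & tokenize(response)) / len(ref) if ref else 1.0


def keyword_coverage(response: str, keywords: list[str]) -> float:
    hits = sum(kw.lower() in response.lower() for kw in keywords)
    return hits / len(keywords) if keywords else 1.0


EVAL_CASES = [
    pytest.param(
        # agent response (captured from the agent in a real suite)
        "Your order ORD-4521 shipped from Hyderabad and arrives on Friday.",
        # reference answer written by the product team
        "Order ORD-4521 shipped from Hyderabad, arriving Friday.",
        ["ORD-4521", "Hyderabad", "Friday"],
        0.6,  # min token overlap
        1.0,  # min keyword coverage: every critical term required
        id="order_tracking",
    ),
    pytest.param(
        "The Galaxy M34 is priced at Rs. 18,999 with free delivery across India.",
        "Galaxy M34 costs Rs. 18,999 with free delivery.",
        ["Galaxy M34", "18,999"],
        0.6,
        1.0,
        id="product_pricing",
    ),
]

PARAMS = "response, reference, keywords, min_overlap, min_coverage"


@pytest.mark.parametrize(PARAMS, EVAL_CASES)
def test_token_overlap(response, reference, keywords, min_overlap, min_coverage):
    score = token_overlap(response, reference)
    assert score >= min_overlap, (
        f"Token overlap {score:.2f} below threshold {min_overlap}"
    )


@pytest.mark.parametrize(PARAMS, EVAL_CASES)
def test_keyword_coverage(response, reference, keywords, min_overlap, min_coverage):
    score = keyword_coverage(response, keywords)
    missing = [kw for kw in keywords if kw.lower() not in response.lower()]
    assert score >= min_coverage, (
        f"Coverage {score:.2f} below {min_coverage}; missing keywords: {missing}"
    )
```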


It gives the following output:

tests/test_eval.py::test_token_overlap[order_tracking] PASSED
tests/test_eval.py::test_token_overlap[product_pricing] PASSED
tests/test_eval.py::test_keyword_coverage[order_tracking] PASSED
tests/test_eval.py::test_keyword_coverage[product_pricing] PASSED

4 passed in 0.05s

In production, collect real agent responses from a staging run rather than hardcoding them in the eval set - this tests the actual live model output against reference answers written by the ShopMax product team. Set min_coverage to 1.0 only for safety-critical fields like price and order ID; use 0.7-0.8 for softer fields like city names. Log scores to a time-series database so you can track quality trends across model upgrades and prompt iterations over time.
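A sketch of score logging - here a JSONL file stands in for the time-series database, and the field names and model version string are illustrative:

```python
import json
import time


def log_score(path: str, case_id: str, metric: str,
              score: float, model_version: str) -> None:
    """Append one eval score as a JSON line (one record per case per run)."""
    record = {
        "ts": time.time(),           # run timestamp for trend charts
        "case_id": case_id,          # e.g. "order_tracking"
        "metric": metric,            # "token_overlap" or "keyword_coverage"
        "score": round(score, 4),
        "model_version": model_version,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


log_score("eval_scores.jsonl", "order_tracking",
          "token_overlap", 0.875, "model-v1")
```

With one record per case per run, a prompt change that drops order-tracking coverage from 1.0 to 0.67 shows up as a step in the trend line rather than a silent regression.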
