Reference-Free Evaluation of ADK Agent Responses
Author: Venkata Sudhakar
Reference-free evaluation scores ADK agent responses without a human-written gold answer, which makes it practical for production scenarios where ground truth is unavailable or expensive to collect. ShopMax India uses it to score its search and recommendation agent responses daily across thousands of live queries from customers in Mumbai, Bangalore, and Hyderabad, for which no pre-written ideal answer exists.
Reference-free metrics evaluate properties that can be assessed from the query and the response alone: fluency (is the text grammatically coherent?), completeness (are the required elements present?), relevance (does the response address the query topic?), and safety (is it free of forbidden content?). Each metric is a scoring function that returns a value from 0.0 to 1.0. A composite score aggregates the metrics and is compared against a minimum threshold to gate deployment; the sample output in this article is consistent with an unweighted mean of the four scores.
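As a rough illustration of how four such scoring functions and a thresholded composite fit together, here is a minimal sketch. The heuristics, the names (`REQUIRED_PATTERNS`, `FORBIDDEN_TERMS`, `evaluate`), the sample query and response, and the 0.7 threshold are all assumptions made for illustration. They are not ShopMax India's production metrics and not part of any ADK API, so the scores this sketch produces will differ from the ones reported in this article.

```python
import re
from statistics import mean

# Everything below is an illustrative assumption, not a production rule set.
REQUIRED_PATTERNS = {
    "price": r"INR\s?[\d,]+",                  # e.g. "INR 54,999"
    "availability": r"in stock|out of stock",
}
FORBIDDEN_TERMS = ("guaranteed lowest price", "counterfeit")
STOPWORDS = {"a", "an", "the", "in", "of", "for", "is", "at", "to"}
MIN_COMPOSITE = 0.7                            # hypothetical deployment gate


def _tokens(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def fluency(response):
    """Share of sentences that are capitalised, punctuated, and 3+ words long."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0
    good = [s for s in sentences
            if s[0].isupper() and s[-1] in ".!?" and len(s.split()) >= 3]
    return len(good) / len(sentences)


def completeness(response):
    """Share of required elements (price, availability) found in the response."""
    hits = [bool(re.search(pattern, response, re.IGNORECASE))
            for pattern in REQUIRED_PATTERNS.values()]
    return sum(hits) / len(hits)


def relevance(query, response):
    """Share of the query's content words that reappear in the response."""
    query_tokens = _tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & _tokens(response)) / len(query_tokens)


def safety(response):
    """1.0 when no forbidden term appears, otherwise 0.0."""
    lowered = response.lower()
    return 0.0 if any(term in lowered for term in FORBIDDEN_TERMS) else 1.0


def evaluate(query, response):
    """Score one response on all four metrics plus an unweighted composite."""
    scores = {
        "fluency": fluency(response),
        "completeness": completeness(response),
        "relevance": relevance(query, response),
        "safety": safety(response),
    }
    scores["composite"] = mean(list(scores.values()))
    return scores


# Made-up sample data; no gold reference answer is involved anywhere.
query = "Samsung 55 inch TV price in Mumbai"
good = evaluate(query, "The Samsung 55 inch TV costs INR 54,999 in Mumbai. "
                       "It is in stock at the Andheri store.")
bad = evaluate(query, "ok")

print("Scores:", good, "passes gate:", good["composite"] >= MIN_COMPOSITE)
print("Incomplete scores:", bad, "passes gate:", bad["composite"] >= MIN_COMPOSITE)
```

Simple string and regex heuristics like these are cheap enough to run on every live query, but they are only proxies. Note that a terse but safe response still earns a full safety score, which keeps its unweighted composite above zero even when every other metric fails.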
The example below defines four reference-free metrics for a ShopMax India agent response, computes a composite score, and asserts it meets the production threshold without any gold reference answer.
It gives the following output:
Scores: {'fluency': 0.8, 'completeness': 1.0, 'relevance': 0.5, 'safety': 1.0, 'composite': 0.825}
Incomplete scores: {'fluency': 0.267, 'completeness': 0.0, 'relevance': 0.25, 'safety': 1.0, 'composite': 0.379}
2 passed in 0.04s
Reference-free evaluation is most valuable when combined with a small human-reviewed sample that validates whether the automated metrics correlate with actual quality. Tune metric weights to the agent's domain: for a pricing agent, completeness (price present) and safety (no harmful content) should carry higher weights than fluency. Log composite scores to a time-series database and alert when the rolling 24-hour average drops more than 5% below the previous week's baseline.
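The weighting and alerting ideas above could be sketched roughly as follows. The weight values, the function names (`weighted_composite`, `should_alert`), and the reading of "5%" as a relative drop against the baseline are assumptions for illustration. The time-series database is left out, so the scores are passed in as plain lists.

```python
from statistics import mean

# Hypothetical weights for a pricing agent: completeness and safety dominate.
PRICING_WEIGHTS = {
    "fluency": 0.1,
    "completeness": 0.4,
    "relevance": 0.2,
    "safety": 0.3,
}


def weighted_composite(scores, weights):
    """Weighted average of per-metric scores, normalised by the total weight."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def should_alert(last_24h_composites, weekly_baseline, max_relative_drop=0.05):
    """True when the 24-hour mean falls more than max_relative_drop below baseline."""
    if not last_24h_composites or weekly_baseline <= 0:
        return False  # nothing to compare against
    current = mean(last_24h_composites)
    return (weekly_baseline - current) / weekly_baseline > max_relative_drop


# Per-metric scores taken from the first line of the sample output above.
scores = {"fluency": 0.8, "completeness": 1.0, "relevance": 0.5, "safety": 1.0}
print("Weighted composite:", round(weighted_composite(scores, PRICING_WEIGHTS), 3))

# Made-up rolling windows compared against a made-up weekly baseline of 0.82.
print("Alert (mean 0.76):", should_alert([0.75, 0.77], 0.82))
print("Alert (mean 0.805):", should_alert([0.80, 0.81], 0.82))
```

With these weights the low relevance score matters less than it does in an unweighted mean, which is the intended effect for a pricing agent. Whether a 5% threshold is relative or measured in absolute score points is a choice to settle against the human-reviewed sample.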