
Reference-Free Evaluation of ADK Agent Responses

Author: Venkata Sudhakar

Reference-free evaluation scores ADK agent responses without needing a human-written gold answer, making it practical for production scenarios where ground truth is unavailable or expensive to collect. ShopMax India uses reference-free evaluation to score its search and recommendation agent responses daily across thousands of live queries from customers in Mumbai, Bangalore, and Hyderabad, queries for which no pre-written ideal answer exists.

Reference-free metrics evaluate properties that can be assessed from the response alone: fluency (is the text grammatically coherent), completeness (are key required elements present), relevance (does the response address the query topic), and safety (does it contain forbidden content). Each metric is a scoring function returning 0.0 to 1.0. A composite score aggregates all metrics and is compared against a minimum threshold to gate deployment.

The example below defines four reference-free metrics for a ShopMax India agent response, computes a composite score, and asserts it meets the production threshold without any gold reference answer.
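
A minimal sketch of such an example follows. The heuristics here are assumptions chosen to be consistent with the scores reported in this article: fluency as word count capped at 15 words, completeness as the fraction of required elements (price, availability) found, relevance as the fraction of query words echoed in the response, safety as the absence of forbidden phrases, and an unweighted mean as the composite. The 0.7 threshold, the sample texts, and all names are hypothetical; none of this is an ADK API.

```python
import re

THRESHOLD = 0.7  # assumed production gate, not a value from ADK
REQUIRED = {"price": r"rs\.?\s*\d", "availability": r"in stock|available"}
FORBIDDEN = ("competitor", "guaranteed lowest")  # illustrative phrases only

QUERY = "Samsung 4K TV price"
GOOD = "The Samsung Crystal TV costs Rs 42990 and is in stock today."
BAD = "We have Samsung models."


def words(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def fluency(response):
    # Crude proxy: very short replies score low, capped at 15 words.
    return min(len(words(response)) / 15, 1.0)


def completeness(response):
    # Fraction of required elements detected in the response.
    text = response.lower()
    hits = sum(1 for pattern in REQUIRED.values() if re.search(pattern, text))
    return hits / len(REQUIRED)


def relevance(query, response):
    # Fraction of distinct query words that also appear in the response.
    query_words = set(words(query))
    if not query_words:
        return 0.0
    return len(query_words & set(words(response))) / len(query_words)


def safety(response):
    # 1.0 when no forbidden phrase occurs, otherwise 0.0.
    text = response.lower()
    return 0.0 if any(term in text for term in FORBIDDEN) else 1.0


def evaluate(query, response):
    scores = {
        "fluency": fluency(response),
        "completeness": completeness(response),
        "relevance": relevance(query, response),
        "safety": safety(response),
    }
    # Unweighted mean of the four metrics, computed before the key is added.
    scores["composite"] = sum(scores.values()) / len(scores)
    return {name: round(value, 3) for name, value in scores.items()}


def test_complete_response_meets_threshold():
    scores = evaluate(QUERY, GOOD)
    print("Scores:", scores)
    assert scores["composite"] >= THRESHOLD


def test_incomplete_response_is_rejected():
    scores = evaluate(QUERY, BAD)
    print("Incomplete scores:", scores)
    assert scores["composite"] < THRESHOLD
```

Running the file with `pytest -s` shows the printed score dictionaries alongside the pass summary. No gold answer appears anywhere: every metric reads only the query and the response.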


It gives the following output:

Scores: {'fluency': 0.8, 'completeness': 1.0, 'relevance': 0.5, 'safety': 1.0, 'composite': 0.825}
Incomplete scores: {'fluency': 0.267, 'completeness': 0.0, 'relevance': 0.25, 'safety': 1.0, 'composite': 0.379}
2 passed in 0.04s

Reference-free evaluation is most valuable when paired with a small human-reviewed sample that confirms the automated metrics correlate with actual quality. Tune metric weights to the agent's domain: for a pricing agent, completeness (price present) and safety (no harmful content) should carry higher weights than fluency. Log composite scores to a time-series database and alert when the rolling 24-hour average drops more than 5% from the previous week's baseline.
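
The weighting and alerting advice above can be sketched as follows. This is an illustration under assumptions: the weight values are invented for a pricing agent, the 5% drop is read as relative to the baseline, and the function names are hypothetical. Storage in a time-series database is left out; the functions take plain Python lists.

```python
from statistics import mean

# Hypothetical weights for a pricing agent: completeness and safety dominate.
PRICING_WEIGHTS = {"fluency": 1, "completeness": 3, "relevance": 2, "safety": 3}


def weighted_composite(scores, weights):
    """Weighted mean of per-metric scores, each in the range 0.0 to 1.0."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def should_alert(last_24h_scores, weekly_baseline, max_drop=0.05):
    """True when the 24-hour mean is more than max_drop (relative) below baseline."""
    if not last_24h_scores or weekly_baseline <= 0:
        return False  # nothing to compare against
    return mean(last_24h_scores) < weekly_baseline * (1 - max_drop)


if __name__ == "__main__":
    scores = {"fluency": 0.8, "completeness": 1.0, "relevance": 0.5, "safety": 1.0}
    print(round(weighted_composite(scores, PRICING_WEIGHTS), 3))
    print(should_alert([0.70, 0.72], weekly_baseline=0.80))
```

With these weights a response that omits the price loses a third of its composite score, whereas under equal weights it would lose only a quarter.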


 
  


  