Reference-Free Evaluation of ADK Agent Responses

Author: Venkata Sudhakar

Reference-free evaluation scores ADK agent responses without needing a human-written gold answer, making it practical for production scenarios where ground truth is unavailable or expensive to collect. ShopMax India uses reference-free evaluation to score its search and recommendation agent responses daily across thousands of live queries from customers in Mumbai, Bangalore, and Hyderabad, queries for which no pre-written ideal answer exists.

Reference-free metrics evaluate properties that can be assessed from the response alone: fluency (is the text grammatically coherent), completeness (are key required elements present), relevance (does the response address the query topic), and safety (does it contain forbidden content). Each metric is a scoring function returning 0.0 to 1.0. A composite score aggregates all metrics and is compared against a minimum threshold to gate deployment.

The example below defines four reference-free metrics for a ShopMax India agent response, computes a composite score, and asserts it meets the production threshold without any gold reference answer.

import re
from typing import Dict, List

MIN_COMPOSITE_SCORE = 0.75

def tokenize(text: str) -> set:
    # Lowercase and keep only word characters so "Mumbai." matches "mumbai".
    return set(re.findall(r"\w+", text.lower()))

def score_fluency(response: str) -> float:
    # Crude proxy: average words per sentence, saturating at 15 words.
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    if not sentences:
        return 0.0
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    return min(1.0, avg_len / 15.0)

def score_completeness(response: str, required: List[str]) -> float:
    # Fraction of required keywords present as whole words, so that "Rs"
    # is not matched inside a word such as "offers".
    text = response.lower()
    hits = sum(1 for kw in required
               if re.search(rf"\b{re.escape(kw.lower())}\b", text))
    return hits / len(required) if required else 1.0

def score_relevance(query: str, response: str) -> float:
    # Fraction of query words that also appear in the response.
    query_words = tokenize(query)
    resp_words = tokenize(response)
    overlap = query_words & resp_words
    return len(overlap) / len(query_words) if query_words else 0.0

def score_safety(response: str) -> float:
    # Substring match is deliberately broad: "hack" also flags "hacking".
    forbidden = ["fraud", "illegal", "hack", "steal"]
    return 0.0 if any(w in response.lower() for w in forbidden) else 1.0

def evaluate_response(query: str, response: str, required_fields: List[str]) -> Dict[str, float]:
    scores = {
        "fluency":      score_fluency(response),
        "completeness": score_completeness(response, required_fields),
        "relevance":    score_relevance(query, response),
        "safety":       score_safety(response),
    }
    # Unweighted mean of the four metrics, computed before the key is added.
    scores["composite"] = sum(scores.values()) / len(scores)
    return scores

def rounded(scores: Dict[str, float]) -> Dict[str, float]:
    # Round for display only; assertions use the unrounded values.
    return {name: round(value, 2) for name, value in scores.items()}

def test_reference_free_evaluation_passes():
    query = "Samsung TV price Mumbai"
    response = "The Samsung 55-inch 4K TV is priced at Rs 62000 and is available in Mumbai."
    required = ["Samsung", "Rs", "Mumbai"]
    scores = evaluate_response(query, response, required)
    print(f"Scores: {rounded(scores)}")
    assert scores["composite"] >= MIN_COMPOSITE_SCORE
    assert scores["safety"] == 1.0

def test_incomplete_response_fails():
    query = "Samsung TV price Mumbai"
    response = "We have TVs available."
    required = ["Samsung", "Rs", "Mumbai"]
    scores = evaluate_response(query, response, required)
    print(f"Incomplete scores: {rounded(scores)}")
    assert scores["completeness"] < 0.5

Running the tests with pytest -s, so that the printed scores are shown, gives the following output (the timing will vary),

Scores: {'fluency': 1.0, 'completeness': 1.0, 'relevance': 0.75, 'safety': 1.0, 'composite': 0.94}
Incomplete scores: {'fluency': 0.27, 'completeness': 0.0, 'relevance': 0.0, 'safety': 1.0, 'composite': 0.32}
2 passed in 0.04s

Reference-free evaluation is most valuable when combined with a small human-reviewed sample to validate that the automated metrics correlate with actual quality. Tune metric weights based on the agent's domain - for a pricing agent, completeness (price present) and safety (no harmful content) should have higher weights than fluency. Log composite scores to a time-series database and alert when the rolling 24-hour average drops more than 5% from the previous week's baseline.
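The weighting and alerting ideas above can be sketched as follows. This is a minimal illustration, not a production design: the weight values in PRICING_WEIGHTS, the function names weighted_composite and should_alert, and the reading of "5%" as a relative drop are all assumptions made for the example, and the scores are held in plain lists rather than read from a time-series database.

```python
from statistics import mean
from typing import Dict, Sequence

# Hypothetical weights for a pricing agent: completeness and safety
# count for more than fluency. Tune these against human-reviewed samples.
PRICING_WEIGHTS: Dict[str, float] = {
    "fluency": 0.1,
    "completeness": 0.4,
    "relevance": 0.2,
    "safety": 0.3,
}

def weighted_composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the metric scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight

def should_alert(last_24h: Sequence[float], baseline: float, max_drop: float = 0.05) -> bool:
    """True when the 24-hour mean is more than max_drop (relative) below baseline."""
    if not last_24h or baseline <= 0:
        return False  # nothing to compare against
    return mean(last_24h) < baseline * (1.0 - max_drop)

if __name__ == "__main__":
    scores = {"fluency": 1.0, "completeness": 1.0, "relevance": 0.5, "safety": 1.0}
    print("weighted composite:", round(weighted_composite(scores, PRICING_WEIGHTS), 2))

    # Assumed inputs: composite scores from the last 24 hours and the
    # previous week's average composite score.
    print("alert:", should_alert([0.80, 0.78, 0.79], baseline=0.85))  # True
    print("alert:", should_alert([0.84, 0.83], baseline=0.85))        # False
```

Dividing by the total weight keeps the composite in the 0.0 to 1.0 range even if the weights are later retuned and no longer sum to 1, so the same minimum threshold can still be applied.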
