Semantic Similarity Testing for ADK Agent Responses
Author: Venkata Sudhakar
Exact string matching fails for natural language agent responses: two responses can mean the same thing yet fail a string equality check. ShopMax India instead uses semantic similarity testing, comparing agent responses against golden reference answers via cosine similarity between sentence embeddings. This makes tests robust to paraphrasing while still catching factual errors.
The approach embeds both the agent response and a reference answer using a sentence embedding model, then computes cosine similarity between the two vectors. A similarity above a threshold (typically 0.85 for factual responses) indicates a semantically equivalent answer. This works well for order status, product information, and policy queries where meaning matters more than exact wording. Lower thresholds (0.7) suit open-ended questions where multiple valid phrasings exist.
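The core check fits in a few lines. A minimal sketch, assuming sentence-transformers is installed; the example texts are hypothetical and real scores depend on the model and inputs:

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model; all-MiniLM-L6-v2 balances speed and quality (see notes below).
_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(response: str, reference: str) -> float:
    """Cosine similarity between the sentence embeddings of two texts."""
    embeddings = _model.encode([response, reference], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Hypothetical texts for illustration.
score = semantic_similarity(
    "Your order was dispatched yesterday and should arrive by Friday.",
    "The order shipped yesterday; delivery is expected by Friday.",
)
print(f"similarity: {score:.3f}")
```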
The example below shows ShopMax India embedding responses with sentence-transformers and running similarity checks as pytest assertions. The test validates that the agent response is semantically close to a golden answer without requiring identical text.
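A sketch of what such a pytest suite could look like, reusing the semantic_similarity helper above. The case IDs match the output below, but the agent responses and golden answers are hypothetical stand-ins (as is the semantic_checks module name), so exact scores will vary:

```python
import pytest

# Assumes the semantic_similarity helper above lives in semantic_checks.py
# (hypothetical module name).
from semantic_checks import semantic_similarity

# (case_id, agent_response, golden_answer, threshold) - texts are illustrative.
CASES = [
    (
        "order_status_dispatched",
        "Your order has been dispatched and should reach you within 2-3 business days.",
        "The order was dispatched and is expected to arrive in 2-3 business days.",
        0.85,
    ),
    (
        "out_of_stock_query",
        "Sorry, that item is currently out of stock. We can notify you when it is back.",
        "The item is out of stock right now; we will alert you once it is restocked.",
        0.85,
    ),
]

@pytest.mark.parametrize("case_id, response, golden, threshold", CASES,
                         ids=[case[0] for case in CASES])
def test_response_matches_golden(case_id, response, golden, threshold):
    score = semantic_similarity(response, golden)
    # Log the score and, on failure, both texts, so engineers can judge
    # whether the threshold needs tuning or the agent actually regressed.
    print(f"{case_id} similarity: {score:.3f}")
    assert score >= threshold, (
        f"{case_id}: similarity {score:.3f} below threshold {threshold}\n"
        f"response: {response}\ngolden:   {golden}"
    )
```

Run with pytest -s to see the printed similarity lines, since pytest captures stdout by default.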
It gives the following output:
order_status_dispatched similarity: 0.993
out_of_stock_query similarity: 0.991
Use all-MiniLM-L6-v2 from sentence-transformers as the embedding model for a good speed/quality balance. Cache embeddings for golden reference answers at test suite startup to avoid recomputing them on every run. Set per-query-type thresholds - factual lookups need 0.85 or higher, conversational responses 0.70 or higher. When a test fails, log both the response and reference text so engineers can judge whether the threshold needs tuning or the agent actually regressed. Recompute golden embeddings whenever you update the embedding model version.
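One way to implement the caching advice is a session-scoped fixture in conftest.py. A sketch under the same assumptions as above; the golden answers are hypothetical and would normally be loaded from a fixture file:

```python
# conftest.py - hypothetical layout for caching golden embeddings once per run.
import pytest
from sentence_transformers import SentenceTransformer

# Pin the model name in one place; bumping it means the cached golden
# embeddings are rebuilt on the next run, as recommended above.
MODEL_NAME = "all-MiniLM-L6-v2"

GOLDEN_ANSWERS = {
    "order_status_dispatched": "The order was dispatched and is expected in 2-3 business days.",
    "out_of_stock_query": "The item is out of stock right now; we will alert you once restocked.",
}

@pytest.fixture(scope="session")
def model():
    # Loaded once per test session rather than once per test.
    return SentenceTransformer(MODEL_NAME)

@pytest.fixture(scope="session")
def golden_embeddings(model):
    # Embed every golden answer once at suite startup; tests reuse these
    # tensors and only embed the fresh agent response.
    ids = list(GOLDEN_ANSWERS)
    embeddings = model.encode([GOLDEN_ANSWERS[i] for i in ids], convert_to_tensor=True)
    return dict(zip(ids, embeddings))
```

A test then takes golden_embeddings as a fixture argument, embeds only the live agent response, and compares the two with util.cos_sim, so the golden side is never recomputed within a run.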