Semantic Similarity Testing for ADK Agent Responses
Author: Venkata Sudhakar
Exact string matching fails for natural language agent responses: two responses can mean the same thing yet fail a string equality check. ShopMax India instead uses semantic similarity testing, comparing agent responses against golden reference answers via cosine similarity between sentence embeddings. This makes tests robust to paraphrasing while still catching factual errors.
The approach embeds both the agent response and a reference answer using a sentence embedding model, then computes cosine similarity between the two vectors. A similarity above a threshold (typically 0.85 for factual responses) indicates a semantically equivalent answer. This works well for order status, product information, and policy queries where meaning matters more than exact wording. Lower thresholds (0.7) suit open-ended questions where multiple valid phrasings exist.
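The core check fits in a few lines. A minimal sketch, assuming sentence-transformers is installed; the example texts are hypothetical and real scores depend on the model and inputs:

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model; all-MiniLM-L6-v2 balances speed and quality (see notes below).
_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(response: str, reference: str) -> float:
    """Cosine similarity between the sentence embeddings of two texts."""
    embeddings = _model.encode([response, reference], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Hypothetical texts for illustration.
score = semantic_similarity(
    "Your order was dispatched yesterday and should arrive by Friday.",
    "The order shipped yesterday; delivery is expected by Friday.",
)
print(f"similarity: {score:.3f}")
```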
The example below shows ShopMax India embedding responses with sentence-transformers and running similarity checks as pytest assertions. The test validates that the agent response is semantically close to a golden answer without requiring identical text.
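A sketch of what such a pytest suite could look like, reusing the semantic_similarity helper above. The case IDs match the output below, but the agent responses and golden answers are hypothetical stand-ins (as is the semantic_checks module name), so exact scores will vary:

```python
import pytest

# Assumes the semantic_similarity helper above lives in semantic_checks.py
# (hypothetical module name).
from semantic_checks import semantic_similarity

# (case_id, agent_response, golden_answer, threshold) - texts are illustrative.
CASES = [
    (
        "order_status_dispatched",
        "Your order has been dispatched and should reach you within 2-3 business days.",
        "The order was dispatched and is expected to arrive in 2-3 business days.",
        0.85,
    ),
    (
        "out_of_stock_query",
        "Sorry, that item is currently out of stock. We can notify you when it is back.",
        "The item is out of stock right now; we will alert you once it is restocked.",
        0.85,
    ),
]

@pytest.mark.parametrize("case_id, response, golden, threshold", CASES,
                         ids=[case[0] for case in CASES])
def test_response_matches_golden(case_id, response, golden, threshold):
    score = semantic_similarity(response, golden)
    # Log the score and, on failure, both texts, so engineers can judge
    # whether the threshold needs tuning or the agent actually regressed.
    print(f"{case_id} similarity: {score:.3f}")
    assert score >= threshold, (
        f"{case_id}: similarity {score:.3f} below threshold {threshold}\n"
        f"response: {response}\ngolden:   {golden}"
    )
```

Run with pytest -s to see the printed similarity lines, since pytest captures stdout by default.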
It gives the following output:
order_status_dispatched similarity: 0.993
out_of_stock_query similarity: 0.991
Use all-MiniLM-L6-v2 from sentence-transformers as the embedding model for a good speed/quality balance. Cache embeddings for golden reference answers at test suite startup to avoid recomputing them on every run. Set per-query-type thresholds - factual lookups need 0.85 or higher, conversational responses 0.70 or higher. When a test fails, log both the response and reference text so engineers can judge whether the threshold needs tuning or the agent actually regressed. Recompute golden embeddings whenever you update the embedding model version.
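One way to implement the caching advice is a session-scoped fixture in conftest.py. A sketch under the same assumptions as above; the golden answers are hypothetical and would normally be loaded from a fixture file:

```python
# conftest.py - hypothetical layout for caching golden embeddings once per run.
import pytest
from sentence_transformers import SentenceTransformer

# Pin the model name in one place; bumping it means the cached golden
# embeddings are rebuilt on the next run, as recommended above.
MODEL_NAME = "all-MiniLM-L6-v2"

GOLDEN_ANSWERS = {
    "order_status_dispatched": "The order was dispatched and is expected in 2-3 business days.",
    "out_of_stock_query": "The item is out of stock right now; we will alert you once restocked.",
}

@pytest.fixture(scope="session")
def model():
    # Loaded once per test session rather than once per test.
    return SentenceTransformer(MODEL_NAME)

@pytest.fixture(scope="session")
def golden_embeddings(model):
    # Embed every golden answer once at suite startup; tests reuse these
    # tensors and only embed the fresh agent response.
    ids = list(GOLDEN_ANSWERS)
    embeddings = model.encode([GOLDEN_ANSWERS[i] for i in ids], convert_to_tensor=True)
    return dict(zip(ids, embeddings))
```

A test then takes golden_embeddings as a fixture argument, embeds only the live agent response, and compares the two with util.cos_sim, so the golden side is never recomputed within a run.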