Response Consistency Testing for ADK Agents Across Multiple Runs
Author: Venkata Sudhakar
Response consistency testing verifies that an ADK agent produces stable, reliable outputs when given the same input across multiple runs. ShopMax India runs consistency tests on its product recommendation and order status agents to ensure customers in Mumbai and Bangalore receive coherent answers regardless of minor prompt variations or non-deterministic LLM sampling.
The approach is to call the tool function N times with identical inputs, collect all responses, then measure variance using string similarity (Levenshtein ratio) or semantic overlap. A consistency score is computed as the average pairwise similarity across runs. A score below a threshold (e.g. 0.85) flags the agent as unstable and blocks the release. Temperature must be fixed at 0 for deterministic tests, or variance bands must be set appropriately for non-zero temperature.
The example below runs a product lookup tool 5 times with the same query, computes pairwise similarity using difflib.SequenceMatcher, and asserts the mean consistency score exceeds the 0.90 threshold.
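A minimal sketch of that test, assuming a hypothetical deterministic `lookup_product` stub in place of the real ADK tool function (the stub's name, query, and response text are illustrative, not from ADK):

```python
import itertools
import statistics
from difflib import SequenceMatcher

RUNS = 5
THRESHOLD = 0.90
QUERY = "wireless earbuds"


def lookup_product(query: str) -> str:
    # Illustrative stub: a real test would invoke the agent's tool here.
    # Deterministic, so identical queries return identical responses.
    return f"Product match for '{query}': SKU-1001, in stock, Rs. 4,999"


def consistency_score(responses):
    """Mean pairwise SequenceMatcher ratio across all run pairs."""
    pairs = list(itertools.combinations(responses, 2))
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return statistics.mean(scores), len(pairs)


def test_consistency_score():
    responses = [lookup_product(QUERY) for _ in range(RUNS)]
    score, n_pairs = consistency_score(responses)
    print(f"Consistency score: {score:.4f} ({n_pairs} pairs, {RUNS} runs)")
    assert score >= THRESHOLD


def test_length_variance():
    responses = [lookup_product(QUERY) for _ in range(RUNS)]
    lengths = [len(r) for r in responses]
    spread = max(lengths) - min(lengths)
    print(f"Response length variance: {spread} chars")
    assert spread == 0
```

With 5 runs there are C(5,2) = 10 pairs; identical responses score 1.0 on every pair, so the mean is 1.0 and the length spread is 0.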
Running the tests with pytest produces the following output:
Consistency score: 1.0000 (10 pairs, 5 runs)
Response length variance: 0 chars
2 passed in 0.05s
For LLM-backed tools where temperature is non-zero, set the consistency threshold lower (0.75-0.80) and increase RUNS to 10-20 for statistical significance. Log all raw responses to a file when a test fails so the exact drift pattern is visible in CI artifacts. Re-run consistency tests after every prompt change to catch regressions introduced by rewording that seems harmless but shifts the response distribution.
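The failure-logging step above can be sketched as follows. The `check_consistency` helper, the artifact file name, and the specific threshold and run counts are illustrative assumptions chosen from the bands suggested in the text, not part of ADK:

```python
import json
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path
from statistics import mean

# Illustrative values picked from the bands suggested for non-zero temperature.
LLM_THRESHOLD = 0.78   # within the 0.75-0.80 band
LLM_RUNS = 15          # within the 10-20 band


def check_consistency(responses, threshold=LLM_THRESHOLD,
                      artifact_path="consistency_failure.json"):
    """Return the mean pairwise similarity; dump raw responses on failure.

    Persisting the exact responses makes the drift pattern inspectable
    as a CI artifact when the score falls below the threshold.
    """
    scores = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(responses, 2)]
    score = mean(scores)
    if score < threshold:
        Path(artifact_path).write_text(
            json.dumps({"score": score, "responses": responses}, indent=2))
    return score
```

In CI, the artifact file can be uploaded on failure so that a reworded prompt that shifts the response distribution is diagnosable from the raw outputs rather than only from the aggregate score.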