Response Consistency Testing for ADK Agents Across Multiple Runs
Author: Venkata Sudhakar
Response consistency testing verifies that an ADK agent produces stable, reliable outputs when given the same input across multiple runs. ShopMax India runs consistency tests on its product recommendation and order status agents to ensure customers in Mumbai and Bangalore receive coherent answers regardless of minor prompt variations or non-deterministic LLM sampling.
The approach is to call the tool function N times with identical inputs, collect all responses, then measure variance using string similarity (Levenshtein ratio) or semantic overlap. A consistency score is computed as the average pairwise similarity across runs. A score below a threshold (e.g. 0.85) flags the agent as unstable and blocks the release. Temperature must be fixed at 0 for deterministic tests, or variance bands must be set appropriately for non-zero temperature.
The example below runs a product lookup tool 5 times with the same query, computes pairwise similarity using difflib.SequenceMatcher, and asserts the mean consistency score exceeds the 0.90 threshold.
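A minimal sketch of that test, assuming a hypothetical deterministic `lookup_product` stub in place of the real ADK tool function (the stub's name, query, and response text are illustrative, not from ADK):

```python
import itertools
import statistics
from difflib import SequenceMatcher

RUNS = 5
THRESHOLD = 0.90
QUERY = "wireless earbuds"


def lookup_product(query: str) -> str:
    # Illustrative stub: a real test would invoke the agent's tool here.
    # Deterministic, so identical queries return identical responses.
    return f"Product match for '{query}': SKU-1001, in stock, Rs. 4,999"


def consistency_score(responses):
    """Mean pairwise SequenceMatcher ratio across all run pairs."""
    pairs = list(itertools.combinations(responses, 2))
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return statistics.mean(scores), len(pairs)


def test_consistency_score():
    responses = [lookup_product(QUERY) for _ in range(RUNS)]
    score, n_pairs = consistency_score(responses)
    print(f"Consistency score: {score:.4f} ({n_pairs} pairs, {RUNS} runs)")
    assert score >= THRESHOLD


def test_length_variance():
    responses = [lookup_product(QUERY) for _ in range(RUNS)]
    lengths = [len(r) for r in responses]
    spread = max(lengths) - min(lengths)
    print(f"Response length variance: {spread} chars")
    assert spread == 0
```

With 5 runs there are C(5,2) = 10 pairs; identical responses score 1.0 on every pair, so the mean is 1.0 and the length spread is 0.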
Running the tests with pytest produces the following output:
Consistency score: 1.0000 (10 pairs, 5 runs)
Response length variance: 0 chars
2 passed in 0.05s
For LLM-backed tools where temperature is non-zero, set the consistency threshold lower (0.75-0.80) and increase RUNS to 10-20 for statistical significance. Log all raw responses to a file when a test fails so the exact drift pattern is visible in CI artifacts. Re-run consistency tests after every prompt change to catch regressions introduced by rewording that seems harmless but shifts the response distribution.
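The failure-logging step above can be sketched as follows. The `check_consistency` helper, the artifact file name, and the specific threshold and run counts are illustrative assumptions chosen from the bands suggested in the text, not part of ADK:

```python
import json
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path
from statistics import mean

# Illustrative values picked from the bands suggested for non-zero temperature.
LLM_THRESHOLD = 0.78   # within the 0.75-0.80 band
LLM_RUNS = 15          # within the 10-20 band


def check_consistency(responses, threshold=LLM_THRESHOLD,
                      artifact_path="consistency_failure.json"):
    """Return the mean pairwise similarity; dump raw responses on failure.

    Persisting the exact responses makes the drift pattern inspectable
    as a CI artifact when the score falls below the threshold.
    """
    scores = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(responses, 2)]
    score = mean(scores)
    if score < threshold:
        Path(artifact_path).write_text(
            json.dumps({"score": score, "responses": responses}, indent=2))
    return score
```

In CI, the artifact file can be uploaded on failure so that a reworded prompt that shifts the response distribution is diagnosable from the raw outputs rather than only from the aggregate score.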