Canary Deployment Testing for ADK Agents
Author: Venkata Sudhakar
When ShopMax India deploys a new ADK agent version, rolling it out to all customers at once exposes every user to any regression the new version introduces, including subtle quality issues that may only surface in production. Canary deployment routes a small percentage of traffic to the new version while the rest goes to the stable version, collecting quality metrics from both. Automated canary testing compares the two versions and triggers an automatic rollback if the new version's metrics fall below the stable baseline.
The canary test runs a shared query set against both agent versions, collecting quality scores, error rates, and token counts. It then applies a rollback decision function that compares the canary metrics against the baseline. If the canary quality score drops by more than an acceptable threshold or the error rate exceeds the baseline, the test fails and signals that the deployment should be halted.
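To make that decision concrete, the rollback check can be written as a small pure function. The sketch below is a minimal version under assumed conventions: metrics are plain dicts with quality (mean score in [0, 1]) and errors (failed-query count) keys, and the CANARY_THRESHOLD_DROP default of 0.1 is illustrative, not a value mandated by ADK.

CANARY_THRESHOLD_DROP = 0.1  # assumed default: max tolerated quality drop

def should_rollback(baseline: dict, canary: dict,
                    max_quality_drop: float = CANARY_THRESHOLD_DROP) -> bool:
    """Return True when the canary should be rolled back.

    Expects metrics dicts with "quality" (mean score in [0, 1]) and
    "errors" (count of failed queries).
    """
    quality_drop = baseline["quality"] - canary["quality"]
    error_delta = canary["errors"] - baseline["errors"]
    # Halt the rollout if quality degrades past the threshold or errors rise.
    return quality_drop > max_quality_drop or error_delta > 0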
The example below simulates a ShopMax India canary deployment with a stable v1 and a new v2 agent, runs both against 5 queries, and asserts that v2 meets the quality and error rate thresholds required to proceed with the full rollout.
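A minimal, self-contained version of that test might look like the following. The StubAgent class, the query set, and the hard-coded per-query scores are illustrative stand-ins for real ADK runner calls; should_rollback is the helper sketched above.

import statistics

QUERIES = [
    "Where is my order?",
    "How do I return a defective phone?",
    "What is the refund timeline for UPI payments?",
    "Do you ship to Jaipur?",
    "How do I cancel my subscription?",
]

class StubAgent:
    """Stand-in for an ADK runner: maps each query to (quality, error, tokens)."""
    def __init__(self, name, results):
        self.name = name
        self.results = results  # one (quality, is_error, tokens) tuple per query

    def run(self, query):
        return self.results[QUERIES.index(query)]

def collect_metrics(agent):
    """Run the shared query set and aggregate quality, errors, and token usage."""
    qualities, errors, tokens = [], 0, 0
    for query in QUERIES:
        quality, is_error, used_tokens = agent.run(query)
        qualities.append(quality)
        errors += int(is_error)
        tokens += used_tokens
    return {"quality": statistics.mean(qualities), "errors": errors, "tokens": tokens}

def test_canary_meets_baseline():
    # Both versions score 0.8 with no errors on this fixture, so the gate passes.
    stable_v1 = StubAgent("stable_v1", [(0.8, False, 120)] * len(QUERIES))
    canary_v2 = StubAgent("canary_v2", [(0.8, False, 110)] * len(QUERIES))

    baseline = collect_metrics(stable_v1)
    canary = collect_metrics(canary_v2)

    print(f"Stable quality: {baseline['quality']}, errors: {baseline['errors']}")
    print(f"Canary quality: {canary['quality']}, errors: {canary['errors']}")
    quality_drop = baseline["quality"] - canary["quality"]
    error_delta = canary["errors"] - baseline["errors"]
    print(f"Quality drop: {quality_drop}, Error delta: {error_delta}")

    assert not should_rollback(baseline, canary), "Canary regressed; halt rollout"

Keeping metric collection separate from the rollback decision means the same should_rollback function can later be fed live production metrics instead of fixture scores.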
Run with pytest -s (so the print statements are visible alongside the test result), it gives the following output:
Stable quality: 0.8, errors: 0
Canary quality: 0.8, errors: 0
Quality drop: 0.0, Error delta: 0
. (1 passed in 0.01s)
In production, replace stable_v1 and canary_v2 with real ADK runner instances pointed at different model versions or prompt configurations. Route 5-10% of live ShopMax India traffic to the canary and collect metrics for at least 30 minutes before comparing. Tighten CANARY_THRESHOLD_DROP to 0.05 for quality-sensitive flows like refund processing, and wire the assert failure into a deployment pipeline gate that rolls the canary back automatically, as in the sketch below.
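One way to wire that gate is sketched here. The rollback_canary hook is hypothetical (substitute your deployment system's real traffic-shift or rollback call), and canary_test.py is an assumed filename for the test above.

import subprocess
import sys

def rollback_canary() -> None:
    """Hypothetical hook: shift 100% of traffic back to stable_v1 here."""
    print("Rolling back canary_v2 to stable_v1 ...")

def canary_gate() -> None:
    """Run the canary test; roll back and abort the rollout on failure."""
    # pytest exits nonzero when the canary assertion fails.
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-s", "canary_test.py"]
    )
    if result.returncode != 0:
        rollback_canary()
        raise SystemExit("Canary failed the quality gate; rollback triggered.")
    print("Canary passed; proceeding with full rollout.")

if __name__ == "__main__":
    canary_gate()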