
Golden Dataset Testing for ADK Agents

Author: Venkata Sudhakar

ShopMax India maintains a golden dataset of 200 curated order queries with known correct responses and uses it to evaluate ADK agent quality on every prompt change or model upgrade. Golden dataset testing compares the agent's actual responses against these ground-truth answers, tracks pass rates over time, and alerts when a new version regresses below the acceptable threshold. This gives the team confidence before every deployment.

A golden dataset test runs each query through the agent, compares the response to the expected answer using an exact-match or keyword-match strategy, and records a pass or fail result and a score for each case. With pytest, the parametrize marker (pytest.mark.parametrize) lets a single test function run every dataset case as its own test. Track the dataset pass rate as a metric and set a minimum threshold (for example, 95 percent) that must be met before a deployment proceeds. Store the dataset in a JSON or CSV file versioned alongside the prompt.
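The two comparison strategies and the per-case score can be sketched as follows. This is a minimal illustration, not ShopMax India's actual grader: the names CaseResult, exact_match, keyword_match, and grade are hypothetical, and the case-insensitive comparison is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    query: str
    passed: bool
    score: float  # fraction of the expectation that was satisfied, 0.0 to 1.0


def exact_match(response: str, expected: str) -> float:
    """Return 1.0 only when the normalized strings are identical."""
    same = response.strip().lower() == expected.strip().lower()
    return 1.0 if same else 0.0


def keyword_match(response: str, keywords: list) -> float:
    """Return the fraction of expected keywords found in the response."""
    if not keywords:
        return 1.0
    lowered = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


def grade(query: str, response: str, keywords: list, min_score: float = 1.0) -> CaseResult:
    """Grade one case with the keyword strategy; it passes at or above min_score."""
    score = keyword_match(response, keywords)
    return CaseResult(query=query, passed=score >= min_score, score=score)


# One fully matching response and one that misses a keyword.
full = grade("Track ORD-1001", "ORD-1001 status is Shipped.", ["ORD-1001", "Shipped"])
partial = grade("Track ORD-1002", "ORD-1002 status is Shipped.", ["ORD-1002", "Delivered"])
print(full)
print(partial)
```

Exact match suits short, deterministic answers such as a status code, while keyword match tolerates the wording differences that are typical of model output. Recording a fractional score alongside pass or fail makes it easier to see how far a failing case is from passing.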

The example below defines a five-entry golden dataset for ShopMax India's order tracking agent. Each entry has an input query, the keywords expected in the response, and a city (the city is metadata only and is not checked by the test). The test runs all entries through a mock agent and asserts that the pass rate meets an 80 percent threshold (lowered for illustration).

# pytest discovers test_* functions automatically, so no import is needed here.

# Each case: input query, keywords the response must contain, and city metadata.
GOLDEN_DATASET = [
    {"query": "Track ORD-1001", "expected": ["ORD-1001", "Shipped"], "city": "Mumbai"},
    {"query": "Track ORD-1002", "expected": ["ORD-1002", "Delivered"], "city": "Delhi"},
    {"query": "Track ORD-1003", "expected": ["ORD-1003", "Shipped"], "city": "Bangalore"},
    {"query": "Track ORD-1004", "expected": ["ORD-1004", "Processing"], "city": "Hyderabad"},
    {"query": "Track ORD-1005", "expected": ["ORD-1005", "Shipped"], "city": "Chennai"},
]

# Canned order statuses that stand in for the real order system.
STATUS_MAP = {
    "ORD-1001": "Shipped", "ORD-1002": "Delivered",
    "ORD-1003": "Shipped", "ORD-1004": "Processing", "ORD-1005": "Shipped"
}

def mock_agent(query):
    """Stand-in for the ADK agent: look up the order ID mentioned in the query."""
    for order_id, status in STATUS_MAP.items():
        if order_id in query:
            return f"{order_id} status is {status}."
    return "Order not found."

def evaluate_response(response, expected_keywords):
    """Keyword match: pass only if every expected keyword appears in the response."""
    return all(kw in response for kw in expected_keywords)

def test_golden_dataset_pass_rate():
    results = []
    for case in GOLDEN_DATASET:
        response = mock_agent(case["query"])
        passed = evaluate_response(response, case["expected"])
        results.append({"case": case["query"], "passed": passed, "response": response})
        print(f"{'PASS' if passed else 'FAIL'}: {case['query']} -> {response[:50]}")
    # Fraction of cases that passed, compared against the 80 percent threshold.
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    print(f"Pass rate: {round(pass_rate * 100, 1)}%")
    assert pass_rate >= 0.80, f"Golden dataset pass rate below threshold: {pass_rate}"

Run with pytest -s (so that print output is not captured), it gives the following output:

PASS: Track ORD-1001 -> ORD-1001 status is Shipped.
PASS: Track ORD-1002 -> ORD-1002 status is Delivered.
PASS: Track ORD-1003 -> ORD-1003 status is Shipped.
PASS: Track ORD-1004 -> ORD-1004 status is Processing.
PASS: Track ORD-1005 -> ORD-1005 status is Shipped.
Pass rate: 100.0%
. (1 passed in 0.01s)
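The parametrize approach mentioned earlier can be sketched as below. This is a reduced illustration under stated assumptions: GOLDEN_CASES, STATUSES, stub_agent, and test_golden_case are hypothetical names, and the stub stands in for a real ADK agent call.

```python
import pytest

# Hypothetical two-case dataset as (query, expected keywords) pairs.
GOLDEN_CASES = [
    ("Track ORD-1001", ["ORD-1001", "Shipped"]),
    ("Track ORD-1002", ["ORD-1002", "Delivered"]),
]

STATUSES = {"ORD-1001": "Shipped", "ORD-1002": "Delivered"}


def stub_agent(query):
    """Stand-in for the agent: answer from a fixed status table."""
    for order_id, status in STATUSES.items():
        if order_id in query:
            return f"{order_id} status is {status}."
    return "Order not found."


# One test function; pytest generates one test per dataset case.
@pytest.mark.parametrize(
    "query, expected",
    GOLDEN_CASES,
    ids=[query for query, _ in GOLDEN_CASES],
)
def test_golden_case(query, expected):
    response = stub_agent(query)
    missing = [kw for kw in expected if kw not in response]
    assert not missing, f"{query!r} is missing {missing} in {response!r}"
```

With this layout each case appears as its own line in the pytest report, which makes a failing query easy to spot. The trade-off is that any single failure fails the run, so a pass-rate threshold below 100 percent still needs an aggregate test like the one shown earlier.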

In production, ShopMax India should store the golden dataset in a versioned file (for example, golden_v3.json) so that dataset changes are auditable in git. Set the pass rate threshold at 95 percent for production deployments and 85 percent for staging. When a new prompt or model causes a regression, investigate each failing case individually rather than lowering the threshold. Regressions usually cluster around specific intent types, such as returns or EMI queries, that need targeted prompt fixes.
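One way to wire these recommendations together is sketched below. It assumes each dataset entry carries an "intent" field and that results are dicts with "intent" and "passed" keys; those fields, and the names THRESHOLDS, load_dataset, and summarize, are hypothetical and not part of the original example. The temporary directory only keeps the sketch self-contained; in practice the file would live in the repository.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Thresholds from the text: 95 percent for production, 85 percent for staging.
THRESHOLDS = {"production": 0.95, "staging": 0.85}


def load_dataset(path):
    """Read a golden dataset stored as a JSON list of case objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(results, environment):
    """Compute the pass rate and group failures by intent for one environment."""
    if not results:
        raise ValueError("no results to summarize")
    threshold = THRESHOLDS[environment]
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    failures = Counter(r["intent"] for r in results if not r["passed"])
    return {
        "pass_rate": pass_rate,
        "threshold": threshold,
        "deployable": pass_rate >= threshold,
        "failures_by_intent": dict(failures),
    }


# Round-trip a one-entry dataset through a versioned-style JSON file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "golden_v3.json"
    entry = {"query": "Track ORD-1001", "expected": ["ORD-1001", "Shipped"], "intent": "tracking"}
    path.write_text(json.dumps([entry]), encoding="utf-8")
    dataset = load_dataset(path)

# Illustrative results: 18 of 20 cases pass and both failures are return queries.
results = [{"intent": "tracking", "passed": True}] * 18
results += [{"intent": "returns", "passed": False}] * 2

staging = summarize(results, "staging")
production = summarize(results, "production")
print(staging)
print(production)
```

In this made-up run the 90 percent pass rate clears the staging threshold but not the production one, and the failure count by intent points directly at the return queries as the place to look first.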
