Golden Dataset Testing for ADK Agents
Author: Venkata Sudhakar
ShopMax India maintains a golden dataset of 200 curated order queries with known correct responses, used to evaluate ADK agent quality on every prompt change or model upgrade. Golden dataset testing compares the agent's actual responses against these ground-truth answers, tracks pass rates over time, and alerts when a new version regresses below the acceptable threshold - giving the team confidence before every deployment.
A golden dataset test runs each query through the agent, compares the response to the expected answer using an exact-match or keyword-match strategy, and records a pass or fail result and a score for each case. Use pytest.mark.parametrize to drive every dataset case from a single test function, so that each case is reported as its own test item. Track the dataset pass rate as a metric and set a minimum threshold (e.g. 95 percent) that must be met before a deployment proceeds. Store the dataset in a JSON or CSV file versioned alongside the prompt.
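The two comparison strategies can be sketched as small scoring functions. This is a minimal illustration rather than part of ADK: the names exact_match, keyword_match, evaluate, and CaseResult are hypothetical, and the normalisation (trimming and lower-casing) is an assumption about what counts as a match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseResult:
    """Outcome of one golden dataset case (hypothetical structure)."""
    query: str
    passed: bool
    score: float


def exact_match(response: str, expected: str) -> float:
    """Return 1.0 when the trimmed, lower-cased strings are identical, else 0.0."""
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0


def keyword_match(response: str, keywords: List[str]) -> float:
    """Return the fraction of expected keywords found in the response."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def evaluate(query: str, response: str, keywords: List[str],
             pass_score: float = 1.0) -> CaseResult:
    """Score a response by keyword match and mark it passed if it reaches pass_score."""
    score = keyword_match(response, keywords)
    return CaseResult(query=query, passed=score >= pass_score, score=score)
```

Keyword matching tolerates rephrasing by the model, while exact matching suits responses with a fixed format; recording the score as well as the pass or fail flag makes partial regressions visible.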
The example below defines a five-entry golden dataset for ShopMax India's order tracking agent. Each entry has an input query, expected keywords in the response, and a city. The test runs all entries through a mock agent and asserts that the pass rate meets the 80 percent threshold (lowered for illustration).
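A minimal sketch of such a test follows, written to match the description above. The field names (input, expected_keywords, city), the city values, the ORDER_STATUS lookup, and the mock_agent function are all assumptions made for illustration; a real test would call the ADK agent in place of mock_agent. Run it with pytest -q -s so that the printed lines are shown.

```python
# test_golden_dataset.py
# Illustrative sketch: field names, cities, and mock_agent are assumptions.

GOLDEN_DATASET = [
    {"input": "Track ORD-1001", "expected_keywords": ["ORD-1001", "Shipped"], "city": "Mumbai"},
    {"input": "Track ORD-1002", "expected_keywords": ["ORD-1002", "Delivered"], "city": "Delhi"},
    {"input": "Track ORD-1003", "expected_keywords": ["ORD-1003", "Shipped"], "city": "Bengaluru"},
    {"input": "Track ORD-1004", "expected_keywords": ["ORD-1004", "Processing"], "city": "Chennai"},
    {"input": "Track ORD-1005", "expected_keywords": ["ORD-1005", "Shipped"], "city": "Hyderabad"},
]

# Stand-in order store so the mock agent can answer without calling a model.
ORDER_STATUS = {
    "ORD-1001": "Shipped",
    "ORD-1002": "Delivered",
    "ORD-1003": "Shipped",
    "ORD-1004": "Processing",
    "ORD-1005": "Shipped",
}

PASS_THRESHOLD = 0.80  # lowered for illustration


def mock_agent(query: str) -> str:
    """Hypothetical agent: looks up the order ID given as the last word of the query."""
    order_id = query.split()[-1]
    status = ORDER_STATUS.get(order_id, "Unknown")
    return f"{order_id} status is {status}."


def run_golden_dataset(dataset, agent) -> float:
    """Run every case through the agent and return the fraction that passed."""
    passed = 0
    for case in dataset:
        response = agent(case["input"])
        # Keyword match: every expected keyword must appear in the response.
        ok = all(kw.lower() in response.lower() for kw in case["expected_keywords"])
        passed += int(ok)
        print(f"{'PASS' if ok else 'FAIL'}: {case['input']} -> {response}")
    pass_rate = passed / len(dataset)
    print(f"Pass rate: {pass_rate * 100:.1f}%")
    return pass_rate


def test_golden_dataset_pass_rate():
    assert run_golden_dataset(GOLDEN_DATASET, mock_agent) >= PASS_THRESHOLD
```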
It gives the following output:
PASS: Track ORD-1001 -> ORD-1001 status is Shipped.
PASS: Track ORD-1002 -> ORD-1002 status is Delivered.
PASS: Track ORD-1003 -> ORD-1003 status is Shipped.
PASS: Track ORD-1004 -> ORD-1004 status is Processing.
PASS: Track ORD-1005 -> ORD-1005 status is Shipped.
Pass rate: 100.0%
. (1 passed in 0.01s)
In production, ShopMax India should store the golden dataset in a versioned file (golden_v3.json) so that dataset changes are auditable in git. Set the pass rate threshold at 95 percent for production deployments and 85 percent for staging. When a new prompt or model causes a regression, investigate each failing case individually rather than lowering the threshold - regressions usually cluster around specific intent types like returns or EMI queries that need targeted prompt fixes.
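The per-environment thresholds and the advice to inspect failures by intent can be sketched as a small deployment gate. This is an illustrative sketch only: gate_deployment, the THRESHOLDS mapping, and the intent and passed keys on each result are hypothetical names, not part of ADK or of ShopMax India's pipeline.

```python
from collections import Counter

# Minimum pass rates per environment, taken from the guidance above.
THRESHOLDS = {"production": 0.95, "staging": 0.85}


def gate_deployment(results, environment):
    """Decide whether a deployment may proceed.

    results: list of dicts with hypothetical keys 'intent' (str) and 'passed' (bool).
    Returns (allowed, pass_rate, failures_by_intent).
    """
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    # Count failures per intent so clusters (e.g. returns, EMI) stand out.
    failures_by_intent = Counter(r["intent"] for r in results if not r["passed"])
    allowed = pass_rate >= THRESHOLDS[environment]
    return allowed, pass_rate, failures_by_intent
```

A run that passes 18 of 20 cases, with both failures on returns queries, would clear the staging gate but not the production gate, and the failure counter would point directly at the returns intent as the place to fix the prompt.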