Quality Scorecards for ADK Agents - Tracking Metrics Across Releases
Author: Venkata Sudhakar
Quality scorecards provide a structured summary of ADK agent performance across multiple dimensions - accuracy, helpfulness, latency, and safety - tracked across releases so regressions are visible before deployment. ShopMax India generates a quality scorecard after every release candidate build to compare the new agent version against the production baseline across all customer-facing agents in Mumbai, Delhi, and Hyderabad.
A scorecard is a dict mapping metric names to values for a given agent version. Scorecard comparison diffs two dicts and flags any metric that degraded beyond a tolerance band. Metrics are computed by running the agent against a fixed evaluation dataset and aggregating results. The scorecard is serialized to JSON and stored as a CI artifact so historical trends can be plotted over time.
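As a minimal sketch of that shape, the snippet below builds a scorecard dict and round-trips it through JSON, the way a CI artifact would be written and reloaded. The metric names, values, and the scorecards/ path are illustrative assumptions, not a fixed schema.

```python
import json

# Illustrative scorecard dict: metric names and values are assumptions,
# mapping each quality dimension to an aggregated number for one version.
scorecard = {
    "version": "v1.5.0",
    "accuracy": 0.93,
    "helpfulness": 0.88,
    "p95_latency_ms": 190.0,
    "safety_pass_rate": 1.0,
}

# Serialize to JSON for storage as a CI artifact
# (e.g. scorecards/v1.5.0.json in the repository).
artifact = json.dumps(scorecard, indent=2, sort_keys=True)

# Reloading yields an identical dict, so historical trends can be replotted.
restored = json.loads(artifact)
assert restored == scorecard
```

Because JSON preserves the flat metric-to-value mapping exactly, comparing two stored artifacts reduces to diffing two dicts.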
The example below computes a quality scorecard for a ShopMax India order agent, compares it against a baseline scorecard, and asserts no metric regressed beyond its allowed tolerance.
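A sketch of that flow is below, using plain Python and a pytest-style test. The function names (compute_scorecard, compare_scorecards), the metric names, the tolerance values, and the synthetic evaluation results are all illustrative assumptions rather than a specific ADK API; the eval data is constructed so the aggregates come out to accuracy 0.93 and p95 latency 190 ms.

```python
import math


def compute_scorecard(version, eval_results):
    """Aggregate per-case evaluation results into a scorecard dict."""
    n = len(eval_results)
    latencies = sorted(r["latency_ms"] for r in eval_results)
    p95_index = math.ceil(0.95 * n) - 1  # nearest-rank 95th percentile
    return {
        "version": version,
        "accuracy": sum(r["correct"] for r in eval_results) / n,
        "p95_latency_ms": latencies[p95_index],
        "safety_pass_rate": sum(r["safe"] for r in eval_results) / n,
    }


def compare_scorecards(baseline, candidate, tolerances):
    """Return metrics that degraded beyond their tolerance band."""
    regressions = []
    for metric, tol in tolerances.items():
        delta = candidate[metric] - baseline[metric]
        # Latency regresses upward; accuracy and safety regress downward.
        regressed = delta > tol if metric.endswith("latency_ms") else -delta > tol
        if regressed:
            regressions.append((metric, baseline[metric], candidate[metric]))
    return regressions


# Synthetic eval run: 100 cases, 93 correct, p95 latency 190 ms.
latencies = [120.0] * 94 + [190.0, 200.0, 210.0, 220.0, 230.0, 240.0]
results = [
    {"correct": i < 93, "safe": True, "latency_ms": lat}
    for i, lat in enumerate(latencies)
]

candidate = compute_scorecard("v1.5.0", results)
baseline = {
    "version": "v1.4.0",
    "accuracy": 0.94,
    "p95_latency_ms": 180.0,
    "safety_pass_rate": 1.0,
}
tolerances = {"accuracy": 0.02, "p95_latency_ms": 15.0, "safety_pass_rate": 0.0}

print(
    f"Scorecard {candidate['version']}: "
    f"accuracy={candidate['accuracy']}, p95={candidate['p95_latency_ms']}ms"
)


def test_no_metric_regressed():
    assert compare_scorecards(baseline, candidate, tolerances) == []
```

Here the candidate's accuracy dipped 0.01 and its p95 latency rose 10 ms, but both deltas sit inside their tolerance bands, so the release gate passes.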
Running the example under pytest gives the following output:
Scorecard v1.5.0: accuracy=0.93, p95=190.0ms
1 passed in 0.03s
Store baseline scorecards in a scorecards/ directory under version control so that the comparison is always against the last released version, not an arbitrary snapshot. Add a scorecard summary step to the CI pipeline that prints a diff table to the PR comment so reviewers see the quality impact at a glance. Track the scorecard trend over 10 releases to spot slow drift that individual release comparisons miss.
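The slow-drift point can be sketched as a cumulative check over the stored history. The release list and tolerance values below are invented for illustration: each individual release stays inside a 0.01 per-release accuracy tolerance, yet the cumulative drop exceeds a trend threshold that individual comparisons would never trip.

```python
# Hypothetical history loaded from the scorecards/ directory; the versions,
# accuracies, and thresholds are invented to illustrate slow drift.
history = [
    {"version": "v1.0.0", "accuracy": 0.950},
    {"version": "v1.1.0", "accuracy": 0.944},
    {"version": "v1.2.0", "accuracy": 0.938},
    {"version": "v1.3.0", "accuracy": 0.932},
    {"version": "v1.4.0", "accuracy": 0.926},
    {"version": "v1.5.0", "accuracy": 0.920},
]

PER_RELEASE_TOL = 0.01  # each release-over-release comparison allows this much
TREND_TOL = 0.02        # but cumulative drift across the window may not exceed this

# Pairwise check: no single release regresses beyond its tolerance band.
step_regressions = [
    (prev["version"], cur["version"])
    for prev, cur in zip(history, history[1:])
    if prev["accuracy"] - cur["accuracy"] > PER_RELEASE_TOL
]

# Trend check: total drift from the oldest to the newest scorecard.
total_drift = history[0]["accuracy"] - history[-1]["accuracy"]

print(f"per-release regressions: {step_regressions}")
print(f"total drift: {total_drift:.3f} (exceeds trend tolerance: {total_drift > TREND_TOL})")
```

No pairwise comparison fires, but the trend check flags a 0.03 cumulative accuracy loss, which is exactly the failure mode a multi-release view exists to catch.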