Calibration Testing for ADK Agent Confidence Scores
Author: Venkata Sudhakar
Calibration testing verifies that an ADK agent's confidence scores match its observed accuracy: a well-calibrated agent that reports 80% confidence should be correct about 80% of the time. ShopMax India calibrates its refund eligibility and stock availability agents because an overconfident agent (claiming 95% confidence at 70% accuracy) misleads downstream systems into skipping human review on decisions that should be escalated for customers in Hyderabad and Chennai.
Calibration is measured using Expected Calibration Error (ECE): group predictions into confidence bins (e.g. 0.6-0.7, 0.7-0.8, 0.8-0.9), compute the average confidence and the average accuracy within each bin, and take the count-weighted average of the absolute gaps between the two. A perfectly calibrated model has an ECE of 0, and ECE below 0.05 is generally considered well-calibrated for production agents. The test asserts that ECE stays within this acceptable band after each release.
The example below defines an ECE calculator, runs it against a synthetic set of confidence-prediction pairs for a ShopMax India stock agent, and asserts ECE is below the production threshold.
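A minimal sketch of that test, using NumPy and pytest. The ECE helper, the file name, and the synthetic confidence-outcome pairs are illustrative assumptions rather than ADK APIs; the pairs are constructed so that roughly half the predictions are correct while each bin's confidence stays close to its accuracy, keeping ECE just inside the threshold.

```python
# test_stock_agent_calibration.py (illustrative name)
import numpy as np

ECE_THRESHOLD = 0.05  # production gate for a well-calibrated agent

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Count-weighted ECE over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == n_bins - 1:
            # Close the last bin on the right so confidence 1.0 is counted.
            in_bin = (confidences >= lo) & (confidences <= hi)
        else:
            in_bin = (confidences >= lo) & (confidences < hi)
        if not in_bin.any():
            continue
        gap = abs(outcomes[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight gap by the bin's share of samples
    return ece

# Synthetic stock-availability predictions: 12 low-confidence calls
# (6 of 12 correct) and 8 mid-confidence calls (5 of 8 correct).
STOCK_CONFIDENCES = [0.52] * 12 + [0.695] * 8
STOCK_OUTCOMES = [1] * 6 + [0] * 6 + [1] * 5 + [0] * 3

def test_stock_agent_ece_within_threshold():
    ece = expected_calibration_error(STOCK_CONFIDENCES, STOCK_OUTCOMES)
    acc = float(np.mean(STOCK_OUTCOMES))
    print(f"ECE: {ece:.4f}, Accuracy: {acc:.2f}, n={len(STOCK_OUTCOMES)}")
    assert ece < ECE_THRESHOLD

def test_perfect_calibration_is_zero():
    # Confidence 0.5 with 50% accuracy: confidence equals accuracy in its bin.
    ece = expected_calibration_error([0.5] * 10, [1] * 5 + [0] * 5)
    print(f"Perfect calibration ECE: {ece:.4f}")
    assert ece == 0.0
```

Running the file with `pytest -s` disables output capture so the printed ECE lines appear alongside the pass/fail summary.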
Running it under pytest gives the following output:
ECE: 0.0400, Accuracy: 0.55, n=20
Perfect calibration ECE: 0.0000
2 passed in 0.05s
Collect (confidence, outcome) pairs from production traffic logs weekly and recompute ECE in a monitoring job. If ECE drifts above 0.10, retrain or recalibrate the confidence scoring function using Platt scaling or isotonic regression. For multi-class agents (e.g. routing to order, return, or search), compute per-class ECE separately, since a model can be well-calibrated overall yet badly calibrated on a specific high-stakes class such as refund decisions.
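The recalibration step can be sketched with scikit-learn's isotonic regression. The data below simulates an overconfident agent that claims roughly 95% confidence while being right about 70% of the time; all names and numbers are illustrative, not part of any ADK API.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_recalibrator(raw_confidences, outcomes):
    """Learn a monotone map from raw confidence to empirical accuracy."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(raw_confidences, outcomes)
    return iso

# Simulated weekly log: overconfident agent, ~0.95 claimed vs ~0.70 actual.
rng = np.random.default_rng(0)
raw = rng.uniform(0.9, 1.0, size=500)          # claimed confidences
hits = (rng.random(500) < 0.7).astype(float)   # 1 = prediction was correct

recalibrator = fit_recalibrator(raw, hits)
calibrated = recalibrator.predict(raw)

print(f"mean raw confidence:        {raw.mean():.2f}")
print(f"mean calibrated confidence: {calibrated.mean():.2f}")  # pulled toward ~0.70
```

Because isotonic regression is monotone, the agent's relative ranking of its own predictions is preserved; only the scale is corrected, so downstream escalation thresholds keep their meaning.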