
Calibration Testing for ADK Agent Confidence Scores

Author: Venkata Sudhakar

Calibration testing verifies that an ADK agent's confidence scores match its observed accuracy: a well-calibrated agent that reports 80% confidence should be correct about 80% of the time. ShopMax India calibrates its refund-eligibility and stock-availability agents because an overconfident agent (one claiming 95% confidence when its accuracy is 70%) misleads downstream systems into skipping human review on decisions that should be escalated for customers in Hyderabad and Chennai.

Calibration is measured using Expected Calibration Error (ECE): group predictions into confidence bins (e.g. 0.6-0.7, 0.7-0.8, 0.8-0.9), compute the average accuracy and the average confidence within each bin, and average the absolute gap between the two across bins, weighting each bin by the fraction of predictions it holds. A perfectly calibrated model has an ECE of 0, and an ECE below 0.05 is generally considered well calibrated for production agents. The test asserts that ECE stays within the acceptable band after each release.
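Spelled out, with n_b the number of predictions in bin b, N the total, and acc(b) and conf(b) the bin's average accuracy and average confidence:

ECE = sum over bins b of (n_b / N) * |acc(b) - conf(b)|

For example, a bin holding 8 of 20 predictions with average confidence 0.55 but only 4 of them correct (accuracy 0.50) contributes (8/20) * |0.50 - 0.55| = 0.02 to the ECE.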

The example below defines an ECE calculator, runs it against a synthetic set of confidence-prediction pairs for a ShopMax India stock agent, and asserts ECE is below the production threshold.
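A minimal sketch follows, assuming ten equal-width bins and plain pytest test functions; the helper name expected_calibration_error and the synthetic pairs are illustrative, constructed so the printed figures line up with the output shown below. Run it with pytest -s so the print lines appear.

import numpy as np
import pytest


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: per-bin |accuracy - mean confidence| gaps, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        weight = in_bin.mean()  # fraction of all predictions in this bin
        gap = abs(outcomes[in_bin].mean() - confidences[in_bin].mean())
        ece += weight * gap
    return ece


# Synthetic (confidence, correct) pairs for the stock availability agent:
# 3/6 correct at 0.45, 4/8 at 0.55, 4/6 at 0.65 -> accuracy 0.55, ECE 0.04.
STOCK_AGENT_RESULTS = (
    [(0.45, o) for o in (1, 1, 1, 0, 0, 0)]
    + [(0.55, o) for o in (1, 1, 1, 1, 0, 0, 0, 0)]
    + [(0.65, o) for o in (1, 1, 1, 1, 0, 0)]
)

ECE_THRESHOLD = 0.05  # production acceptance band


def test_stock_agent_ece_below_threshold():
    conf, correct = zip(*STOCK_AGENT_RESULTS)
    ece = expected_calibration_error(conf, correct)
    print(f"ECE: {ece:.4f}, Accuracy: {np.mean(correct):.2f}, n={len(correct)}")
    assert ece < ECE_THRESHOLD


def test_perfectly_calibrated_scores_give_zero_ece():
    # 0.5-confidence predictions correct half the time, 0.75 three quarters.
    pairs = [(0.5, 1), (0.5, 1), (0.5, 0), (0.5, 0),
             (0.75, 1), (0.75, 1), (0.75, 1), (0.75, 0)]
    conf, correct = zip(*pairs)
    ece = expected_calibration_error(conf, correct)
    print(f"Perfect calibration ECE: {ece:.4f}")
    assert ece == pytest.approx(0.0)

Ten equal-width bins is a common default; with only 20 samples some bins stay empty, which the loop simply skips.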


Running the tests gives the following output:

ECE: 0.0400, Accuracy: 0.55, n=20
Perfect calibration ECE: 0.0000
2 passed in 0.05s

Collect (confidence, outcome) pairs from production traffic logs weekly and recompute ECE in a monitoring job. If ECE drifts above 0.10, retrain or recalibrate the confidence scoring function with Platt scaling or isotonic regression, as sketched below. For multi-class agents (e.g. routing to order, return, or search), compute per-class ECE separately, since a model can be well calibrated overall but badly calibrated on a specific high-stakes class like refund decisions.
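A sketch of the recalibration step, assuming scikit-learn is available (the arrays stand in for a week of logged pairs and are illustrative only):

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Logged (raw confidence, outcome) pairs -- illustrative placeholder values.
raw_conf = np.array([0.45, 0.55, 0.60, 0.65, 0.72, 0.80, 0.88, 0.91, 0.95])
outcome = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1])

# Fit a monotone mapping from raw confidence to empirical accuracy.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_conf, outcome)

# At serving time, pass the agent's raw score through the fitted mapping
# before it reaches downstream escalation logic.
print(calibrator.predict([0.95, 0.70]))

Per-class ECE needs no extra machinery: filter the logged pairs down to one class (e.g. refund decisions) and run the same expected_calibration_error helper on that subset.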

