Calibration Testing for ADK Agent Confidence Scores
Author: Venkata Sudhakar
Calibration testing verifies that an ADK agent's confidence scores match its observed accuracy: a well-calibrated agent that reports 80% confidence should be correct about 80% of the time. ShopMax India calibrates its refund eligibility and stock availability agents because an overconfident agent (claiming 95% confidence at 70% accuracy) misleads downstream systems into skipping human review on decisions that should be escalated for customers in Hyderabad and Chennai.
Calibration is measured using Expected Calibration Error (ECE): group predictions into confidence bins (e.g. 0.6-0.7, 0.7-0.8, 0.8-0.9), compute the average confidence and the average accuracy within each bin, and take the count-weighted average of the absolute gaps between the two. A perfectly calibrated model has an ECE of 0, and ECE below 0.05 is generally considered well-calibrated for production agents. The test asserts that ECE stays within this acceptable band after each release.
The example below defines an ECE calculator, runs it against a synthetic set of confidence-prediction pairs for a ShopMax India stock agent, and asserts ECE is below the production threshold.
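A minimal sketch of that test, using NumPy and pytest. The ECE helper, the file name, and the synthetic confidence-outcome pairs are illustrative assumptions rather than ADK APIs; the pairs are constructed so that roughly half the predictions are correct while each bin's confidence stays close to its accuracy, keeping ECE just inside the threshold.

```python
# test_stock_agent_calibration.py (illustrative name)
import numpy as np

ECE_THRESHOLD = 0.05  # production gate for a well-calibrated agent

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Count-weighted ECE over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == n_bins - 1:
            # Close the last bin on the right so confidence 1.0 is counted.
            in_bin = (confidences >= lo) & (confidences <= hi)
        else:
            in_bin = (confidences >= lo) & (confidences < hi)
        if not in_bin.any():
            continue
        gap = abs(outcomes[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight gap by the bin's share of samples
    return ece

# Synthetic stock-availability predictions: 12 low-confidence calls
# (6 of 12 correct) and 8 mid-confidence calls (5 of 8 correct).
STOCK_CONFIDENCES = [0.52] * 12 + [0.695] * 8
STOCK_OUTCOMES = [1] * 6 + [0] * 6 + [1] * 5 + [0] * 3

def test_stock_agent_ece_within_threshold():
    ece = expected_calibration_error(STOCK_CONFIDENCES, STOCK_OUTCOMES)
    acc = float(np.mean(STOCK_OUTCOMES))
    print(f"ECE: {ece:.4f}, Accuracy: {acc:.2f}, n={len(STOCK_OUTCOMES)}")
    assert ece < ECE_THRESHOLD

def test_perfect_calibration_is_zero():
    # Confidence 0.5 with 50% accuracy: confidence equals accuracy in its bin.
    ece = expected_calibration_error([0.5] * 10, [1] * 5 + [0] * 5)
    print(f"Perfect calibration ECE: {ece:.4f}")
    assert ece == 0.0
```

Running the file with `pytest -s` disables output capture so the printed ECE lines appear alongside the pass/fail summary.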
Running it under pytest gives the following output:
ECE: 0.0400, Accuracy: 0.55, n=20
Perfect calibration ECE: 0.0000
2 passed in 0.05s
Collect (confidence, outcome) pairs from production traffic logs weekly and recompute ECE in a monitoring job. If ECE drifts above 0.10, retrain or recalibrate the confidence scoring function using Platt scaling or isotonic regression. For multi-class agents (e.g. routing to order, return, or search), compute per-class ECE separately, since a model can be well-calibrated overall yet badly calibrated on a specific high-stakes class such as refund decisions.
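The recalibration step can be sketched with scikit-learn's isotonic regression. The data below simulates an overconfident agent that claims roughly 95% confidence while being right about 70% of the time; all names and numbers are illustrative, not part of any ADK API.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_recalibrator(raw_confidences, outcomes):
    """Learn a monotone map from raw confidence to empirical accuracy."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(raw_confidences, outcomes)
    return iso

# Simulated weekly log: overconfident agent, ~0.95 claimed vs ~0.70 actual.
rng = np.random.default_rng(0)
raw = rng.uniform(0.9, 1.0, size=500)          # claimed confidences
hits = (rng.random(500) < 0.7).astype(float)   # 1 = prediction was correct

recalibrator = fit_recalibrator(raw, hits)
calibrated = recalibrator.predict(raw)

print(f"mean raw confidence:        {raw.mean():.2f}")
print(f"mean calibrated confidence: {calibrated.mean():.2f}")  # pulled toward ~0.70
```

Because isotonic regression is monotone, the agent's relative ranking of its own predictions is preserved; only the scale is corrected, so downstream escalation thresholds keep their meaning.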