Confidence Scoring for ADK Agent Decisions

Author: Venkata Sudhakar

Confidence scoring attaches a certainty estimate to each ADK agent decision so that low-confidence responses can be flagged for human review before they are shown to customers. ShopMax India uses confidence scores on its return eligibility and refund calculation agents: decisions with a confidence below 0.80 are routed to a human agent in the Hyderabad support center instead of being handled automatically, which protects customers from erroneous outcomes.

A confidence score is computed from signals available at decision time: the number of matching records found, whether required fields were present, and whether the input fell within the training distribution. The score is a float from 0.0 to 1.0 attached to the tool response dict. Tests verify three things: high-confidence cases produce scores at or above the threshold, low-confidence cases fall below it, and edge cases at the threshold boundary return a score with the correct escalation flag set.
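As a minimal sketch of how the three signals above could be combined, the snippet below deducts a penalty for each weak signal and attaches the result to a response dict. The names `DecisionSignals`, `score_confidence`, and `build_tool_response`, as well as the penalty weights, are illustrative assumptions and not part of ADK or of the ShopMax implementation.

```python
from dataclasses import dataclass


@dataclass
class DecisionSignals:
    """Hypothetical signals gathered at decision time."""
    matching_records: int          # how many records matched the lookup
    required_fields_present: bool  # were all mandatory inputs supplied?
    in_distribution: bool          # does the input resemble previously seen cases?


def score_confidence(signals: DecisionSignals) -> float:
    """Combine the signals into a score in [0.0, 1.0].

    The weights are assumed for illustration; they are not calibrated values.
    """
    score = 1.0
    if signals.matching_records != 1:
        # No match, or an ambiguous multi-match, both reduce certainty.
        score -= 0.5
    if not signals.required_fields_present:
        score -= 0.3
    if not signals.in_distribution:
        score -= 0.2
    # Clamp and round so float error cannot leak into threshold comparisons.
    return round(min(1.0, max(0.0, score)), 2)


def build_tool_response(decision: str, signals: DecisionSignals,
                        threshold: float = 0.80) -> dict:
    """Attach the confidence score and escalation flag to the response dict."""
    confidence = score_confidence(signals)
    return {
        "decision": decision,
        "confidence": confidence,
        "escalate": confidence < threshold,  # below the threshold goes to a human
    }
```

Keeping the scoring function separate from the response builder means the weights can be tested and tuned without touching the tool's output format.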

The example below defines a return eligibility tool that computes a confidence score from order age, payment status, and item condition, then runs three test cases asserting correct confidence bands and escalation routing.

import pytest
from dataclasses import dataclass

HIGH_CONFIDENCE_THRESHOLD = 0.80

@dataclass
class ReturnDecision:
    eligible: bool
    confidence: float
    escalate: bool
    reason: str

def assess_return_eligibility(order_age_days: int, paid: bool, condition: str) -> ReturnDecision:
    """Start from full confidence and deduct a fixed penalty for each failed check."""
    score = 1.0
    reason_parts = []
    if order_age_days > 30:
        score -= 0.4
        reason_parts.append("order older than 30 days")
    if not paid:
        score -= 0.3
        reason_parts.append("payment not confirmed")
    if condition not in ("new", "unopened"):
        score -= 0.2
        reason_parts.append("item not in original condition")
    # Round before comparing so float error cannot flip the threshold check.
    confidence = round(max(0.0, score), 2)
    eligible = confidence >= HIGH_CONFIDENCE_THRESHOLD
    return ReturnDecision(
        eligible=eligible,
        confidence=confidence,
        escalate=not eligible,  # anything below the threshold goes to a human
        reason=", ".join(reason_parts) if reason_parts else "all checks passed",
    )

def test_high_confidence_eligible_return():
    result = assess_return_eligibility(order_age_days=5, paid=True, condition="unopened")
    assert result.confidence >= HIGH_CONFIDENCE_THRESHOLD
    assert result.eligible is True
    assert result.escalate is False
    print(f"Confidence={result.confidence}, eligible={result.eligible}, escalate={result.escalate}")

def test_low_confidence_triggers_escalation():
    result = assess_return_eligibility(order_age_days=45, paid=False, condition="used")
    assert result.confidence < HIGH_CONFIDENCE_THRESHOLD
    assert result.escalate is True
    print(f"Confidence={result.confidence}, reason={result.reason}")

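def test_borderline_case_at_threshold_is_not_escalated():
    # One failed check (item opened) costs 0.2, leaving the score exactly at 0.80.
    # The threshold is inclusive, so this case is still handled automatically.
    result = assess_return_eligibility(order_age_days=28, paid=True, condition="opened")
    assert result.confidence == pytest.approx(HIGH_CONFIDENCE_THRESHOLD)
    assert result.escalate is False
    print(f"Borderline: confidence={result.confidence}, reason={result.reason}")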

Running the tests with pytest -s (so that print output is shown) gives output similar to the following:

Confidence=1.0, eligible=True, escalate=False
Confidence=0.1, reason=order older than 30 days, payment not confirmed, item not in original condition
Borderline: confidence=0.8, reason=item not in original condition
3 passed in 0.04s

In production, log confidence scores alongside every agent decision to build a calibration dataset over time. Use that dataset to adjust thresholds per agent type. For example, a refund agent should have a higher threshold than a product recommendation agent because a wrong refund decision costs much more than a poor recommendation. Expose the confidence score in the API response so downstream systems can apply their own escalation policies.
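The sketch below illustrates one way the logged scores might be used, assuming each log record has been reduced to a (confidence, was_correct) pair after human review. The threshold values, agent type names, and the helpers `should_escalate` and `calibration_table` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-agent thresholds; real values would come from calibration data.
THRESHOLDS = {"refund": 0.90, "return_eligibility": 0.80, "recommendation": 0.60}
DEFAULT_THRESHOLD = 0.80


def should_escalate(agent_type: str, confidence: float) -> bool:
    """Apply the threshold for this agent type, falling back to the default."""
    return confidence < THRESHOLDS.get(agent_type, DEFAULT_THRESHOLD)


def calibration_table(records, bucket_pct=20):
    """Group (confidence, was_correct) pairs into confidence buckets.

    Returns {bucket_lower_bound_pct: (count, observed_accuracy)}.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in records:
        pct = round(confidence * 100)  # integer percentage points avoid float bucketing errors
        # A score of exactly 1.0 is folded into the top bucket.
        lower = min(pct // bucket_pct * bucket_pct, 100 - bucket_pct)
        buckets[lower].append(was_correct)
    return {
        lower: (len(outcomes), sum(outcomes) / len(outcomes))
        for lower, outcomes in sorted(buckets.items())
    }


# Example: six reviewed decisions taken from the logs (made-up data).
logged = [(0.95, True), (0.90, True), (0.85, False),
          (0.55, True), (0.50, False), (1.00, True)]
table = calibration_table(logged)
```

If the observed accuracy in a bucket sits well below the confidence values that define it, the scores are overconfident in that range, which is one signal that the threshold for that agent type may need to be raised.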
