Tool Selection Quality Testing for ADK Agents
Author: Venkata Sudhakar
Tool selection quality testing verifies that an ADK agent calls the correct tool for a given user intent, not merely that the tool returns the right result. ShopMax India's customer service agent has tools for order status, product search, and return initiation; a wrong tool selection (for example, initiating a return when the customer asked about delivery) causes real harm and must be caught in testing before it reaches customers in Delhi and Hyderabad.
The testing pattern instruments the agent's tool dispatch layer to record which tool was called for each input, then compares the recorded tool name against an expected tool name defined in a golden dataset. A ToolCallRecorder wraps each tool function and appends the call to a shared list. After the agent run, the test asserts that the recorded tool matches the expected one. This separates tool selection correctness from tool output correctness, making failures easier to diagnose.
The example below defines a ToolCallRecorder, wraps three ShopMax India tools, runs four test cases from a golden dataset, and asserts the correct tool was selected for each user intent.
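A minimal sketch of that test is shown here. The real ADK agent's dispatch layer is stood in for by a keyword-based `run_agent()`; the tool bodies, routing rules, and golden dataset entries are illustrative assumptions, not ShopMax India's actual implementation.

```python
# Three ShopMax India tools (bodies are illustrative stubs).
def get_order_status(order_id: str) -> str:
    return f"Order {order_id} is out for delivery"

def search_products(query: str) -> str:
    return f"Results for '{query}'"

def initiate_return(order_id: str) -> str:
    return f"Return initiated for {order_id}"

class ToolCallRecorder:
    """Wraps tool functions so each call records its tool name."""
    def __init__(self) -> None:
        self.calls: list[str] = []

    def wrap(self, fn):
        def wrapper(*args, **kwargs):
            self.calls.append(fn.__name__)  # record before delegating
            return fn(*args, **kwargs)
        wrapper.__name__ = fn.__name__
        return wrapper

recorder = ToolCallRecorder()
TOOLS = {
    "get_order_status": recorder.wrap(get_order_status),
    "search_products": recorder.wrap(search_products),
    "initiate_return": recorder.wrap(initiate_return),
}

def run_agent(user_input: str) -> str:
    """Stand-in for the agent's dispatch layer: naive keyword routing,
    used here only so the example is self-contained and runnable."""
    text = user_input.lower()
    if "return" in text:
        return TOOLS["initiate_return"]("ORD-42")
    if "order" in text or "delivery" in text:
        return TOOLS["get_order_status"]("ORD-42")
    return TOOLS["search_products"](user_input)

# Golden dataset: (intent label, user input, expected tool name).
GOLDEN = [
    ("order_status", "Where is my order?", "get_order_status"),
    ("search", "Show me wireless earbuds", "search_products"),
    ("return", "I want to return my kettle", "initiate_return"),
    ("order_status", "Has my delivery shipped yet?", "get_order_status"),
]

for intent, user_input, expected_tool in GOLDEN:
    recorder.calls.clear()  # isolate each test case
    run_agent(user_input)
    actual = recorder.calls[-1] if recorder.calls else None
    status = "OK" if actual == expected_tool else "FAIL"
    print(f"Intent={intent} -> tool={actual} {status}")
```

In a real suite each golden entry would become a parametrized test case so failures report individually, but the recording mechanism stays the same.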
It gives the following output:
Intent=order_status -> tool=get_order_status OK
Intent=search -> tool=search_products OK
Intent=return -> tool=initiate_return OK
Intent=order_status -> tool=get_order_status OK
4 passed in 0.07s
Extend the golden dataset to cover edge cases such as ambiguous intents where multiple tools could plausibly match. Track tool selection accuracy as a percentage metric in CI reports; a drop from 100% to 95% across a release is a signal worth investigating. For multi-turn conversations, record the full tool call sequence and assert that it matches the expected flow, not just the final call.
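The multi-turn check reuses the same recording idea. This sketch hand-simulates a two-turn conversation in place of a real agent run; the turns and tool names are illustrative assumptions.

```python
# Assert the ordered tool call sequence for a conversation,
# not just the final call.
recorded: list[str] = []

def record_call(tool_name: str) -> None:
    """Append each dispatched tool name; in a real run this happens
    inside the recorder wrapper around each tool."""
    recorded.append(tool_name)

# Simulated two-turn conversation.
record_call("get_order_status")  # turn 1: "Where is my order?"
record_call("initiate_return")   # turn 2: "It's damaged, return it."

expected = ["get_order_status", "initiate_return"]
assert recorded == expected, f"expected {expected}, got {recorded}"
print("sequence OK:", " -> ".join(recorded))
```

Comparing whole sequences catches ordering bugs, such as an agent that initiates a return before checking order status, that a final-call assertion would miss.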