Token Budget Testing for ADK Agents - Measuring Input and Output Tokens
Author: Venkata Sudhakar
ShopMax India pays for every token processed by their ADK agents. Without token budget tests, a prompt change that adds 200 tokens to every response increases costs silently. Token budget testing measures input and output tokens per query type, sets per-query budgets, and fails the CI build when an agent exceeds its token allowance - giving the engineering team cost visibility before a change reaches production.
The google-generativeai SDK returns usage_metadata on every response, with prompt_token_count and candidates_token_count fields reporting input and output tokens. Token budget tests mock the LLM response with fake usage metadata attached, then assert that the counts fall within defined limits. Budgets differ per query type: a simple order status query should use under 500 input tokens and 150 output tokens, while a complex returns policy query may be budgeted 800 input and 300 output tokens.
The example shows ShopMax India defining token budgets per query category and asserting them in pytest. The LLM is mocked with a fake usage_metadata object so tests run without real API calls.
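A minimal sketch of such a test module. The helper names (fake_agent_call, check_token_budget), the budget values, and the mocked token counts are illustrative assumptions; a real suite would patch the agent's LLM client (e.g. with unittest.mock) and run each check as a pytest-parametrized test case:

```python
# Hypothetical token budget tests for ShopMax India's agents.
# No real API calls: fake_agent_call stands in for the ADK agent,
# and SimpleNamespace mimics the SDK's usage_metadata object.
from types import SimpleNamespace

# Per-query-type budgets (input, output), set from a baseline run.
TOKEN_BUDGETS = {
    "order_status":   {"input": 500, "output": 150},
    "stock_check":    {"input": 500, "output": 150},
    "returns_policy": {"input": 800, "output": 300},
}

# Mocked usage metadata per category, mirroring the SDK's
# prompt_token_count / candidates_token_count fields.
MOCK_USAGE = {
    "order_status":   SimpleNamespace(prompt_token_count=320, candidates_token_count=85),
    "stock_check":    SimpleNamespace(prompt_token_count=280, candidates_token_count=60),
    "returns_policy": SimpleNamespace(prompt_token_count=650, candidates_token_count=210),
}

def fake_agent_call(category):
    """Stand-in for the real agent; returns a response-like object."""
    return SimpleNamespace(usage_metadata=MOCK_USAGE[category])

def check_token_budget(category):
    """Fail if the query category exceeds its input or output budget."""
    usage = fake_agent_call(category).usage_metadata
    budget = TOKEN_BUDGETS[category]
    assert usage.prompt_token_count <= budget["input"], (
        f"{category}: input {usage.prompt_token_count} "
        f"exceeds budget {budget['input']}")
    assert usage.candidates_token_count <= budget["output"], (
        f"{category}: output {usage.candidates_token_count} "
        f"exceeds budget {budget['output']}")
    print(f"{category} - input: {usage.prompt_token_count}, "
          f"output: {usage.candidates_token_count}")

if __name__ == "__main__":
    for category in TOKEN_BUDGETS:
        check_token_budget(category)
```

Under pytest the three checks would typically be one parametrized test, so a single over-budget category fails the CI build with a message naming the category and the overage.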
Running the tests prints the measured token counts for each category:
order_status - input: 320, output: 85
stock_check - input: 280, output: 60
returns_policy - input: 650, output: 210
Set token budgets based on a baseline measurement run on 100 real queries before introducing budget tests. Allow 20% headroom above the baseline to avoid flaky failures from natural LLM variability. Track token budget trends over time in your CI dashboard - a gradual increase in average tokens signals prompt bloat. When a budget test fails, diff the old and new system prompt to identify which additions are causing the token increase, then trim or consolidate them.
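The baseline-plus-headroom rule above can be sketched as a small helper. The function name and sample numbers are hypothetical; integer arithmetic keeps the result deterministic:

```python
# Hypothetical helper: derive a token budget from a baseline run.
# Budget = peak observed count plus a headroom percentage (default 20%).
def budget_from_baseline(samples, headroom_pct=20):
    """samples: token counts from ~100 real queries of one category."""
    baseline = max(samples)  # or a p95 to ignore rare outliers
    return baseline + baseline * headroom_pct // 100

# e.g. order_status input counts peaking at 410 tokens
print(budget_from_baseline([300, 410, 395]))  # prints 492
```

Using the peak (or a p95) rather than the mean keeps natural LLM variability from tripping the budget on a normal run.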