Token Budget Testing for ADK Agents - Measuring Input and Output Tokens
Author: Venkata Sudhakar
ShopMax India pays for every token processed by their ADK agents. Without token budget tests, a prompt change that adds 200 tokens to every response increases costs silently. Token budget testing measures input and output tokens per query type, sets per-query budgets, and fails the CI build when an agent exceeds its token allowance - giving the engineering team cost visibility before a change reaches production.
The google-generativeai SDK returns usage_metadata on every response, with prompt_token_count and candidates_token_count fields reporting input and output tokens. Token budget tests mock the LLM response with fake usage metadata attached, then assert that the counts fall within defined limits. Budgets differ per query type: a simple order status query should use under 500 input tokens and 150 output tokens, while a complex returns policy query may be budgeted 800 input and 300 output tokens.
The example shows ShopMax India defining token budgets per query category and asserting them in pytest. The LLM is mocked with a fake usage_metadata object so tests run without real API calls.
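A minimal sketch of such a test module. The helper names (fake_agent_call, check_token_budget), the budget values, and the mocked token counts are illustrative assumptions; a real suite would patch the agent's LLM client (e.g. with unittest.mock) and run each check as a pytest-parametrized test case:

```python
# Hypothetical token budget tests for ShopMax India's agents.
# No real API calls: fake_agent_call stands in for the ADK agent,
# and SimpleNamespace mimics the SDK's usage_metadata object.
from types import SimpleNamespace

# Per-query-type budgets (input, output), set from a baseline run.
TOKEN_BUDGETS = {
    "order_status":   {"input": 500, "output": 150},
    "stock_check":    {"input": 500, "output": 150},
    "returns_policy": {"input": 800, "output": 300},
}

# Mocked usage metadata per category, mirroring the SDK's
# prompt_token_count / candidates_token_count fields.
MOCK_USAGE = {
    "order_status":   SimpleNamespace(prompt_token_count=320, candidates_token_count=85),
    "stock_check":    SimpleNamespace(prompt_token_count=280, candidates_token_count=60),
    "returns_policy": SimpleNamespace(prompt_token_count=650, candidates_token_count=210),
}

def fake_agent_call(category):
    """Stand-in for the real agent; returns a response-like object."""
    return SimpleNamespace(usage_metadata=MOCK_USAGE[category])

def check_token_budget(category):
    """Fail if the query category exceeds its input or output budget."""
    usage = fake_agent_call(category).usage_metadata
    budget = TOKEN_BUDGETS[category]
    assert usage.prompt_token_count <= budget["input"], (
        f"{category}: input {usage.prompt_token_count} "
        f"exceeds budget {budget['input']}")
    assert usage.candidates_token_count <= budget["output"], (
        f"{category}: output {usage.candidates_token_count} "
        f"exceeds budget {budget['output']}")
    print(f"{category} - input: {usage.prompt_token_count}, "
          f"output: {usage.candidates_token_count}")

if __name__ == "__main__":
    for category in TOKEN_BUDGETS:
        check_token_budget(category)
```

Under pytest the three checks would typically be one parametrized test, so a single over-budget category fails the CI build with a message naming the category and the overage.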
Running the tests prints the measured token counts for each category:
order_status - input: 320, output: 85
stock_check - input: 280, output: 60
returns_policy - input: 650, output: 210
Set token budgets based on a baseline measurement run on 100 real queries before introducing budget tests. Allow 20% headroom above the baseline to avoid flaky failures from natural LLM variability. Track token budget trends over time in your CI dashboard - a gradual increase in average tokens signals prompt bloat. When a budget test fails, diff the old and new system prompt to identify which additions are causing the token increase, then trim or consolidate them.
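The baseline-plus-headroom rule above can be sketched as a small helper. The function name and sample numbers are hypothetical; integer arithmetic keeps the result deterministic:

```python
# Hypothetical helper: derive a token budget from a baseline run.
# Budget = peak observed count plus a headroom percentage (default 20%).
def budget_from_baseline(samples, headroom_pct=20):
    """samples: token counts from ~100 real queries of one category."""
    baseline = max(samples)  # or a p95 to ignore rare outliers
    return baseline + baseline * headroom_pct // 100

# e.g. order_status input counts peaking at 410 tokens
print(budget_from_baseline([300, 410, 395]))  # prints 492
```

Using the peak (or a p95) rather than the mean keeps natural LLM variability from tripping the budget on a normal run.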