Token Efficiency Testing for ADK Agents - Maximizing Quality Per Token

Author: Venkata Sudhakar

ShopMax India often maintains two or more versions of the same agent prompt - a detailed version that retrieves more facts but costs more tokens, and a concise version that saves on API spend but may miss key details. Token efficiency testing measures the quality-to-token ratio so the team can pick the version that delivers the most value per rupee spent on the API.

Quality score is calculated as the fraction of expected keywords found in the response, matched case-insensitively. Efficiency ratio divides the quality score by the total tokens used (input plus output) and multiplies the result by 1000, so it reads as quality per 1,000 tokens. By comparing efficiency ratios across prompt versions, teams can identify which version gives the best balance of accuracy and cost before deploying to production.
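
The two definitions can be written compactly as a formula. The symbols Q, E, K and T are labels chosen here for illustration and do not appear in the code that follows.

```latex
Q = \frac{\left|\{\, k \in K : k \text{ appears in the response} \,\}\right|}{|K|},
\qquad
E = \frac{Q}{T_{\text{in}} + T_{\text{out}}} \times 1000
```

Here K is the set of expected keywords and T_in, T_out are the input and output token counts. As a quick check of the arithmetic, a response that contains 3 of 4 expected keywords and uses 500 tokens in total scores Q = 0.75 and E = 0.75 / 500 x 1000 = 1.5.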

The example below compares two versions of the order tracking prompt using hard-coded sample responses and token counts. It asserts that version A has a higher efficiency ratio than version B and that version A meets the minimum quality threshold of 0.75.

import pytest

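def quality_score(response, expected_keywords):
    # Fraction of expected keywords present in the response (case-insensitive).
    if not expected_keywords:
        return 0.0  # avoid ZeroDivisionError when no keywords are defined
    text = response.lower()
    found = sum(1 for kw in expected_keywords if kw.lower() in text)
    return found / len(expected_keywords)

def efficiency_ratio(quality, total_tokens):
    # Quality per 1,000 tokens; 0.0 when no tokens were used.
    if total_tokens == 0:
        return 0.0
    return quality / total_tokens * 1000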

PROMPT_VERSIONS = {
    "version_a": {
        "response": "Order ORD-7821 from Mumbai is shipped. Delivery by 26 Apr. Tracking: IN-9934-BLR.",
        "input_tokens": 310,
        "output_tokens": 45,
        "expected_keywords": ["ORD-7821", "shipped", "delivery", "tracking"],
    },
    "version_b": {
        "response": "Your order has been processed and is on its way.",
        "input_tokens": 420,
        "output_tokens": 35,
        "expected_keywords": ["ORD-7821", "shipped", "delivery", "tracking"],
    },
}

def test_token_efficiency_comparison():
    results = {}
    for name, data in PROMPT_VERSIONS.items():
        q = quality_score(data["response"], data["expected_keywords"])
        total = data["input_tokens"] + data["output_tokens"]
        ratio = efficiency_ratio(q, total)
        results[name] = {"quality": q, "total_tokens": total, "efficiency": ratio}
        print(f"{name} - quality: {q}, tokens: {total}, efficiency: {round(ratio, 4)}")
    assert results["version_a"]["efficiency"] > results["version_b"]["efficiency"], (
        "version_a should be more efficient than version_b"
    )

def test_minimum_quality_threshold():
    # Only version_a is held to the 0.75 quality bar; version_b is not checked here.
    data = PROMPT_VERSIONS["version_a"]
    q = quality_score(data["response"], data["expected_keywords"])
    assert q >= 0.75, f"version_a quality {q} below 0.75 threshold"

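Running the tests with pytest -q -s (the -s flag disables output capture so the print lines are visible) gives output along the following lines: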

version_a - quality: 1.0, tokens: 355, efficiency: 2.8169
version_b - quality: 0.0, tokens: 455, efficiency: 0.0
.. (2 passed in 0.01s)

In production, collect expected_keywords from a golden dataset reviewed by the product team. Automate the efficiency comparison as a pre-merge check so that prompt rewrites that degrade quality-per-token are caught before reaching customers in Chennai, Hyderabad, or Delhi. Track the ratio over time to detect drift in prompt performance as the underlying model is updated.
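
A pre-merge check of this kind could look roughly like the sketch below. It is one possible shape, not the article's own implementation: the names GOLDEN_CASES, summarize, premerge_check and max_drop, the dataset layout, and the candidate response with its 320-token count are all assumptions made for illustration. In a real pipeline the cases would be loaded from the reviewed golden dataset rather than defined inline.

```python
def quality_score(response, expected_keywords):
    # Fraction of expected keywords found (case-insensitive substring match).
    if not expected_keywords:
        return 0.0
    text = response.lower()
    return sum(1 for kw in expected_keywords if kw.lower() in text) / len(expected_keywords)

def efficiency_ratio(quality, total_tokens):
    # Quality per 1,000 tokens; 0.0 when no tokens were used.
    return quality / total_tokens * 1000 if total_tokens else 0.0

# Hypothetical golden dataset: each case holds the expected keywords plus a
# recorded response and token count for the baseline and candidate prompts.
GOLDEN_CASES = [
    {
        "expected_keywords": ["ORD-7821", "shipped", "delivery", "tracking"],
        "baseline": {
            "response": "Order ORD-7821 from Mumbai is shipped. Delivery by 26 Apr. Tracking: IN-9934-BLR.",
            "total_tokens": 355,
        },
        "candidate": {  # illustrative values, not measured
            "response": "ORD-7821 shipped. Delivery 26 Apr. Tracking IN-9934-BLR.",
            "total_tokens": 320,
        },
    },
]

def summarize(cases, version):
    # Mean quality and mean efficiency of one prompt version across all cases.
    if not cases:
        raise ValueError("golden dataset is empty")
    qualities, ratios = [], []
    for case in cases:
        run = case[version]
        q = quality_score(run["response"], case["expected_keywords"])
        qualities.append(q)
        ratios.append(efficiency_ratio(q, run["total_tokens"]))
    return sum(qualities) / len(cases), sum(ratios) / len(cases)

def premerge_check(cases, min_quality=0.75, max_drop=0.05):
    # Returns a list of failure messages; an empty list means the candidate passes.
    _, base_eff = summarize(cases, "baseline")
    cand_q, cand_eff = summarize(cases, "candidate")
    failures = []
    if cand_q < min_quality:
        failures.append(f"candidate quality {cand_q:.2f} is below {min_quality}")
    if cand_eff < base_eff * (1 - max_drop):
        failures.append(
            f"candidate efficiency {cand_eff:.4f} dropped more than "
            f"{max_drop:.0%} below baseline {base_eff:.4f}"
        )
    return failures

if __name__ == "__main__":
    problems = premerge_check(GOLDEN_CASES)
    print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```

The max_drop tolerance is a design choice: it lets small, noisy fluctuations through while still blocking a rewrite that clearly costs more tokens per unit of quality. Logging the values returned by summarize on each run would also give the history needed to watch the ratio over time.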
