
ADK Agent Test Strategy - Automation vs Manual Testing

Author: Venkata Sudhakar

ShopMax India's engineering team frequently asks: can all ADK agent testing be automated, or is manual review always required? The answer depends on what you are testing. Automation excels at structural correctness, regression detection, performance benchmarks, and high-volume coverage. Manual testing remains essential for tone and empathy assessment, novel edge cases outside the training distribution, cultural sensitivity, and evaluating whether an agent response feels right to a real customer.

Automated testing covers four layers: unit tests for individual tools, integration tests for agent pipelines with mocked LLMs, regression tests using golden datasets with LLM-as-judge scoring, and contract tests for output schema validation. These catch 80-90% of regressions and run in CI pipelines within minutes. Manual testing fills the gap: human reviewers sample 50-100 live responses per week, focus on low-confidence outputs, and test scenarios that require common sense or cultural awareness that automated metrics cannot capture.
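The contract-test layer can be sketched as a small schema validator. The field names, types, and range check below are illustrative assumptions, not ADK's actual output schema:

```python
# Hypothetical contract test for agent output. REQUIRED_FIELDS is an
# assumed schema, not the real ADK response format.
REQUIRED_FIELDS = {"intent": str, "confidence": float, "reply": str}

def validate_contract(response: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the contract holds."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in response:
            errors.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    # Range check only runs once the basic shape is valid.
    if not errors and not (0.0 <= response["confidence"] <= 1.0):
        errors.append("confidence out of range [0, 1]")
    return errors

print(validate_contract({"intent": "track_order", "confidence": 0.92,
                         "reply": "Your order is on the way."}))  # -> []
```

A check like this runs in milliseconds per response, which is what makes schema validation cheap enough to gate every CI build.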

The example below shows ShopMax India's test strategy as a triage function. It routes each validation to the appropriate method - automated checks or a manual review queue - based on confidence score, query category, and keyword signals.
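A minimal sketch of such a triage function follows. The confidence threshold, keyword lists, and category names are illustrative assumptions, not ShopMax India's actual configuration:

```python
# Illustrative triage rules - thresholds and keywords are assumptions.
MANUAL_KEYWORDS = {"damaged", "broken", "refund"}
SENSITIVE_KEYWORDS = {"fraud", "legal", "chargeback"}
CONFIDENCE_THRESHOLD = 0.7

def triage(query: str, confidence: float, category: str) -> tuple[str, str]:
    """Route a validation to automated checks, a manual review queue, or both."""
    text = query.lower()
    # Sensitive topics get automated checks plus mandatory human sign-off.
    if any(kw in text for kw in SENSITIVE_KEYWORDS):
        return "both", "PASS"
    # Low confidence or complaint signals go straight to human reviewers.
    if (confidence < CONFIDENCE_THRESHOLD
            or category == "complaints"
            or any(kw in text for kw in MANUAL_KEYWORDS)):
        return "manual_review", "QUEUED_FOR_REVIEW"
    # Everything else stays in the automated pipeline.
    return "automated", "PASS"

samples = [
    ("Track order ORD-7821", 0.95, "order_tracking"),
    ("I want to file a fraud complaint", 0.80, "complaints"),
    ("Is Galaxy S24 in stock?", 0.90, "inventory"),
    ("My order arrived damaged", 0.60, "complaints"),
]
for query, conf, cat in samples:
    route, status = triage(query, conf, cat)
    print(f"{query:<42} -> {route} [{status}]")
```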


Running it produces the following output:

Track order ORD-7821                       -> automated [PASS]
I want to file a fraud complaint           -> both [PASS]
Is Galaxy S24 in stock?                    -> automated [PASS]
My order arrived damaged                   -> manual_review [QUEUED_FOR_REVIEW]

Set a weekly manual review quota - 50 responses is enough to catch systematic failures that automated tests miss. Prioritize low-confidence responses, new feature areas, and complaint categories for manual review. When a manual reviewer finds a failure, add it to the golden regression set immediately so the same issue never passes undetected again. Automated testing is not a replacement for manual review - it is a force multiplier that lets your team focus human attention where it matters most. Track the ratio of automated-to-manual catches over time: if manual review consistently finds things automation misses, invest in better automated metrics for that category.
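Promoting a manually-found failure into the golden regression set can be as simple as appending a record to a dataset file. The file path and record fields below are hypothetical, chosen only to illustrate the workflow:

```python
import json
from pathlib import Path

# Hypothetical golden regression set stored as JSONL; the path and
# record shape are assumptions, not an ADK convention.
GOLDEN_SET = Path("golden_regression.jsonl")

def add_to_golden_set(query: str, expected: str, notes: str = "") -> dict:
    """Append a reviewer-found failure so automated runs catch it from now on."""
    record = {"query": query, "expected": expected, "notes": notes}
    with GOLDEN_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

add_to_golden_set(
    "My order arrived damaged",
    "empathetic apology plus replacement or refund offer",
    notes="caught in weekly manual review",
)
```

Appending immediately, rather than batching, keeps the window in which the same failure can slip through undetected as short as possible.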


 
  


  