
ADK Agent Evaluation

Author: Venkata Sudhakar

Shipping an AI agent to production without evaluating its quality is risky. ADK includes an evaluation framework for defining test cases, running your agent against them, and scoring results automatically. You can evaluate final response quality and trajectory - did the agent call the right tools in the right order? Running evaluations before every deployment catches regressions when you change system instructions, add tools, or switch model versions. A systematic eval suite is the quality gate between development and production.

ADK evaluation uses EvalCase objects with an input, expected tool calls, and expected response keywords. AgentEvaluator scores tool call accuracy and response correctness. For subjective quality, use LLM-as-judge: a second Gemini call rates the response on helpfulness and accuracy. The evaluation produces a structured report showing per-case pass/fail, an overall score, and specific failure details - giving you actionable information to fix before deploying.
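To make the scoring idea concrete, here is a minimal sketch of an eval case and a deterministic scorer. This is illustrative only - the `EvalCase` dataclass and `score_case` helper below are hypothetical stand-ins written for this article, not the actual ADK classes:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    # Hypothetical eval-case shape: an input, the tool calls we
    # expect the agent to make, and keywords the final response
    # must contain.
    name: str
    user_input: str
    expected_tools: list = field(default_factory=list)
    expected_keywords: list = field(default_factory=list)

def score_case(case, actual_tools, actual_response):
    """Score tool-call accuracy and response-keyword accuracy (0.0 to 1.0)."""
    tool_hits = sum(1 for t in case.expected_tools if t in actual_tools)
    tool_acc = tool_hits / max(len(case.expected_tools), 1)
    kw_hits = [k for k in case.expected_keywords
               if k.lower() in actual_response.lower()]
    resp_acc = len(kw_hits) / max(len(case.expected_keywords), 1)
    missing = [k for k in case.expected_keywords if k not in kw_hits]
    return tool_acc, resp_acc, missing
```

With this shape, a response that says "only available in Bangalore" against expected keywords `["not available", "Bangalore"]` scores perfect tool accuracy but only 0.5 response accuracy, with `"not available"` reported as missing - exactly the kind of detail a useful report surfaces.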

The example below builds an evaluation suite for the ShopMax customer service agent - 5 test cases covering order tracking, availability checks, out-of-stock scenarios, and error handling - culminating in a final quality score and a deployment recommendation.


Next, we run the evaluation and analyse the quality report.
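As a hedged sketch of what such a runner might look like, the function below walks a list of case dicts through the agent and prints a pass/fail report in the style shown further down. The `run_agent` callable, the case-dict keys, and the 0.8 pass thresholds are all assumptions made for illustration, not the ADK API:

```python
def evaluate(cases, run_agent, resp_threshold=0.8):
    """cases: dicts with 'name', 'input', 'expected_tools', 'expected_keywords'.
    run_agent: callable taking an input string and returning
    (tool_calls, response_text) from the agent under test."""
    print("=== AGENT EVALUATION REPORT ===")
    passed = 0
    for c in cases:
        tools, response = run_agent(c["input"])
        # Tool accuracy: exact match of the expected tool-call sequence.
        tool_acc = 1.0 if tools == c["expected_tools"] else 0.0
        hits = [k for k in c["expected_keywords"]
                if k.lower() in response.lower()]
        resp_acc = len(hits) / max(len(c["expected_keywords"]), 1)
        ok = tool_acc == 1.0 and resp_acc >= resp_threshold
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'} | {c['name']}")
        if not ok:
            missing = [k for k in c["expected_keywords"] if k not in hits]
            print(f"  Tool accuracy:      {tool_acc}")
            print(f"  Response accuracy:  {resp_acc}")
            print(f"  Missing in response: {missing}")
    score = passed / len(cases)
    print(f"\nOverall score: {score:.0%} ({passed}/{len(cases)} passed)")
    print("Recommendation:", "DEPLOY" if score >= 0.8 else "FIX BEFORE DEPLOY")
    return score
```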


It gives the following evaluation report:

=== AGENT EVALUATION REPORT ===
PASS | active_order
PASS | delivered_order
PASS | product_in_stock
FAIL | product_out_of_stock
  Tool accuracy:      1.0
  Response accuracy:  0.4
  Missing in response: ["not available"]
PASS | invalid_order

Overall score: 80 percent (4/5 passed)
Recommendation: DEPLOY

# product_out_of_stock failed: agent said "only available in Bangalore"
# but did not explicitly say "not available in Delhi"
# Fix: update instruction to say "always state explicitly which cities have NO stock"
# Re-run eval - if 5/5 pass, deploy with confidence

Evaluation best practices: write 20-30 eval cases before launch covering happy path, edge cases, error handling, and adversarial inputs (questions outside scope). Run evals in your CI/CD pipeline so every git push to main triggers an evaluation and blocks deployment if the score drops below your threshold. When an eval case fails, fix the agent and re-run - do not remove the failing test case. Build your eval dataset from real production queries once you launch: logs of actual customer questions become your most realistic test cases. A score of 85 percent or higher on a well-designed eval set gives you strong confidence in production quality.
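The CI/CD gate described above can be sketched as a small Python check; the 85 percent threshold matches the guidance in this article, and the function name is an illustrative choice, not a standard API:

```python
def deployment_gate(score, threshold=0.85):
    """Return True when the eval score clears the deployment threshold.
    In a CI step, wire this to the pipeline exit code, e.g.
    sys.exit(0 if deployment_gate(score) else 1), so a low score
    fails the step and blocks deployment."""
    print(f"Eval score: {score:.0%} (threshold {threshold:.0%})")
    return score >= threshold
```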
