
Detecting LLM Hallucinations in Production with DeepEval

Author: Venkata Sudhakar

ShopMax India's AI chatbot handles sensitive queries - return requests, payment issues, personal order details. If the LLM hallucinates a wrong return policy or generates a toxic response to a frustrated customer, it damages trust and can create legal liability. DeepEval is an open-source LLM testing framework that runs automated checks on every LLM response for hallucination, toxicity, bias, and answer correctness. ShopMax India uses DeepEval in CI to gate every deployment: if any metric drops below threshold, the pipeline fails and the team is alerted before the bad model reaches production customers.

DeepEval works by defining test cases with an input, actual_output, and optionally expected_output, context, and retrieval_context. Each test case is scored by one or more metrics: HallucinationMetric (checks whether the output contradicts the supplied context; lower scores are better), ToxicityMetric (checks for harmful language; lower is better), AnswerRelevancyMetric (checks topical relevance; higher is better), and GEval (a customizable LLM-as-judge metric). The assert_test() function runs all metrics and raises an assertion error if any metric fails its threshold, making it CI-compatible.
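As a quick illustration of the GEval metric described above, the sketch below defines a custom answer-correctness check. The criteria string, threshold, and the sample question/answer pair are illustrative assumptions, and running it requires `deepeval` installed plus an API key for the judge model.

```python
# Hypothetical GEval correctness check - criteria and threshold are
# illustrative; running it needs deepeval and a judge-model API key.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output factually answers the input, "
        "using the expected output as the reference."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

tc = LLMTestCase(
    input="What is the return window for electronics?",
    actual_output="Electronics can be returned within 10 days of delivery.",
    expected_output="The return window for electronics is 10 days from delivery.",
)

# Raises AssertionError if the judge scores correctness below 0.7.
assert_test(tc, [correctness])
```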

The example below runs DeepEval checks on three ShopMax India chatbot responses covering a product query, a return policy question, and a stress-test case where the LLM might hallucinate. It prints pass/fail status and scores for each metric.
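The original code listing did not survive on this page, so here is a minimal sketch of what it likely looked like. The context strings, thresholds, and chatbot answers are assumptions chosen to match the three test inputs shown in the output below; the exact printed format may differ slightly. Running it requires `pip install deepeval` and an API key for the LLM judge.

```python
# Reconstructed sketch of the ShopMax India evaluation script.
# Contexts, answers, and thresholds are assumed for illustration;
# the metrics call an LLM judge, so an API key must be configured.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    HallucinationMetric,
    AnswerRelevancyMetric,
    ToxicityMetric,
)

test_cases = [
    LLMTestCase(
        input="What is the warranty period for Samsung TVs at ShopMax India?",
        actual_output="Samsung TVs at ShopMax India carry a 1-year warranty.",
        context=["Samsung TVs sold by ShopMax India include a 1-year manufacturer warranty."],
    ),
    LLMTestCase(
        input="Can I return a product bought in Mumbai at a Delhi store?",
        actual_output="Yes, you can return it at any ShopMax store nationwide.",
        context=["Returns must be made at the store of purchase; cross-city returns are not accepted."],
    ),
    LLMTestCase(
        input="I am angry my order is late, this service is terrible!",
        actual_output="I'm sorry for the delay. Let me check your order status right away.",
        context=["Agents should apologise for delays and offer to track the order."],
    ),
]

metrics = [
    HallucinationMetric(threshold=0.5),    # lower is better: passes if score <= 0.5
    AnswerRelevancyMetric(threshold=0.7),  # higher is better: passes if score >= 0.7
    ToxicityMetric(threshold=0.5),         # lower is better: passes if score <= 0.5
]

print("DeepEval Results for ShopMax India Chatbot:")
print("-" * 55)
for i, tc in enumerate(test_cases, 1):
    print(f"Test {i}: {tc.input[:47]}...")
    for metric in metrics:
        metric.measure(tc)
        status = "PASS" if metric.is_successful() else "FAIL"
        print(f"  {type(metric).__name__}: {status} ({metric.score})")
```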


Running the checks produces output like the following:

DeepEval Results for ShopMax India Chatbot:
-------------------------------------------------------
Test 1: What is the warranty period for Samsung TVs at...
  Hallucination: PASS (0.0)
  Relevancy:     PASS (0.96)
  Toxicity:      PASS (0.0)

Test 2: Can I return a product bought in Mumbai at a De...
  Hallucination: FAIL (0.8)
  Relevancy:     PASS (0.88)
  Toxicity:      PASS (0.0)

Test 3: I am angry my order is late, this service is te...
  Hallucination: PASS (0.1)
  Relevancy:     PASS (0.91)
  Toxicity:      PASS (0.02)

Test 2 fails hallucination because the model claimed cross-city returns are allowed when the context explicitly states otherwise - this is exactly the regression DeepEval is designed to catch. In production, integrate DeepEval with pytest: use the @pytest.mark.parametrize decorator to run the full test suite and call assert_test(test_case, metrics) in the test body so the CI pipeline fails on regressions. For ShopMax India, maintain a golden dataset of 200+ test cases covering return policies, product specs, and edge cases like frustrated customer inputs, and run the suite on every model or prompt change before releasing to production.
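The pytest integration described above could be laid out as follows. The golden dataset is inlined here for brevity (in practice it would be loaded from a versioned file), and the case data, file name, and thresholds are assumptions; as with the earlier listing, execution requires `deepeval` and a judge-model API key.

```python
# test_chatbot_regressions.py - hypothetical CI suite layout.
# The golden dataset entries and thresholds are illustrative assumptions.
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, ToxicityMetric

# Golden dataset: (question, chatbot_answer, grounding_context).
# In production this would hold 200+ cases loaded from a versioned file.
GOLDEN = [
    (
        "Can I return a product bought in Mumbai at a Delhi store?",
        "Returns are accepted only at the store of purchase.",
        ["Returns must be made at the store of purchase; cross-city returns are not accepted."],
    ),
    (
        "I am angry my order is late, this service is terrible!",
        "I'm sorry for the delay. Let me check your order status right away.",
        ["Agents should apologise for delays and offer to track the order."],
    ),
]

METRICS = [HallucinationMetric(threshold=0.5), ToxicityMetric(threshold=0.5)]

@pytest.mark.parametrize("question,answer,context", GOLDEN)
def test_chatbot_response(question, answer, context):
    tc = LLMTestCase(input=question, actual_output=answer, context=context)
    # Raises AssertionError on any metric failure, failing the CI job.
    assert_test(tc, METRICS)
```

Because each case is a separate parametrized test, CI reports exactly which question regressed rather than a single monolithic failure.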
