Real-Time LLM Quality Scoring with Custom Metrics in Python
Author: Venkata Sudhakar
ShopMax India's order tracking agent sometimes produces vague or off-topic responses that frustrate customers before a human supervisor can intervene. Rather than relying on post-hoc evaluation, a real-time quality scorer can assess each response before it is delivered, flagging low-confidence answers for human review or triggering an automatic retry. This tutorial shows how to build a lightweight quality scoring pipeline using rule-based checks and an LLM-as-judge pattern.
The scoring pipeline runs three checks on every response: a keyword relevance check (does the response mention key entities from the question?), a length check (is the response too short to be useful or too long to be readable?), and an LLM-as-judge score (a second LLM call that rates the response on a 1-5 scale). A composite score decides whether to deliver, retry, or escalate to a human agent. The overhead is one small LLM call per response, adding roughly 100-150ms but preventing bad responses from reaching customers.
The example below shows the quality scorer for ShopMax India order queries. Responses scoring below 0.6 trigger a retry or escalation to a human support agent.
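The sketch below wires the three checks together. The function names, the entity list, the 40-120 word length band, and the weights (keyword 0.25, length 0.15, judge 0.60) are illustrative, and the judge is mocked with a fixed 2/5 rating so the example runs offline; a real deployment would plug in the second LLM call instead:

```python
from typing import Callable, Optional

def keyword_score(response: str, keywords: list[str]) -> float:
    """Fraction of key entities from the question that appear in the response."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0

def length_score(response: str, min_words: int = 40, max_words: int = 120) -> float:
    """Penalize responses too short to be useful or too long to be readable."""
    n = len(response.split())
    if n < min_words:
        return n / min_words
    if n > max_words:
        return max(0.0, 1.0 - (n - max_words) / max_words)
    return 1.0

def score_response(response: str, keywords: list[str],
                   judge: Callable[[str], float],
                   weights: Optional[dict[str, float]] = None):
    """Blend the three checks; below 0.6 the response is held back."""
    weights = weights or {"keyword": 0.25, "length": 0.15, "judge": 0.60}
    scores = {
        "keyword": keyword_score(response, keywords),
        "length": length_score(response),
        "judge": judge(response),
    }
    composite = sum(weights[k] * scores[k] for k in weights)
    action = "DELIVER" if composite >= 0.6 else "RETRY or ESCALATE"
    return composite, action, scores

if __name__ == "__main__":
    # A vague reply to "Where is order ORD-12345? Share tracking and delivery date."
    reply = "Your order is processing, please check back later."
    entities = ["order", "ORD-12345", "tracking", "delivery"]
    mock_judge = lambda _: 2 / 5  # stand-in for the second LLM call
    composite, action, scores = score_response(reply, entities, mock_judge)
    print(f"Keyword score: {scores['keyword']:.2f}")
    print(f"Length score: {scores['length']:.2f}")
    print(f"LLM judge: {scores['judge']:.2f}")
    print(f"Composite: {composite:.2f}")
    print(f"Action: {action}")
```

The vague reply mentions only one of the four entities (keyword 0.25), is 8 words against a 40-word minimum (length 0.20), and earns a mocked 2/5 from the judge (0.40), so the weighted composite lands well below the 0.6 delivery threshold.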
For a vague, too-short response, the scorer produces the following output:
Keyword score: 0.25
Length score: 0.20
LLM judge: 0.40
Composite: 0.33
Action: RETRY or ESCALATE
Tune the weights to match your business priorities: for ShopMax India order queries, LLM-judge accuracy matters most, so give it 60% of the weight. Cache judge scores for identical responses to avoid redundant API calls. Log all scores to a time-series database such as InfluxDB or BigQuery so you can track quality trends over time and alert when the 7-day rolling average composite score drops below 0.6.
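The caching and trend-alerting pieces can be sketched in memory. The class names are illustrative, and the `QualityMonitor` is a stand-in for the InfluxDB/BigQuery sink: a real deployment would write points to the database and run the rolling-average alert there:

```python
import hashlib
import time
from collections import deque
from typing import Callable, Optional

class JudgeCache:
    """Memoize judge scores by response hash to skip redundant API calls."""
    def __init__(self, judge: Callable[[str], float]):
        self._judge = judge
        self._cache: dict[str, float] = {}

    def score(self, response: str) -> float:
        key = hashlib.sha256(response.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._judge(response)  # only called on a miss
        return self._cache[key]

class QualityMonitor:
    """In-memory stand-in for the time-series sink: keeps (timestamp, score)
    points and alerts when the 7-day rolling average drops below 0.6."""
    def __init__(self, window_days: float = 7.0, threshold: float = 0.6):
        self.window = window_days * 86400  # seconds
        self.threshold = threshold
        self.points: deque = deque()  # (unix_ts, composite_score)

    def record(self, score: float, ts: Optional[float] = None) -> None:
        self.points.append((ts if ts is not None else time.time(), score))

    def rolling_average(self, now: Optional[float] = None) -> Optional[float]:
        now = now if now is not None else time.time()
        while self.points and self.points[0][0] < now - self.window:
            self.points.popleft()  # drop points older than the window
        if not self.points:
            return None
        return sum(s for _, s in self.points) / len(self.points)

    def should_alert(self, now: Optional[float] = None) -> bool:
        avg = self.rolling_average(now)
        return avg is not None and avg < self.threshold
```

Hashing the full response text means only byte-identical responses share a cache entry, which is exactly the redundancy worth eliminating without risking a stale score for a different answer.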