Real-Time LLM Quality Scoring with Custom Metrics in Python
Author: Venkata Sudhakar
ShopMax India's order tracking agent sometimes produces vague or off-topic responses that frustrate customers before a human supervisor can intervene. Rather than relying on post-hoc evaluation, a real-time quality scorer can assess each response before it is delivered, flagging low-confidence answers for human review or triggering an automatic retry. This tutorial shows how to build a lightweight quality scoring pipeline using rule-based checks and an LLM-as-judge pattern.
The scoring pipeline runs three checks on every response: a keyword relevance check (does the response mention key entities from the question?), a length check (is the response too short to be useful or too long to be readable?), and an LLM-as-judge score (a second LLM call that rates the response on a 1-5 scale). A composite score decides whether to deliver, retry, or escalate to a human agent. The overhead is one small LLM call per response, adding roughly 100-150ms but preventing bad responses from reaching customers.
The example below shows the quality scorer for ShopMax India order queries. Responses scoring below 0.6 trigger a retry or escalation to a human support agent.
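The sketch below wires the three checks together. The function names, the entity list, the 40-120 word length band, and the weights (keyword 0.25, length 0.15, judge 0.60) are illustrative, and the judge is mocked with a fixed 2/5 rating so the example runs offline; a real deployment would plug in the second LLM call instead:

```python
from typing import Callable, Optional

def keyword_score(response: str, keywords: list[str]) -> float:
    """Fraction of key entities from the question that appear in the response."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0

def length_score(response: str, min_words: int = 40, max_words: int = 120) -> float:
    """Penalize responses too short to be useful or too long to be readable."""
    n = len(response.split())
    if n < min_words:
        return n / min_words
    if n > max_words:
        return max(0.0, 1.0 - (n - max_words) / max_words)
    return 1.0

def score_response(response: str, keywords: list[str],
                   judge: Callable[[str], float],
                   weights: Optional[dict[str, float]] = None):
    """Blend the three checks; below 0.6 the response is held back."""
    weights = weights or {"keyword": 0.25, "length": 0.15, "judge": 0.60}
    scores = {
        "keyword": keyword_score(response, keywords),
        "length": length_score(response),
        "judge": judge(response),
    }
    composite = sum(weights[k] * scores[k] for k in weights)
    action = "DELIVER" if composite >= 0.6 else "RETRY or ESCALATE"
    return composite, action, scores

if __name__ == "__main__":
    # A vague reply to "Where is order ORD-12345? Share tracking and delivery date."
    reply = "Your order is processing, please check back later."
    entities = ["order", "ORD-12345", "tracking", "delivery"]
    mock_judge = lambda _: 2 / 5  # stand-in for the second LLM call
    composite, action, scores = score_response(reply, entities, mock_judge)
    print(f"Keyword score: {scores['keyword']:.2f}")
    print(f"Length score: {scores['length']:.2f}")
    print(f"LLM judge: {scores['judge']:.2f}")
    print(f"Composite: {composite:.2f}")
    print(f"Action: {action}")
```

The vague reply mentions only one of the four entities (keyword 0.25), is 8 words against a 40-word minimum (length 0.20), and earns a mocked 2/5 from the judge (0.40), so the weighted composite lands well below the 0.6 delivery threshold.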
For a vague, too-short response, the scorer produces the following output:
Keyword score: 0.25
Length score: 0.20
LLM judge: 0.40
Composite: 0.33
Action: RETRY or ESCALATE
Tune the weights to match your business priorities: for ShopMax India order queries, LLM-judge accuracy matters most, so give it 60% of the weight. Cache judge scores for identical responses to avoid redundant API calls. Log all scores to a time-series database such as InfluxDB or BigQuery so you can track quality trends over time and alert when the 7-day rolling average composite score drops below 0.6.
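The caching and trend-alerting pieces can be sketched in memory. The class names are illustrative, and the `QualityMonitor` is a stand-in for the InfluxDB/BigQuery sink: a real deployment would write points to the database and run the rolling-average alert there:

```python
import hashlib
import time
from collections import deque
from typing import Callable, Optional

class JudgeCache:
    """Memoize judge scores by response hash to skip redundant API calls."""
    def __init__(self, judge: Callable[[str], float]):
        self._judge = judge
        self._cache: dict[str, float] = {}

    def score(self, response: str) -> float:
        key = hashlib.sha256(response.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._judge(response)  # only called on a miss
        return self._cache[key]

class QualityMonitor:
    """In-memory stand-in for the time-series sink: keeps (timestamp, score)
    points and alerts when the 7-day rolling average drops below 0.6."""
    def __init__(self, window_days: float = 7.0, threshold: float = 0.6):
        self.window = window_days * 86400  # seconds
        self.threshold = threshold
        self.points: deque = deque()  # (unix_ts, composite_score)

    def record(self, score: float, ts: Optional[float] = None) -> None:
        self.points.append((ts if ts is not None else time.time(), score))

    def rolling_average(self, now: Optional[float] = None) -> Optional[float]:
        now = now if now is not None else time.time()
        while self.points and self.points[0][0] < now - self.window:
            self.points.popleft()  # drop points older than the window
        if not self.points:
            return None
        return sum(s for _, s in self.points) / len(self.points)

    def should_alert(self, now: Optional[float] = None) -> bool:
        avg = self.rolling_average(now)
        return avg is not None and avg < self.threshold
```

Hashing the full response text means only byte-identical responses share a cache entry, which is exactly the redundancy worth eliminating without risking a stale score for a different answer.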