Human Evaluation Workflows for LLM Applications
Author: Venkata Sudhakar
Automated metrics like RAGAS and ROUGE measure structural quality - lexical overlap, faithfulness to retrieved context - but miss nuance: whether a ShopMax India customer would actually find an answer helpful. Human evaluation collects ground-truth quality labels from annotators, enabling calibration of automated metrics and surfacing issues that machines miss, such as tone, completeness, and cultural relevance for Indian customers.
A human evaluation workflow defines a rating rubric with dimensions such as relevance, accuracy, and helpfulness, each scored 1 to 5. Annotators rate LLM responses against the rubric. Inter-annotator agreement, measured with Cohen's kappa, tells you how reliable the labels are - a kappa above 0.6 generally indicates good agreement. Aggregated human scores become the ground-truth baseline that automated metrics are calibrated against, and periodic human eval runs validate that automated scores remain predictive of human judgment.
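The rubric itself can be captured as plain data shared by the annotation UI and the aggregation code. The sketch below shows one possible shape; the dimension prompts and score anchors are illustrative, not ShopMax India's actual guidelines:

# Illustrative rating rubric: three dimensions, each scored 1-5.
# Anchor wordings are assumptions for this sketch, not real ShopMax guidelines.
RUBRIC = {
    "relevance": {
        "prompt": "Does the response address what the customer actually asked?",
        "anchors": {1: "off-topic", 3: "partially on-topic", 5: "directly and fully on-topic"},
    },
    "accuracy": {
        "prompt": "Are facts such as prices, warranty terms, and policies correct?",
        "anchors": {1: "major factual errors", 3: "minor inaccuracies", 5: "fully correct"},
    },
    "helpfulness": {
        "prompt": "Could the customer act on this answer without follow-up?",
        "anchors": {1: "not actionable", 3: "somewhat useful", 5: "clear and complete"},
    },
}
SCALE = (1, 5)  # every dimension is rated on the same 1-5 scale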
The example below builds a ShopMax India annotation pipeline. It collects simulated ratings from two annotators across three quality dimensions, computes per-question averages, and measures inter-annotator agreement using Cohen's kappa on the relevance dimension.
It gives the following output:
ShopMax India - LLM Human Evaluation
========================================
Q1: What is the warranty on Samsung TVs at ShopMax?
Relevance:5 Accuracy:5 Helpfulness:4
Q2: Do you deliver to Chennai?
Relevance:4 Accuracy:5 Helpfulness:4
Q3: What is the EMI option for laptops?
Relevance:5 Accuracy:4 Helpfulness:5
Cohen Kappa (relevance): 0.667 - Good agreement
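A minimal sketch of such a pipeline, assuming plain Python with scikit-learn's cohen_kappa_score, could look like the following. The hard-coded ratings are placeholders standing in for real annotator input, so the numbers it prints (including the kappa) depend entirely on them:

from sklearn.metrics import cohen_kappa_score

QUESTIONS = [
    "What is the warranty on Samsung TVs at ShopMax?",
    "Do you deliver to Chennai?",
    "What is the EMI option for laptops?",
]
DIMENSIONS = ["relevance", "accuracy", "helpfulness"]

# Simulated ratings: one list of per-question scores per annotator.
# In production these come from the annotation UI instead of being hard-coded.
ratings = {
    "annotator_1": [
        {"relevance": 5, "accuracy": 5, "helpfulness": 4},
        {"relevance": 4, "accuracy": 5, "helpfulness": 3},
        {"relevance": 5, "accuracy": 4, "helpfulness": 5},
    ],
    "annotator_2": [
        {"relevance": 5, "accuracy": 5, "helpfulness": 4},
        {"relevance": 4, "accuracy": 5, "helpfulness": 5},
        {"relevance": 5, "accuracy": 4, "helpfulness": 5},
    ],
}

print("ShopMax India - LLM Human Evaluation")
print("=" * 40)

# Per-question averages across annotators for each rubric dimension.
for i, question in enumerate(QUESTIONS):
    avg = {
        dim: sum(r[i][dim] for r in ratings.values()) / len(ratings)
        for dim in DIMENSIONS
    }
    print(f"Q{i + 1}: {question}")
    print(f"Relevance:{avg['relevance']:.0f} "
          f"Accuracy:{avg['accuracy']:.0f} "
          f"Helpfulness:{avg['helpfulness']:.0f}")

# Inter-annotator agreement on the relevance dimension.
relevance_1 = [r["relevance"] for r in ratings["annotator_1"]]
relevance_2 = [r["relevance"] for r in ratings["annotator_2"]]
kappa = cohen_kappa_score(relevance_1, relevance_2)
verdict = "Good agreement" if kappa > 0.6 else "Retrain annotators on the rubric"
print(f"Cohen Kappa (relevance): {kappa:.3f} - {verdict}")

Computing kappa per dimension rather than on pooled scores makes it easier to see which part of the rubric annotators interpret differently.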
In production, replace the simulated ratings with a lightweight annotation UI - a simple Flask form works well for internal teams. Run human evaluation monthly on 50 randomly sampled production queries. Compare human scores against automated RAGAS and ROUGE scores to validate the automated pipeline is still tracking human judgment. When kappa drops below 0.5, retrain annotators on the rubric with calibration examples. Store all annotations in a database table indexed by model version and date for longitudinal quality tracking.
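A sketch of the storage and validation side, assuming SQLite for the annotations table and SciPy for the correlation check; the table, column, and function names are illustrative:

import sqlite3
from scipy.stats import spearmanr

# Annotations table indexed by model version and date for longitudinal tracking.
conn = sqlite3.connect("shopmax_eval.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS human_annotations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        model_version TEXT NOT NULL,
        eval_date TEXT NOT NULL,
        question TEXT NOT NULL,
        annotator TEXT NOT NULL,
        relevance INTEGER,
        accuracy INTEGER,
        helpfulness INTEGER
    )
""")
conn.execute("""
    CREATE INDEX IF NOT EXISTS idx_annotations_version_date
    ON human_annotations (model_version, eval_date)
""")
conn.commit()

def automated_tracks_human(human_relevance, automated_scores):
    """Spearman correlation between human relevance ratings and an automated
    metric (e.g. a RAGAS score) on the same sampled queries; a falling
    correlation means the automated pipeline is drifting from human judgment."""
    rho, p_value = spearmanr(human_relevance, automated_scores)
    return rho, p_value

# Hypothetical monthly check over sampled production queries.
rho, p = automated_tracks_human([5, 4, 5, 3, 4], [0.92, 0.81, 0.95, 0.55, 0.78])
print(f"Human vs automated rank correlation: {rho:.2f} (p={p:.3f})")

Spearman (rank) correlation is used here rather than Pearson because human ratings are ordinal 1-to-5 scores.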