Human Evaluation Workflows for LLM Applications
Author: Venkata Sudhakar
Automated metrics like RAGAS and ROUGE measure structural quality - lexical overlap, faithfulness to retrieved context - but miss nuance: whether a ShopMax India customer would actually find an answer helpful. Human evaluation collects ground-truth quality labels from annotators, enabling calibration of automated metrics and surfacing issues that machines miss, such as tone, completeness, and cultural relevance for Indian customers.
A human evaluation workflow defines a rating rubric with dimensions such as relevance, accuracy, and helpfulness, each scored 1 to 5. Annotators rate LLM responses against the rubric. Inter-annotator agreement, measured with Cohen's kappa, tells you how reliable the labels are - a kappa above 0.6 generally indicates good agreement. Aggregated human scores become the ground-truth baseline that automated metrics are calibrated against, and periodic human eval runs validate that automated scores remain predictive of human judgment.
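The rubric itself can be captured as plain data shared by the annotation UI and the aggregation code. The sketch below shows one possible shape; the dimension prompts and score anchors are illustrative, not ShopMax India's actual guidelines:

# Illustrative rating rubric: three dimensions, each scored 1-5.
# Anchor wordings are assumptions for this sketch, not real ShopMax guidelines.
RUBRIC = {
    "relevance": {
        "prompt": "Does the response address what the customer actually asked?",
        "anchors": {1: "off-topic", 3: "partially on-topic", 5: "directly and fully on-topic"},
    },
    "accuracy": {
        "prompt": "Are facts such as prices, warranty terms, and policies correct?",
        "anchors": {1: "major factual errors", 3: "minor inaccuracies", 5: "fully correct"},
    },
    "helpfulness": {
        "prompt": "Could the customer act on this answer without follow-up?",
        "anchors": {1: "not actionable", 3: "somewhat useful", 5: "clear and complete"},
    },
}
SCALE = (1, 5)  # every dimension is rated on the same 1-5 scale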
The example below builds a ShopMax India annotation pipeline. It collects simulated ratings from two annotators across three quality dimensions, computes per-question averages, and measures inter-annotator agreement using Cohen's kappa on the relevance dimension.
It gives the following output:
ShopMax India - LLM Human Evaluation
========================================
Q1: What is the warranty on Samsung TVs at ShopMax?
Relevance:5 Accuracy:5 Helpfulness:4
Q2: Do you deliver to Chennai?
Relevance:4 Accuracy:5 Helpfulness:4
Q3: What is the EMI option for laptops?
Relevance:5 Accuracy:4 Helpfulness:5
Cohen Kappa (relevance): 0.667 - Good agreement
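A minimal sketch of such a pipeline, assuming plain Python with scikit-learn's cohen_kappa_score, could look like the following. The hard-coded ratings are placeholders standing in for real annotator input, so the numbers it prints (including the kappa) depend entirely on them:

from sklearn.metrics import cohen_kappa_score

QUESTIONS = [
    "What is the warranty on Samsung TVs at ShopMax?",
    "Do you deliver to Chennai?",
    "What is the EMI option for laptops?",
]
DIMENSIONS = ["relevance", "accuracy", "helpfulness"]

# Simulated ratings: one list of per-question scores per annotator.
# In production these come from the annotation UI instead of being hard-coded.
ratings = {
    "annotator_1": [
        {"relevance": 5, "accuracy": 5, "helpfulness": 4},
        {"relevance": 4, "accuracy": 5, "helpfulness": 3},
        {"relevance": 5, "accuracy": 4, "helpfulness": 5},
    ],
    "annotator_2": [
        {"relevance": 5, "accuracy": 5, "helpfulness": 4},
        {"relevance": 4, "accuracy": 5, "helpfulness": 5},
        {"relevance": 5, "accuracy": 4, "helpfulness": 5},
    ],
}

print("ShopMax India - LLM Human Evaluation")
print("=" * 40)

# Per-question averages across annotators for each rubric dimension.
for i, question in enumerate(QUESTIONS):
    avg = {
        dim: sum(r[i][dim] for r in ratings.values()) / len(ratings)
        for dim in DIMENSIONS
    }
    print(f"Q{i + 1}: {question}")
    print(f"Relevance:{avg['relevance']:.0f} "
          f"Accuracy:{avg['accuracy']:.0f} "
          f"Helpfulness:{avg['helpfulness']:.0f}")

# Inter-annotator agreement on the relevance dimension.
relevance_1 = [r["relevance"] for r in ratings["annotator_1"]]
relevance_2 = [r["relevance"] for r in ratings["annotator_2"]]
kappa = cohen_kappa_score(relevance_1, relevance_2)
verdict = "Good agreement" if kappa > 0.6 else "Retrain annotators on the rubric"
print(f"Cohen Kappa (relevance): {kappa:.3f} - {verdict}")

Computing kappa per dimension rather than on pooled scores makes it easier to see which part of the rubric annotators interpret differently.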
In production, replace the simulated ratings with a lightweight annotation UI - a simple Flask form works well for internal teams. Run human evaluation monthly on 50 randomly sampled production queries. Compare human scores against automated RAGAS and ROUGE scores to validate the automated pipeline is still tracking human judgment. When kappa drops below 0.5, retrain annotators on the rubric with calibration examples. Store all annotations in a database table indexed by model version and date for longitudinal quality tracking.
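A sketch of the storage and validation side, assuming SQLite for the annotations table and SciPy for the correlation check; the table, column, and function names are illustrative:

import sqlite3
from scipy.stats import spearmanr

# Annotations table indexed by model version and date for longitudinal tracking.
conn = sqlite3.connect("shopmax_eval.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS human_annotations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        model_version TEXT NOT NULL,
        eval_date TEXT NOT NULL,
        question TEXT NOT NULL,
        annotator TEXT NOT NULL,
        relevance INTEGER,
        accuracy INTEGER,
        helpfulness INTEGER
    )
""")
conn.execute("""
    CREATE INDEX IF NOT EXISTS idx_annotations_version_date
    ON human_annotations (model_version, eval_date)
""")
conn.commit()

def automated_tracks_human(human_relevance, automated_scores):
    """Spearman correlation between human relevance ratings and an automated
    metric (e.g. a RAGAS score) on the same sampled queries; a falling
    correlation means the automated pipeline is drifting from human judgment."""
    rho, p_value = spearmanr(human_relevance, automated_scores)
    return rho, p_value

# Hypothetical monthly check over sampled production queries.
rho, p = automated_tracks_human([5, 4, 5, 3, 4], [0.92, 0.81, 0.95, 0.55, 0.78])
print(f"Human vs automated rank correlation: {rho:.2f} (p={p:.3f})")

Spearman (rank) correlation is used here rather than Pearson because human ratings are ordinal 1-to-5 scores.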