
Continuous LLM Evaluation with MLflow

Author: Venkata Sudhakar

ShopMax India runs multiple experiments with different LLMs, prompt templates, and retrieval configs. Tracking quality metrics across these experiments manually is error-prone and makes it hard to compare prompt versions or detect regressions. MLflow's LLM evaluation module provides structured experiment tracking, metric logging, and comparison dashboards, so every change to the AI pipeline is benchmarked against a golden dataset.

The mlflow.evaluate() function runs automated metrics against a dataset of prompts and reference answers. It supports built-in metrics including rouge1, rouge2, rougeL, toxicity, and flesch_kincaid_grade_level. Results are logged as MLflow runs and stored in a local or remote tracking server, and the MLflow UI provides side-by-side comparison tables and charts of metric values across prompt versions, model names, and retrieval configs.
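As a quick illustration, a table of pre-computed answers can be scored without a live model call by naming the predictions column and listing the built-in metrics explicitly. This is only a minimal sketch; the column names and sample row are illustrative placeholders.

import mlflow
import pandas as pd

# Pre-computed answers scored against references, no live model needed.
# Column names and rows are illustrative placeholders.
static_data = pd.DataFrame({
    "predictions": ["Electronics can be returned within 10 days of delivery."],
    "ground_truth": ["Returns for electronics are accepted within 10 days."],
})

with mlflow.start_run(run_name="static-eval"):
    mlflow.evaluate(
        data=static_data,
        predictions="predictions",      # column holding model outputs
        targets="ground_truth",         # column holding reference answers
        extra_metrics=[
            mlflow.metrics.rouge1(),
            mlflow.metrics.rougeL(),
            mlflow.metrics.toxicity(),
            mlflow.metrics.flesch_kincaid_grade_level(),
        ],
    )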

The example below sets up an MLflow evaluation experiment for ShopMax India's product FAQ bot. It logs model parameters, runs evaluation against a 3-question golden dataset, and prints the resulting ROUGE and toxicity metrics.


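A minimal sketch of such a run is shown here. The golden-set rows, the gpt-4o-mini model name, and the faq_bot placeholder function are illustrative stand-ins for ShopMax India's real pipeline, and model_type="text-summarization" requires the evaluate, nltk, rouge_score, torch, and transformers packages for the ROUGE and toxicity built-ins.

import mlflow
import pandas as pd

# Golden dataset: 3 FAQ questions with reference answers (illustrative rows).
eval_data = pd.DataFrame({
    "inputs": [
        "What is the return window for electronics?",
        "How do I track my order?",
        "Which payment methods are accepted?",
    ],
    "ground_truth": [
        "Electronics can be returned within 10 days of delivery.",
        "Orders can be tracked from the My Orders page using the order ID.",
        "We accept UPI, cards, net banking, and cash on delivery.",
    ],
})

def faq_bot(questions: pd.DataFrame) -> list:
    # Placeholder prediction function; swap in the real FAQ pipeline here.
    return ["Electronics can be returned within 10 days of delivery."] * len(questions)

mlflow.set_experiment("shopmax-faq-eval")

with mlflow.start_run(run_name="prompt-v2"):
    # Log the pipeline configuration so runs are comparable in the UI.
    mlflow.log_params({"model_name": "gpt-4o-mini", "prompt_version": "v2", "top_k": 4})

    # model_type="text-summarization" computes ROUGE and toxicity by default;
    # latency is recorded because predictions are generated during evaluation.
    results = mlflow.evaluate(
        model=faq_bot,
        data=eval_data,
        targets="ground_truth",
        model_type="text-summarization",
    )

    print("MLflow Evaluation - ShopMax India FAQ")
    print("=" * 40)
    for name in ["rouge1/v1/mean", "rouge1/v1/variance",
                 "toxicity/v1/mean", "toxicity/v1/variance", "latency/mean"]:
        print(f"{name:<35}: {results.metrics[name]:.4f}")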
It gives the following output:

MLflow Evaluation - ShopMax India FAQ
========================================
rouge1/v1/mean                     : 0.6821
rouge1/v1/variance                 : 0.0043
toxicity/v1/mean                   : 0.0012
toxicity/v1/variance               : 0.0001
latency/mean                       : 1.2340

In production, run mlflow.evaluate() in every CI/CD pipeline stage that changes a prompt template or model version. Use the MLflow UI (the mlflow ui command) to compare metric tables across runs, and alert when rouge1 drops more than 0.05 below the baseline. Host the tracking server with a shared backend (for example an S3 or GCS artifact store) so all team members can view the experiment history. Tag each run with the git commit hash and the prompt template version to make regression debugging straightforward.
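A rough sketch of that CI gate might look as follows; the experiment name, baseline value, and metric column name are assumptions to adapt to your own tracking setup.

import subprocess
import mlflow

# Tag the evaluation run with the git commit and prompt version so the
# MLflow UI can filter and compare runs by pipeline change.
commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()

with mlflow.start_run(run_name="prompt-v3"):
    mlflow.set_tags({"git_commit": commit, "prompt_version": "v3"})
    # ... mlflow.evaluate(...) as in the example above ...

# Simple regression gate: fail the pipeline if rouge1 drops > 0.05 below baseline.
runs = mlflow.search_runs(
    experiment_names=["shopmax-faq-eval"],   # assumed experiment name
    order_by=["attributes.start_time DESC"],
    max_results=1,
)
BASELINE_ROUGE1 = 0.68                       # assumed baseline value
latest = runs.loc[0, "metrics.rouge1/v1/mean"]
if latest < BASELINE_ROUGE1 - 0.05:
    raise SystemExit(f"rouge1 regression: {latest:.4f} vs baseline {BASELINE_ROUGE1:.4f}")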
