In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > Guardrails and Evaluation > Continuous LLM Evaluation with MLflow

Continuous LLM Evaluation with MLflow

Author: Venkata Sudhakar

ShopMax India runs multiple experiments with different LLM models, prompt templates, and retrieval configs. Tracking quality metrics across these experiments manually is error-prone and makes it hard to compare prompt versions or detect regressions. MLflow's LLM evaluation module provides structured experiment tracking, metric logging, and comparison dashboards so every change to the AI pipeline is benchmarked against a golden dataset.

MLflow's mlflow.evaluate() function runs automated metrics against a dataset of prompts and reference answers. It supports built-in metrics including rouge1, rouge2, rougeL, toxicity, and flesch_kincaid_grade_level. Results are logged as MLflow runs and stored in a local or remote tracking server. The MLflow UI provides side-by-side comparison tables and charts for metric values across prompt versions, model names, and retrieval configs.

The example below sets up an MLflow evaluation experiment for ShopMax India's product FAQ bot. It logs model parameters, runs evaluation against a 3-question golden dataset, and prints the resulting ROUGE and toxicity metrics.

import mlflow
import pandas as pd
from openai import OpenAI

client = OpenAI(api_key="your-api-key")
mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("shopmax-india-faq-eval")

eval_data = pd.DataFrame({
    "inputs": [
        "What is the return policy at ShopMax India?",
        "How do I track my order in Bangalore?",
        "Does ShopMax India offer EMI on televisions?"
    ],
    "ground_truth": [
        "ShopMax India offers a 7-day return policy for electronics.",
        "Track orders using your order ID and mobile number on the ShopMax India website.",
        "Yes, ShopMax India offers 3, 6, and 12-month EMI on televisions."
    ]
})

def llm_fn(data):
    answers = []
    for q in data["inputs"]:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": q}]
        )
        answers.append(resp.choices[0].message.content)
    return answers

with mlflow.start_run(run_name="gpt-4o-mini-v1"):
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("category", "faq")
    results = mlflow.evaluate(
        model=llm_fn,
        data=eval_data,
        targets="ground_truth",
        model_type="text",
        extra_metrics=[mlflow.metrics.rouge1(), mlflow.metrics.toxicity()]
    )
    print("MLflow Evaluation - ShopMax India FAQ")
    print("=" * 40)
    for k, v in results.metrics.items():
        print(f"{k:<35}: {v:.4f}")

It gives the following output,

MLflow Evaluation - ShopMax India FAQ
========================================
rouge1/v1/mean                     : 0.6821
rouge1/v1/variance                 : 0.0043
toxicity/v1/mean                   : 0.0012
toxicity/v1/variance               : 0.0001
latency/mean                       : 1.2340

In production, run mlflow.evaluate() as part of every CI/CD pipeline stage that changes a prompt template or model version. Use the MLflow UI (mlflow ui command) to compare metric tables across runs and set alerts when rouge1 drops more than 0.05 from baseline. Store the tracking server in a shared location (S3 or GCS backend) so all team members can view experiment history. Tag each run with the git commit hash and the prompt template version to make regression debugging straightforward.

Send your comments, suggestions or queries regarding this site to [email protected].