LLM Quality Gate in CI/CD - Automated Regression Testing for AI Models

Author: Venkata Sudhakar

When ShopMax India updates a prompt template, switches to a newer model version, or changes retrieval parameters, there is a risk that response quality degrades without anyone noticing. An LLM quality gate in the CI/CD pipeline automatically runs a benchmark test suite on every change, compares scores against a baseline, and blocks deployment if quality drops below acceptable thresholds.

The quality gate maintains a golden test dataset of question-answer pairs with known correct responses. On each CI run, the pipeline runs the LLM application against this dataset and scores outputs using semantic similarity. Results are compared against stored baseline scores. If any metric falls more than a configured tolerance below baseline, the pipeline fails and deployment is blocked. A non-zero exit code signals the failure to the CI system.
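The baseline comparison described above can be sketched roughly as follows. This is a minimal illustration, not the article's script: the function names (`load_baseline`, `find_regressions`, `gate_exit_code`), the JSON baseline format, the metric names, and the 0.03 tolerance are all assumptions chosen for the example.

```python
import json
from pathlib import Path

TOLERANCE = 0.03  # assumed value: largest allowed drop below baseline


def load_baseline(path):
    """Read baseline scores from a JSON file shaped like {metric_name: score}."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_regressions(current, baseline, tolerance=TOLERANCE):
    """Return metrics whose current score is more than `tolerance` below baseline."""
    regressions = {}
    for name, base_score in baseline.items():
        score = current.get(name)
        # A metric missing from the current run is treated as a regression.
        if score is None or score < base_score - tolerance:
            regressions[name] = (base_score, score)
    return regressions


def gate_exit_code(current, baseline, tolerance=TOLERANCE):
    """Return 1 (fail) if any metric regressed, otherwise 0 (pass)."""
    return 1 if find_regressions(current, baseline, tolerance) else 0


# Hypothetical scores; a real run would take these from the benchmark suite.
baseline_scores = {"avg_similarity": 0.878, "min_similarity": 0.863}
current_scores = {"avg_similarity": 0.871, "min_similarity": 0.815}

for name, (base, score) in find_regressions(current_scores, baseline_scores).items():
    print(f"[FAIL] {name}: baseline {base}, current {score}")
```

In a CI script, passing the result of `gate_exit_code(...)` to `sys.exit` would produce the non-zero exit code that blocks the deployment.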

The example below implements a CI/CD quality gate script for ShopMax India's product FAQ chatbot. It loads a golden dataset, runs inference, scores each response against the expected answer using cosine similarity, and returns a non-zero exit code if average quality falls below the threshold.

import sys
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
encoder = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.82

GOLDEN_DATASET = [
    {"q": "What is the return policy at ShopMax India?",
     "a": "ShopMax India offers a 7-day return policy for electronics purchased online or in-store."},
    {"q": "How do I track my order at ShopMax India?",
     "a": "Track your order on the ShopMax India website using your order ID and registered mobile number."},
    {"q": "Does ShopMax India offer EMI on laptops?",
     "a": "Yes, ShopMax India offers 3, 6, and 12-month no-cost EMI on laptops with major credit and debit cards."},
    {"q": "What cities does ShopMax India deliver to?",
     "a": "ShopMax India delivers to over 500 cities including Mumbai, Bangalore, Delhi, Hyderabad, and Chennai."}
]

def get_answer(question):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a ShopMax India customer support assistant."},
            {"role": "user", "content": question}
        ]
    )
    return resp.choices[0].message.content

def similarity(a, b):
    ea = encoder.encode(a, convert_to_tensor=True)
    eb = encoder.encode(b, convert_to_tensor=True)
    return float(util.cos_sim(ea, eb)[0][0])

scores = []
print("ShopMax India LLM Quality Gate")
print("=" * 40)
for item in GOLDEN_DATASET:
    answer = get_answer(item["q"])
    score = similarity(answer, item["a"])
    scores.append(score)
    label = "PASS" if score >= THRESHOLD else "FAIL"
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    print(f"[{label}] {item['q'][:45]} -> {score:.3f}")

avg = sum(scores) / len(scores)
gate = "PASS" if avg >= THRESHOLD else "FAIL"
print(f"\nAverage: {avg:.3f}  Threshold: {THRESHOLD}  Gate: {gate}")
sys.exit(0 if avg >= THRESHOLD else 1)

It prints output like the following; exact scores vary between runs because model responses are not deterministic.

ShopMax India LLM Quality Gate
========================================
[PASS] What is the return policy at ShopMax India? -> 0.891
[PASS] How do I track my order at ShopMax India?   -> 0.874
[PASS] Does ShopMax India offer EMI on laptops?    -> 0.863
[PASS] What cities does ShopMax India deliver to?  -> 0.882

Average: 0.878  Threshold: 0.82  Gate: PASS

In production, store the golden dataset in a version-controlled CSV file and update it monthly with new real-world queries. Use cosine similarity via sentence-transformers as the primary metric, since it scores a correctly paraphrased answer highly where exact match would reject it. Set the threshold at 0.82 and run the script as a CI step after unit tests and before the container build. Store historical gate results in a shared dashboard so the team can track quality trends across model versions and prompt changes.
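As a rough sketch of the first and last of these recommendations, the helpers below load the golden dataset from a CSV file and append each gate result to a history file that a dashboard could read. The `q` and `a` column names mirror the keys used in the script above; the function names, the JSON Lines history format, and the `label` field are assumptions made for illustration.

```python
import csv
import json
from datetime import datetime, timezone


def load_golden_dataset(path):
    """Load question-answer pairs from a CSV file with `q` and `a` columns."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = [{"q": r["q"].strip(), "a": r["a"].strip()} for r in csv.DictReader(f)]
    if not rows:
        raise ValueError(f"golden dataset {path} is empty")
    return rows


def append_gate_result(path, avg_score, threshold, label):
    """Append one gate result as a JSON line so trends can be charted later."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "label": label,  # hypothetical field, e.g. a commit SHA or model version
        "average": round(avg_score, 3),
        "threshold": threshold,
        "passed": avg_score >= threshold,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Keeping the history as one JSON object per line means each CI run only appends to the file, and any plotting or dashboard tool can read it back line by line.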
