LLM Quality Gate in CI/CD - Automated Regression Testing for AI Models
Author: Venkata Sudhakar
When ShopMax India updates a prompt template, switches to a newer model version, or changes retrieval parameters, there is a risk that response quality degrades without anyone noticing. An LLM quality gate in the CI/CD pipeline automatically runs a benchmark test suite on every change, compares scores against a baseline, and blocks deployment if quality drops below acceptable thresholds.
The quality gate maintains a golden test dataset of question-answer pairs with known correct responses. On each CI run, the pipeline runs the LLM application against this dataset and scores outputs using semantic similarity. Results are compared against stored baseline scores. If any metric falls more than a configured tolerance below baseline, the pipeline fails and deployment is blocked. A non-zero exit code signals the failure to the CI system.
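The baseline comparison described above can be sketched in a few lines. This is a minimal illustration, not ShopMax India's actual gate: the metric names, baseline values, and the 0.03 tolerance are hypothetical, and the helpers `find_regressions` and `gate_exit_code` are names introduced here.

```python
# Hypothetical metric names and baseline values, for illustration only.
BASELINE = {"semantic_similarity": 0.88, "answer_coverage": 0.91}
TOLERANCE = 0.03  # assumed allowed drop below baseline before the gate fails


def find_regressions(current, baseline, tolerance):
    """Return {metric: (baseline_score, current_score)} for metrics that dropped too far."""
    regressions = {}
    for name, base_score in baseline.items():
        # A metric missing from the current run is scored 0.0, so it fails the gate.
        score = current.get(name, 0.0)
        if score < base_score - tolerance:
            regressions[name] = (base_score, score)
    return regressions


def gate_exit_code(current, baseline=BASELINE, tolerance=TOLERANCE):
    """Return 0 when every metric is within tolerance of baseline, 1 otherwise."""
    regressions = find_regressions(current, baseline, tolerance)
    for name, (base_score, score) in regressions.items():
        print(f"[FAIL] {name}: {score:.3f} vs baseline {base_score:.3f} "
              f"(tolerance {tolerance})")
    return 1 if regressions else 0
```

A CI entry point would typically finish with `sys.exit(gate_exit_code(current_scores))`, so that any regression becomes the non-zero exit code the CI system reacts to.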
The example below implements a CI/CD quality gate script for ShopMax India's product FAQ chatbot. It loads a golden dataset, runs inference, scores each response against the expected answer using cosine similarity, and returns a non-zero exit code if average quality falls below the threshold.
It gives the following output:
ShopMax India LLM Quality Gate
========================================
[PASS] What is the return policy at ShopMax India? -> 0.891
[PASS] How do I track my order at ShopMax India? -> 0.874
[PASS] Does ShopMax India offer EMI on laptops? -> 0.863
[PASS] What cities does ShopMax India deliver to? -> 0.882
Average: 0.878 Threshold: 0.82 Gate: PASS
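The gate script described above (load a golden dataset, run inference, score each response by cosine similarity, fail on a low average) can be sketched as follows. This is not the original script. To stay dependency-free, the sketch uses a bag-of-words count vector as a stand-in for real sentence embeddings, so its scores will not match the output listed above. The expected answers in `GOLDEN` are placeholders rather than ShopMax India's real policies, and `answer_fn` is an assumed callable that wraps the application under test.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.82  # pass mark used in the article

# Questions come from the article; the expected answers are placeholders.
GOLDEN = [
    ("What is the return policy at ShopMax India?",
     "Placeholder: products can be returned within the return window."),
    ("How do I track my order at ShopMax India?",
     "Placeholder: open the orders page and select the order to track."),
]


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def run_gate(dataset, answer_fn, threshold=THRESHOLD):
    """Score every golden pair, print a report, and return a process exit code."""
    print("ShopMax India LLM Quality Gate")
    print("=" * 40)
    if not dataset:
        # An empty dataset fails the gate rather than passing vacuously.
        print("Gate: FAIL (empty golden dataset)")
        return 1
    scores = []
    for question, expected in dataset:
        score = cosine_similarity(embed(answer_fn(question)), embed(expected))
        scores.append(score)
        status = "PASS" if score >= threshold else "FAIL"
        print(f"[{status}] {question} -> {score:.3f}")
    average = sum(scores) / len(scores)
    passed = average >= threshold
    print(f"Average: {average:.3f} Threshold: {threshold} "
          f"Gate: {'PASS' if passed else 'FAIL'}")
    return 0 if passed else 1
```

In CI the script would end with something like `sys.exit(run_gate(GOLDEN, ask_chatbot))`, where `ask_chatbot` is a hypothetical wrapper around the chatbot. Swapping `embed` for a real sentence-embedding model changes only that one function; the gate logic stays the same.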
In production, store the golden dataset in a version-controlled CSV file and update it monthly with new real-world queries. Use cosine similarity over sentence-transformers embeddings as the primary metric, since it handles paraphrased correct answers better than exact match. Set the threshold at 0.82, as in the run above, and integrate the script as a CI step that runs after unit tests and before the container build. Store historical gate results in a shared dashboard so the team can track quality trends across model versions and prompt changes.
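Loading the golden dataset from a version-controlled CSV file might look like the sketch below. The column names `question` and `expected_answer` are assumptions, since the article does not specify a schema, and `load_golden_dataset` is a helper name introduced here.

```python
import csv
import io


def load_golden_dataset(csv_file):
    """Read (question, expected_answer) pairs from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    required = {"question", "expected_answer"}  # assumed column names
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"golden dataset is missing columns: {sorted(missing)}")
    pairs = []
    for row in reader:
        question = (row["question"] or "").strip()
        expected = (row["expected_answer"] or "").strip()
        if question and expected:  # skip incomplete rows
            pairs.append((question, expected))
    return pairs


# In-memory demo; a real run would open the CSV file from the repository instead.
sample = io.StringIO(
    "question,expected_answer\n"
    "Does ShopMax India offer EMI on laptops?,Placeholder expected answer\n"
)
pairs = load_golden_dataset(sample)
```

Failing fast on a missing column keeps a malformed monthly update from silently shrinking the test suite.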