LLM Regression Testing with Promptfoo
Author: Venkata Sudhakar
ShopMax India's customer service chatbot went live with GPT-4o, but the team wants to evaluate whether switching to a cheaper model will degrade response quality. Without a structured regression testing process, every prompt change or model swap is a risk. Promptfoo is an open-source CLI tool for evaluating LLM outputs against defined test cases. It runs your prompts against multiple models, grades responses using custom assertions, and flags regressions before they reach production.
Promptfoo works from a YAML config file that defines prompts, providers (models), and test cases. Each test case has an input and one or more assertions - string matching, LLM-graded scoring, or custom JavaScript checks. When you run promptfoo eval, it executes every prompt-provider-test combination and produces a report showing pass/fail rates, latency, and cost per model. You can integrate it into CI/CD pipelines to block deployments if the pass rate drops below a threshold.
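A minimal config illustrating these pieces might look like the sketch below. The prompt text, model names, question, and assertion values are all illustrative placeholders, not ShopMax India's real configuration; the three assertion types shown (icontains, llm-rubric, javascript) correspond to the string-matching, LLM-graded, and custom-JavaScript checks described above.

```yaml
# promptfooconfig.yaml - a minimal sketch; values are illustrative
prompts:
  - "Answer the customer's question: {{question}}"
providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "What is your return window?"
    assert:
      - type: icontains        # case-insensitive string match
        value: "30 days"
      - type: llm-rubric       # LLM-graded semantic scoring
        value: "States the return window clearly and politely"
      - type: javascript       # custom JavaScript check on the output
        value: "output.length < 500"
```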
The example below shows a Promptfoo setup for ShopMax India's return policy assistant. We test two models against three customer questions and write results to a JSON report using Python to generate the config and run the evaluation.
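The script below is a sketch of that setup. It writes a promptfoo config with two providers and three test cases, shells out to the promptfoo CLI, and summarizes the JSON report. It assumes promptfoo is installed (e.g. via npm) and OPENAI_API_KEY is set; the customer questions and assertion values are illustrative, and the results.json schema parsed in summarize() reflects recent promptfoo releases and may differ across versions.

```python
import json
import subprocess

# Illustrative config for the return policy assistant. Model names,
# questions, and assertion values are assumptions, not ShopMax's real suite.
CONFIG_YAML = """\
description: ShopMax India return policy assistant - regression suite
prompts:
  - "You are ShopMax India's return policy assistant. Answer concisely: {{question}}"
providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "How many days do I have to return a smartphone?"
    assert:
      - type: icontains
        value: "days"
  - vars:
      question: "Can I return a product without the original packaging?"
    assert:
      - type: icontains
        value: "packaging"
  - vars:
      question: "How long does a refund take after pickup?"
    assert:
      - type: icontains
        value: "refund"
"""


def write_config(path: str = "promptfooconfig.yaml") -> str:
    """Write the evaluation config that `promptfoo eval` will consume."""
    with open(path, "w") as f:
        f.write(CONFIG_YAML)
    return path


def run_eval(config_path: str, output_path: str = "results.json") -> dict:
    """Invoke the promptfoo CLI (assumes it is on PATH or usable via npx)."""
    subprocess.run(
        ["npx", "promptfoo", "eval", "-c", config_path, "--output", output_path],
        check=True,
    )
    with open(output_path) as f:
        return json.load(f)


def summarize(report: dict) -> None:
    # NOTE: results.json layout varies across promptfoo versions; this
    # reads the stats block emitted by recent releases (an assumption).
    stats = report["results"]["stats"]
    passed, failed = stats["successes"], stats["failures"]
    total = passed + failed
    print(f"Tests run: {total}")
    print(f"Passed: {passed}")
    print(f"Failed: {failed}")
    print(f"Pass rate: {passed / total:.1%}")
    print("Regression test complete - check results.json for full report.")


if __name__ == "__main__":
    summarize(run_eval(write_config()))
```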
Running the evaluation produces output like the following:
Tests run: 6
Passed: 5
Failed: 1
Pass rate: 83.3%
Regression test complete - check results.json for full report.
Integrate Promptfoo into GitHub Actions to block merges when pass rate drops below 80%. Store baseline results in version control and use promptfoo eval --grader openai:gpt-4o for LLM-as-judge semantic scoring, not just string matching. For ShopMax India, maintain separate test suites per agent type - returns, pricing, and inventory - so regressions are caught at the module level before they affect the full customer experience.
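One way to sketch that GitHub Actions gate is the workflow below. The workflow name, Node version, and secret name are assumptions, and check_pass_rate.py is a hypothetical helper script (not part of promptfoo) that exits nonzero when the pass rate in results.json falls below the given threshold.

```yaml
# .github/workflows/llm-regression.yml - a sketch; names are placeholders
name: LLM regression tests
on: [pull_request]
jobs:
  promptfoo-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g promptfoo
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: promptfoo eval -c promptfooconfig.yaml --output results.json
      - name: Enforce 80% pass-rate gate
        # check_pass_rate.py is a hypothetical script that fails the job
        # when the pass rate in results.json drops below 0.8
        run: python check_pass_rate.py results.json 0.8
```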