LLM Regression Testing with Promptfoo
Author: Venkata Sudhakar
ShopMax India's customer service chatbot went live with GPT-4o, but the team wants to evaluate whether switching to a cheaper model will degrade response quality. Without a structured regression testing process, every prompt change or model swap is a risk. Promptfoo is an open-source CLI tool for evaluating LLM outputs against defined test cases. It runs your prompts against multiple models, grades responses using custom assertions, and flags regressions before they reach production.
Promptfoo works from a YAML config file that defines prompts, providers (models), and test cases. Each test case has an input and one or more assertions - string matching, LLM-graded scoring, or custom JavaScript checks. When you run promptfoo eval, it executes every prompt-provider-test combination and produces a report showing pass/fail rates, latency, and cost per model. You can integrate it into CI/CD pipelines to block deployments if the pass rate drops below a threshold.
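A minimal config illustrating these pieces might look like the sketch below. The prompt text, model names, question, and assertion values are all illustrative placeholders, not ShopMax India's real configuration; the three assertion types shown (icontains, llm-rubric, javascript) correspond to the string-matching, LLM-graded, and custom-JavaScript checks described above.

```yaml
# promptfooconfig.yaml - a minimal sketch; values are illustrative
prompts:
  - "Answer the customer's question: {{question}}"
providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "What is your return window?"
    assert:
      - type: icontains        # case-insensitive string match
        value: "30 days"
      - type: llm-rubric       # LLM-graded semantic scoring
        value: "States the return window clearly and politely"
      - type: javascript       # custom JavaScript check on the output
        value: "output.length < 500"
```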
The example below shows a Promptfoo setup for ShopMax India's return policy assistant. We test two models against three customer questions and write results to a JSON report using Python to generate the config and run the evaluation.
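The script below is a sketch of that setup. It writes a promptfoo config with two providers and three test cases, shells out to the promptfoo CLI, and summarizes the JSON report. It assumes promptfoo is installed (e.g. via npm) and OPENAI_API_KEY is set; the customer questions and assertion values are illustrative, and the results.json schema parsed in summarize() reflects recent promptfoo releases and may differ across versions.

```python
import json
import subprocess

# Illustrative config for the return policy assistant. Model names,
# questions, and assertion values are assumptions, not ShopMax's real suite.
CONFIG_YAML = """\
description: ShopMax India return policy assistant - regression suite
prompts:
  - "You are ShopMax India's return policy assistant. Answer concisely: {{question}}"
providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "How many days do I have to return a smartphone?"
    assert:
      - type: icontains
        value: "days"
  - vars:
      question: "Can I return a product without the original packaging?"
    assert:
      - type: icontains
        value: "packaging"
  - vars:
      question: "How long does a refund take after pickup?"
    assert:
      - type: icontains
        value: "refund"
"""


def write_config(path: str = "promptfooconfig.yaml") -> str:
    """Write the evaluation config that `promptfoo eval` will consume."""
    with open(path, "w") as f:
        f.write(CONFIG_YAML)
    return path


def run_eval(config_path: str, output_path: str = "results.json") -> dict:
    """Invoke the promptfoo CLI (assumes it is on PATH or usable via npx)."""
    subprocess.run(
        ["npx", "promptfoo", "eval", "-c", config_path, "--output", output_path],
        check=True,
    )
    with open(output_path) as f:
        return json.load(f)


def summarize(report: dict) -> None:
    # NOTE: results.json layout varies across promptfoo versions; this
    # reads the stats block emitted by recent releases (an assumption).
    stats = report["results"]["stats"]
    passed, failed = stats["successes"], stats["failures"]
    total = passed + failed
    print(f"Tests run: {total}")
    print(f"Passed: {passed}")
    print(f"Failed: {failed}")
    print(f"Pass rate: {passed / total:.1%}")
    print("Regression test complete - check results.json for full report.")


if __name__ == "__main__":
    summarize(run_eval(write_config()))
```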
Running the evaluation produces output like the following:
Tests run: 6
Passed: 5
Failed: 1
Pass rate: 83.3%
Regression test complete - check results.json for full report.
Integrate Promptfoo into GitHub Actions to block merges when pass rate drops below 80%. Store baseline results in version control and use promptfoo eval --grader openai:gpt-4o for LLM-as-judge semantic scoring, not just string matching. For ShopMax India, maintain separate test suites per agent type - returns, pricing, and inventory - so regressions are caught at the module level before they affect the full customer experience.
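One way to sketch that GitHub Actions gate is the workflow below. The workflow name, Node version, and secret name are assumptions, and check_pass_rate.py is a hypothetical helper script (not part of promptfoo) that exits nonzero when the pass rate in results.json falls below the given threshold.

```yaml
# .github/workflows/llm-regression.yml - a sketch; names are placeholders
name: LLM regression tests
on: [pull_request]
jobs:
  promptfoo-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g promptfoo
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: promptfoo eval -c promptfooconfig.yaml --output results.json
      - name: Enforce 80% pass-rate gate
        # check_pass_rate.py is a hypothetical script that fails the job
        # when the pass rate in results.json drops below 0.8
        run: python check_pass_rate.py results.json 0.8
```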