Prompt Robustness Testing - Stress-Testing LLM Reliability
Author: Venkata Sudhakar
ShopMax India needs its product assistant to give stable answers even when customers phrase questions with typos, reordering, or different wording. Prompt robustness testing systematically generates semantically equivalent variations of each input prompt and measures consistency of LLM outputs to identify brittle behavior before it reaches production.
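One way to produce such variations mechanically is sketched below. It is an illustration only: the helper names (`swap_typo`, `keyword_style`, `make_variations`) and the stopword list are hypothetical and not part of the original example, and true paraphrases would still need to be written by hand or generated by a model.

```python
import random

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"what", "is", "the", "for", "at", "a", "an", "of"}

def swap_typo(text, rng):
    """Swap two adjacent inner letters of one randomly chosen word."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3 and w.isalpha()]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    pos = rng.randrange(1, len(w) - 2)  # first and last letters stay in place
    words[i] = w[:pos] + w[pos + 1] + w[pos] + w[pos + 2:]
    return " ".join(words)

def keyword_style(text):
    """Lowercase, strip punctuation, and drop stopwords (search-box phrasing)."""
    cleaned = (w.strip("?.,!").lower() for w in text.split())
    return " ".join(w for w in cleaned if w and w not in STOPWORDS)

def make_variations(prompt, seed=0):
    """Return a small, reproducible list of perturbed prompts."""
    rng = random.Random(seed)  # seeded so test runs are repeatable
    return [swap_typo(prompt, rng), keyword_style(prompt), prompt.lower()]

base = "What is the return policy for electronics at ShopMax India?"
for variation in make_variations(base):
    print(variation)
```

Seeding the random generator keeps the perturbed prompts identical between runs, so a change in scores can be attributed to the model or the prompt template rather than to the test inputs.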
The approach generates paraphrased and perturbed versions of each test prompt, runs them through the LLM, and scores output consistency using cosine similarity of sentence embeddings. A high average similarity across variations means the LLM is robust to input phrasing. Low similarity indicates the model is sensitive to exact wording and may give inconsistent answers to the same underlying question.
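The scoring step can be sketched in plain Python as follows. This is a minimal sketch, assuming the answers have already been turned into embedding vectors by a sentence-embedding model of your choice; the function names and the toy vectors are illustrative and not taken from the original example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as having no similarity
    return dot / (norm_a * norm_b)

def average_robustness(base_vec, variation_vecs):
    """Score each variation answer against the base answer, then average."""
    scores = [cosine_similarity(base_vec, v) for v in variation_vecs]
    return sum(scores) / len(scores), scores

# Toy 3-dimensional vectors standing in for real sentence embeddings.
base_answer = [0.2, 0.7, 0.1]
variation_answers = [[0.2, 0.6, 0.2], [0.1, 0.7, 0.1], [0.7, 0.1, 0.2]]
average, scores = average_robustness(base_answer, variation_answers)
print([round(s, 3) for s in scores], round(average, 3))
```

Note that this measures agreement with the base answer, not correctness: if the base answer is wrong, consistently similar answers will still score well.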
The example below tests prompt robustness for ShopMax India's return policy FAQ. It runs four rephrased variations of the same question and scores each answer against the base answer using semantic similarity.
It gives the following output:
Prompt Robustness Test - ShopMax India
========================================
Base Q : What is the return policy for electronics at ShopMax India?
Base A : ShopMax India offers a 7-day return policy for electronics...
[ROBUST] How can I return an electronics item bought at ShopMax
Similarity: 0.891
[ROBUST] What is ShopMax India return policy for electronics?
Similarity: 0.912
[ROBUST] Can I return electronics at ShopMax India and how?
Similarity: 0.873
[BRITTLE] return policy shopmax india electronics??
Similarity: 0.762
Average Robustness: 0.860
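The labels and the average in this report can be reproduced from the four similarity scores alone. The sketch below assumes a per-variation cutoff of 0.82; the output does not state the cutoff that was used, and any value between 0.762 and 0.873 would give the same labels. The names `REPORTED`, `label`, and `summarize` are illustrative.

```python
# Similarity scores copied from the report above.
REPORTED = [
    ("How can I return an electronics item bought at ShopMax", 0.891),
    ("What is ShopMax India return policy for electronics?", 0.912),
    ("Can I return electronics at ShopMax India and how?", 0.873),
    ("return policy shopmax india electronics??", 0.762),
]

CUTOFF = 0.82  # assumption: the per-variation cutoff is not stated in the report

def label(score, cutoff=CUTOFF):
    """Tag one variation as ROBUST or BRITTLE."""
    return "ROBUST" if score >= cutoff else "BRITTLE"

def summarize(results, cutoff=CUTOFF):
    """Return per-variation labels and the mean similarity."""
    labelled = [(label(score, cutoff), question, score) for question, score in results]
    average = sum(score for _, score in results) / len(results)
    return labelled, average

labelled, average = summarize(REPORTED)
for tag, question, score in labelled:
    print(f"[{tag}] {question}")
    print(f"  Similarity: {score:.3f}")
print(f"Average Robustness: {average:.3f}")
```

The only variation tagged as brittle is the keyword-style query with no sentence structure, which suggests the assistant is most sensitive to terse, search-box phrasing.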
In production, run robustness tests on every new prompt template before deploying it. Maintain a library of 5 to 10 canonical variations per question category and set an average similarity threshold of 0.82. Prompts that score below the threshold need reformulation, typically by adding explicit instructions such as output format requirements or context constraints. Schedule robustness regression runs weekly to catch silent degradation after model updates.
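A pre-deployment gate along these lines might look like the sketch below. It assumes the similarity scores for each prompt template have already been computed; the function name `gate`, the status strings, and the sample scores are hypothetical and only illustrate the rule.

```python
from statistics import mean

def gate(template_scores, threshold=0.82, min_variations=5):
    """Map each template name to 'deploy', 'reformulate', or 'too_few_variations'."""
    report = {}
    for name, scores in template_scores.items():
        if len(scores) < min_variations:
            # Fewer than the recommended 5 variations: the average is not trusted.
            report[name] = "too_few_variations"
        elif mean(scores) < threshold:
            report[name] = "reformulate"
        else:
            report[name] = "deploy"
    return report

# Made-up scores, for illustration only.
sample = {
    "return_policy": [0.91, 0.88, 0.86, 0.90, 0.84],
    "warranty": [0.79, 0.81, 0.74, 0.80, 0.77, 0.78],
    "delivery": [0.90, 0.92, 0.89],
}
for name, status in gate(sample).items():
    print(f"{name}: {status}")
```

Running the same gate on a weekly schedule against a fixed variation library gives a like-for-like comparison across model updates.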