Prompt Robustness Testing - Stress-Testing LLM Reliability

Author: Venkata Sudhakar

ShopMax India needs its product assistant to give stable answers even when customers phrase questions with typos, reordering, or different wording. Prompt robustness testing systematically generates semantically equivalent variations of each input prompt and measures consistency of LLM outputs to identify brittle behavior before it reaches production.

The approach generates paraphrased and perturbed versions of each test prompt, runs them through the LLM, and scores each output against the answer to the base prompt using the cosine similarity of their sentence embeddings. A high average similarity across variations indicates that the LLM is robust to input phrasing. Low similarity indicates that the model is sensitive to exact wording and may give inconsistent answers to the same underlying question.
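
The perturbation step can be sketched with a few rule-based transforms. This is only an illustration under stated assumptions: the helper names (swap_typo, keyword_query, reorder_keywords, make_variations) and the small stopword list are hypothetical, and true paraphrases would still have to be written by hand or generated separately.

```python
import random

# Hypothetical, deliberately small stopword list used only for this sketch.
STOPWORDS = {"what", "is", "the", "for", "at", "a", "an", "how", "can", "i", "and"}

def swap_typo(text, rng):
    """Simulate a typo by swapping two adjacent inner characters of one word."""
    words = text.split()
    # Only words with at least 4 characters qualify, so short words stay readable.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    pos = rng.randrange(1, len(w) - 2)  # first and last characters stay in place
    words[i] = w[:pos] + w[pos + 1] + w[pos] + w[pos + 2:]
    return " ".join(words)

def keyword_query(text):
    """Lowercase, strip punctuation, and drop stopwords, like a terse search query."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(w for w in cleaned.split() if w not in STOPWORDS)

def reorder_keywords(text, rng):
    """Shuffle the keywords to simulate a reordered query."""
    words = keyword_query(text).split()
    rng.shuffle(words)
    return " ".join(words)

def make_variations(prompt, seed=0):
    """Return the distinct perturbed versions of a prompt (seeded for repeatability)."""
    rng = random.Random(seed)
    candidates = [
        swap_typo(prompt, rng),
        keyword_query(prompt),
        reorder_keywords(prompt, rng),
    ]
    # Drop duplicates and any candidate identical to the original prompt.
    seen, variations = {prompt}, []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            variations.append(candidate)
    return variations

base_prompt = "What is the return policy for electronics at ShopMax India?"
for variation in make_variations(base_prompt):
    print(variation)
```

Fixing the seed keeps the generated variations identical between runs, so a change in similarity scores can be attributed to the model rather than to the test inputs.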

The example below tests prompt robustness for ShopMax India's return policy FAQ. It asks the base question once, then asks four rephrased variations of the same question and scores each answer against the base answer using semantic similarity.

from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

# With no api_key argument, the client reads the OPENAI_API_KEY environment variable.
client = OpenAI()
# Sentence-embedding model used to compare answers.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def get_answer(question):
    """Send one question to the model and return its text reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0,  # reduce sampling noise so score differences reflect phrasing
    )
    return resp.choices[0].message.content

def similarity(a, b):
    """Cosine similarity between the sentence embeddings of two texts."""
    ea = encoder.encode(a, convert_to_tensor=True)
    eb = encoder.encode(b, convert_to_tensor=True)
    return float(util.cos_sim(ea, eb)[0][0])

base = "What is the return policy for electronics at ShopMax India?"
# Rephrasings of the same question, including a terse, search-style query.
variations = [
    "How can I return an electronics item bought at ShopMax India?",
    "What is ShopMax India return policy for electronics?",
    "Can I return electronics at ShopMax India and how?",
    "return policy shopmax india electronics??"
]

# The answer to the base question is the reference for every comparison.
base_answer = get_answer(base)
scores = []
print("Prompt Robustness Test - ShopMax India")
print("=" * 40)
print(f"Base Q : {base}")
print(f"Base A : {base_answer[:60]}...")
print()

for v in variations:
    answer = get_answer(v)
    score = similarity(base_answer, answer)
    scores.append(score)
    # Per-variation cutoff: answers scoring below 0.80 are flagged as brittle.
    label = "ROBUST" if score >= 0.80 else "BRITTLE"
    print(f"[{label}] {v[:50]}")
    print(f"         Similarity: {score:.3f}")

avg = sum(scores) / len(scores)
print(f"\nAverage Robustness: {avg:.3f}")

A sample run gives output like the following (exact answers and scores vary from run to run):

Prompt Robustness Test - ShopMax India
========================================
Base Q : What is the return policy for electronics at ShopMax India?
Base A : ShopMax India offers a 7-day return policy for electronics...

[ROBUST] How can I return an electronics item bought at Sho
         Similarity: 0.891
[ROBUST] What is ShopMax India return policy for electronic
         Similarity: 0.912
[ROBUST] Can I return electronics at ShopMax India and how?
         Similarity: 0.873
[BRITTLE] return policy shopmax india electronics??
         Similarity: 0.762

Average Robustness: 0.860
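
For intuition about what the similarity values measure, the following pure-Python sketch computes cosine similarity directly. The three-dimensional vectors are made-up stand-ins for embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors), and the cosine_similarity helper is written here for illustration rather than taken from the script.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not real model output.
base_vec = [0.9, 0.1, 0.2]   # answer to the base question
close_vec = [0.8, 0.2, 0.2]  # answer that says nearly the same thing
far_vec = [0.1, 0.9, 0.3]    # answer that drifts to a different topic

print(f"close answer: {cosine_similarity(base_vec, close_vec):.3f}")
print(f"far answer  : {cosine_similarity(base_vec, far_vec):.3f}")
```

Vectors pointing in nearly the same direction score close to 1, regardless of their length, which is why answers with the same meaning but different wording can still score high.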

In production, run robustness tests on every new prompt template before deploying it. Maintain a library of 5 to 10 canonical variations per question category and set an average similarity threshold of 0.82, which is slightly stricter than the 0.80 per-variation cutoff used in the example. Prompts that score below the threshold need reformulation, typically by adding explicit instructions such as output format requirements or context constraints. Schedule robustness regression runs weekly to catch silent degradation after model updates.
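
A threshold check of this kind could be sketched as follows. The function names, the report layout, and the "warranty" scores are hypothetical; the "return_policy" scores reuse the four values from the sample run, and 0.82 is the average threshold suggested in the text.

```python
AVERAGE_THRESHOLD = 0.82  # average similarity threshold suggested in the text

def evaluate_categories(scores_by_category, threshold=AVERAGE_THRESHOLD):
    """Summarise similarity scores per question category."""
    report = {}
    for category, scores in scores_by_category.items():
        if not scores:
            raise ValueError(f"no scores recorded for category {category!r}")
        average = sum(scores) / len(scores)
        report[category] = {
            "average": average,
            "worst": min(scores),           # the most brittle variation
            "passed": average >= threshold,
        }
    return report

def failing_categories(report):
    """Names of the categories whose average falls below the threshold."""
    return sorted(name for name, row in report.items() if not row["passed"])

# Hypothetical regression-run data; only "return_policy" comes from the sample run.
weekly_scores = {
    "return_policy": [0.891, 0.912, 0.873, 0.762],
    "warranty": [0.70, 0.81, 0.78, 0.75],
}

report = evaluate_categories(weekly_scores)
for name, row in report.items():
    status = "PASS" if row["passed"] else "FAIL"
    print(f"[{status}] {name}: avg={row['average']:.3f} worst={row['worst']:.3f}")
print("Needs reformulation:", failing_categories(report))
```

Reporting the worst score next to the average helps because a single brittle variation can hide behind an otherwise passing average, as the terse query does in the sample run.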
