Jailbreak Detection and Prevention in LLM Applications
Author: Venkata Sudhakar
ShopMax India's customer-facing chatbot accepts free-text queries from users across its platform. Without safeguards, malicious users can craft jailbreak prompts designed to override system instructions, extract internal data, or generate harmful content. Jailbreak detection adds a lightweight classification layer before each LLM call, blocking adversarial inputs before they reach the model.
A guard model evaluates the incoming user message to determine whether the input attempts to override instructions, extract confidential information, or trigger policy-violating outputs. ShopMax India uses a two-stage approach: fast regex matching for known patterns, followed by an LLM-based classifier for ambiguous inputs. Caching the LLM guard results for repeated patterns minimises added latency.
The example below implements a two-stage guard that first checks for known jailbreak patterns with regex, then uses a secondary LLM call for ambiguous inputs before routing to the main chatbot.
Running it against a jailbreak attempt and then a legitimate product query gives the following output:
Request blocked (policy violation). Reason: ignore (all |previous |above |your )?instructions
For TVs under Rs 40,000 in Hyderabad, ShopMax India recommends:
1. Samsung 43-inch 4K UHD (Rs 32,990)
2. LG 43-inch NanoCell (Rs 38,500)
Both are available for same-day delivery in Hyderabad.
Log every blocked request with the matched pattern and session ID for security audits. Review the blocked-request logs weekly to identify new jailbreak patterns and add them to the regex list. Rate-limit users who trigger multiple blocks in a single session: three blocks within an hour should trigger a temporary ban. Never rely solely on the LLM guard; always combine it with regex for known patterns, since regex is faster and cheaper.