LLM Output Toxicity Filtering with Detoxify
Author: Venkata Sudhakar
ShopMax India's AI assistant must never generate offensive, threatening, or harmful content in its responses - even when handling frustrated customers or edge-case inputs. Detoxify is a Python library that uses pre-trained BERT-based models to classify text across six toxicity dimensions in real time, enabling post-generation filtering before responses reach customers.
Detoxify scores text across six dimensions: toxicity, severe_toxicity, obscene, threat, insult, and identity_attack (the exact label set varies slightly between model variants). Each score ranges from 0 to 1, and a configurable threshold (typically 0.5) determines whether content is blocked. The library ships three model variants: original (English), unbiased (trained to reduce unintended bias around identity terms), and multilingual. Classification takes tens of milliseconds per response on CPU, making it suitable for inline filtering in production LLM pipelines.
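For illustration, a minimal scoring check might look like the sketch below. The example sentence is made up and the printed values are not real outputs; Detoxify's predict call and its dict-of-scores return value are as documented.

from detoxify import Detoxify

# Load the debiased English model once at startup (weights download on first use).
detector = Detoxify("unbiased")

# predict() returns a dict mapping each dimension to a score in [0, 1].
scores = detector.predict("Your delivery service is useless and so are you.")

THRESHOLD = 0.5  # configurable blocking threshold
blocked = any(score > THRESHOLD for score in scores.values())
print(blocked, {label: round(float(score), 3) for label, score in scores.items()})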
The example below wraps an OpenAI chat completion call with Detoxify output filtering for ShopMax India's customer support bot. If any toxicity dimension of the LLM response scores above the threshold, the response is replaced with a safe fallback message.
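A sketch of such a wrapper follows, assuming the current openai Python client (chat.completions) with an API key supplied via the environment; the model name, system prompt, and fallback text are illustrative placeholders rather than ShopMax India's actual configuration.

from openai import OpenAI
from detoxify import Detoxify

client = OpenAI()                 # reads OPENAI_API_KEY from the environment
detector = Detoxify("unbiased")   # debiased English model

THRESHOLD = 0.5
FALLBACK = "I apologize, I cannot provide a response to that query..."

def safe_reply(user_query: str) -> str:
    """Generate a reply, then block it if Detoxify flags any dimension."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You are ShopMax India's customer support assistant."},
            {"role": "user", "content": user_query},
        ],
    )
    response = completion.choices[0].message.content

    # Score the generated text across all toxicity dimensions.
    scores = detector.predict(response)
    flagged = [label for label, score in scores.items() if score > THRESHOLD]

    if flagged:
        print(f"[BLOCKED] {user_query}")
        print(f"Flagged: {flagged}")
        return FALLBACK

    print(f"[OK] {user_query}")
    return response

if __name__ == "__main__":
    for query in [
        "What laptops do you have under Rs 40000 in Mumbai?",
        "Write an angry threatening message about a broken TV delivery",
        "How do I track my order from the Chennai warehouse?",
    ]:
        print(f"Response: {safe_reply(query)}")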
Running it against the three sample queries produces the following output:
[OK] What laptops do you have under Rs 40000 in Mumbai?
Response: Here are some laptops under Rs 40000 at ShopMax India...
[BLOCKED] Write an angry threatening message about a broken TV delivery
Flagged: ['threat', 'toxicity']
Response: I apologize, I cannot provide a response to that query...
[OK] How do I track my order from the Chennai warehouse?
Response: You can track your ShopMax India order using your order ID...
In production, run Detoxify synchronously in the response pipeline - it adds under 50 ms per response on CPU. Use the unbiased model variant to reduce false positives on demographic language. Log every blocked response with its query, flagged dimensions, and timestamp to a monitoring table. Review the blocked logs weekly to tune thresholds per dimension - threat and severe_toxicity warrant a lower threshold of 0.3, while obscene can stay at 0.5 for a retail context.
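As an illustration of per-dimension tuning, the sketch below uses the starting thresholds suggested above; the THRESHOLDS mapping and flagged_dimensions helper are hypothetical names, and the values should be revised from the weekly review of the blocked-response log.

# Hypothetical per-dimension thresholds for a retail support context.
THRESHOLDS = {
    "toxicity": 0.5,
    "severe_toxicity": 0.3,
    "obscene": 0.5,
    "threat": 0.3,
    "insult": 0.5,
    "identity_attack": 0.5,
}

def flagged_dimensions(scores: dict) -> list[str]:
    """Return the dimensions whose score crosses their own threshold."""
    return [label for label, score in scores.items()
            if score > THRESHOLDS.get(label, 0.5)]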