LLM Output Toxicity Filtering with Detoxify
Author: Venkata Sudhakar
ShopMax India's AI assistant must never generate offensive, threatening, or harmful content in its responses - even when handling frustrated customers or edge-case inputs. Detoxify is a Python library that uses pre-trained BERT-based models to classify text across six toxicity dimensions in real time, enabling post-generation filtering before responses reach customers.
Detoxify scores text across six dimensions: toxicity, severe_toxicity, obscene, threat, insult, and identity_attack (the exact label set varies slightly between model variants). Each score ranges from 0 to 1, and a configurable threshold (typically 0.5) determines whether content is blocked. The library ships three model variants: original (English), unbiased (trained to reduce unintended bias around identity terms), and multilingual. Classification takes tens of milliseconds per response on CPU, making it suitable for inline filtering in production LLM pipelines.
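For illustration, a minimal scoring check might look like the sketch below. The example sentence is made up and the printed values are not real outputs; Detoxify's predict call and its dict-of-scores return value are as documented.

from detoxify import Detoxify

# Load the debiased English model once at startup (weights download on first use).
detector = Detoxify("unbiased")

# predict() returns a dict mapping each dimension to a score in [0, 1].
scores = detector.predict("Your delivery service is useless and so are you.")

THRESHOLD = 0.5  # configurable blocking threshold
blocked = any(score > THRESHOLD for score in scores.values())
print(blocked, {label: round(float(score), 3) for label, score in scores.items()})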
The example below wraps an OpenAI chat completion call with Detoxify output filtering for ShopMax India's customer support bot. If any toxicity dimension of the LLM response scores above the threshold, the response is replaced with a safe fallback message.
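A sketch of such a wrapper follows, assuming the current openai Python client (chat.completions) with an API key supplied via the environment; the model name, system prompt, and fallback text are illustrative placeholders rather than ShopMax India's actual configuration.

from openai import OpenAI
from detoxify import Detoxify

client = OpenAI()                 # reads OPENAI_API_KEY from the environment
detector = Detoxify("unbiased")   # debiased English model

THRESHOLD = 0.5
FALLBACK = "I apologize, I cannot provide a response to that query..."

def safe_reply(user_query: str) -> str:
    """Generate a reply, then block it if Detoxify flags any dimension."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You are ShopMax India's customer support assistant."},
            {"role": "user", "content": user_query},
        ],
    )
    response = completion.choices[0].message.content

    # Score the generated text across all toxicity dimensions.
    scores = detector.predict(response)
    flagged = [label for label, score in scores.items() if score > THRESHOLD]

    if flagged:
        print(f"[BLOCKED] {user_query}")
        print(f"Flagged: {flagged}")
        return FALLBACK

    print(f"[OK] {user_query}")
    return response

if __name__ == "__main__":
    for query in [
        "What laptops do you have under Rs 40000 in Mumbai?",
        "Write an angry threatening message about a broken TV delivery",
        "How do I track my order from the Chennai warehouse?",
    ]:
        print(f"Response: {safe_reply(query)}")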
Running it against the three sample queries produces the following output:
[OK] What laptops do you have under Rs 40000 in Mumbai?
Response: Here are some laptops under Rs 40000 at ShopMax India...
[BLOCKED] Write an angry threatening message about a broken TV delivery
Flagged: ['threat', 'toxicity']
Response: I apologize, I cannot provide a response to that query...
[OK] How do I track my order from the Chennai warehouse?
Response: You can track your ShopMax India order using your order ID...
In production, run Detoxify synchronously in the response pipeline - it adds under 50 ms per response on CPU. Use the unbiased model variant to reduce false positives on demographic language. Log every blocked response with its query, flagged dimensions, and timestamp to a monitoring table. Review the blocked logs weekly to tune thresholds per dimension - threat and severe_toxicity warrant a lower threshold of 0.3, while obscene can stay at 0.5 for a retail context.
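As an illustration of per-dimension tuning, the sketch below uses the starting thresholds suggested above; the THRESHOLDS mapping and flagged_dimensions helper are hypothetical names, and the values should be revised from the weekly review of the blocked-response log.

# Hypothetical per-dimension thresholds for a retail support context.
THRESHOLDS = {
    "toxicity": 0.5,
    "severe_toxicity": 0.3,
    "obscene": 0.5,
    "threat": 0.3,
    "insult": 0.5,
    "identity_attack": 0.5,
}

def flagged_dimensions(scores: dict) -> list[str]:
    """Return the dimensions whose score crosses their own threshold."""
    return [label for label, score in scores.items()
            if score > THRESHOLDS.get(label, 0.5)]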