AI Model Poisoning - Detection and Prevention Strategies
Author: Venkata Sudhakar
ShopMax India fine-tunes language models on proprietary product data and customer interaction logs to improve recommendation quality. Fine-tuning datasets assembled from multiple sources carry a risk of data poisoning - malicious or corrupted examples embedded in the training data that cause the model to behave unexpectedly in production. Detecting poisoned samples before fine-tuning prevents backdoor behaviours and biased outputs from reaching customers.
Data poisoning attacks inject training examples that teach the model to produce specific outputs when it encounters a trigger phrase or pattern. Detection strategies include embedding-based outlier analysis, near-duplicate detection, label consistency checks, and loss spike monitoring during training. ShopMax India applies embedding-based outlier detection to screen all fine-tuning data before it enters the training pipeline, flagging samples that are statistically distant from the corpus centroid.
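One of the listed strategies, near-duplicate detection, catches poisoned samples that attackers replicate with small edits to amplify a trigger. A minimal sketch using character-shingle Jaccard similarity follows; the shingle size and 0.8 threshold are illustrative assumptions, not ShopMax India's production settings.

```python
def shingles(text, k=3):
    # Normalise whitespace and case, then take overlapping k-character shingles.
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def near_duplicates(samples, threshold=0.8):
    # Pairwise Jaccard similarity over shingle sets. O(n^2) comparisons,
    # which is acceptable for screening moderate fine-tuning sets; larger
    # corpora would use MinHash/LSH to avoid the quadratic pass.
    sh = [shingles(s) for s in samples]
    pairs = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            union = len(sh[i] | sh[j])
            if union and len(sh[i] & sh[j]) / union >= threshold:
                pairs.append((i, j))
    return pairs
```

Flagged pairs are candidates for deduplication or manual review, since legitimate datasets also contain benign repeats.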
The example below screens a ShopMax India fine-tuning dataset by computing embedding distances and flagging samples that are statistical outliers - potential poisoned or off-topic examples.
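A minimal, self-contained sketch of this screening step is shown below. It substitutes a hashed character-trigram embedder for ShopMax India's production sentence-embedding model so the example runs without external dependencies beyond NumPy; the distances in the sample output come from the real embedding model, so a run of this toy version will produce different numbers but the same flagging logic.

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    # Toy stand-in for a sentence-embedding model: hashed character
    # trigrams, L2-normalised. In production only this function changes,
    # to a call into the real embedding model.
    vec = np.zeros(dim)
    t = " ".join(text.lower().split())
    for i in range(len(t) - 2):
        bucket = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def flag_outliers(samples, sigma=2.0):
    # Cosine distance of each sample to the corpus centroid; anything
    # beyond mean + sigma * std is flagged for manual review.
    embs = np.array([embed(s) for s in samples])
    centroid = embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    dists = 1.0 - embs @ centroid
    threshold = dists.mean() + sigma * dists.std()
    print(f"Mean distance: {dists.mean():.4f} | "
          f"Threshold ({sigma:g}-sigma): {threshold:.4f}")
    flagged = []
    for i, (sample, dist) in enumerate(zip(samples, dists)):
        status = "FLAGGED" if dist > threshold else "OK"
        print(f"[{status}] Sample {i}: {sample} (dist={dist:.4f})")
        if dist > threshold:
            flagged.append(i)
    return flagged

if __name__ == "__main__":
    queries = [
        "Best laptop under Rs 50000?",
        "Do you deliver to Chennai?",
        "What is the return policy?",
        "Recommend headphones under Rs 5000.",
        "Tell me about competitor pricing.",
        "Best TV under Rs 30000?",
    ]
    flag_outliers(queries)
```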
A representative run produces the following output:
Mean distance: 0.1823 | Threshold (2-sigma): 0.3241
[OK] Sample 0: Best laptop under Rs 50000? (dist=0.1542)
[OK] Sample 1: Do you deliver to Chennai? (dist=0.1634)
[OK] Sample 2: What is the return policy? (dist=0.1721)
[OK] Sample 3: Recommend headphones under Rs 5000. (dist=0.1689)
[FLAGGED] Sample 4: Tell me about competitor pricing. (dist=0.3812)
[OK] Sample 5: Best TV under Rs 30000? (dist=0.1539)
Apply outlier detection at a 2-sigma threshold for a good balance of sensitivity and false positive rate. Manually review all flagged samples before removing them - some outliers are legitimate edge cases rather than poisoned data. Run loss spike analysis during training: if a batch shows a loss spike more than 3x the rolling average, log those samples for manual inspection. Maintain a data provenance log so every training sample can be traced back to its source for forensic investigation.
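The loss spike rule described above can be sketched as a small training-loop hook. This is a hedged illustration, not ShopMax India's actual monitoring code; the window size is an assumed parameter, and the 3x factor follows the guideline stated in the text.

```python
from collections import deque

class LossSpikeMonitor:
    """Flags batches whose loss exceeds `factor` times the rolling average."""

    def __init__(self, window=50, factor=3.0):
        self.losses = deque(maxlen=window)  # rolling window of recent batch losses
        self.factor = factor

    def check(self, batch_idx, loss):
        # Compare against the rolling average of prior batches, then
        # record the new loss. Returns True when the batch should be
        # logged for manual inspection.
        spike = bool(self.losses) and \
            loss > self.factor * (sum(self.losses) / len(self.losses))
        self.losses.append(loss)
        if spike:
            print(f"[SPIKE] batch {batch_idx}: loss={loss:.4f}")
        return spike
```

In a real pipeline the hook would also dump the offending batch's sample IDs, which is where the data provenance log pays off: each flagged sample can be traced back to its source.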