Online texts with toxic content are a clear threat to social media users in particular and to society in general. Although many platforms have adopted various countermeasures (e.g., machine-learning-based hate-speech detection systems), writers of toxic content try to evade them by cleverly modifying toxic words, producing so-called human-written text perturbations. To help detection models recognize such perturbations, prior work has developed sophisticated techniques for generating diverse adversarial samples. However, these algorithm-generated perturbations do not necessarily capture all the traits of perturbations written by humans. In this paper, we therefore introduce NoisyHate, a benchmark test set of human-written perturbations collected from real users on various social platforms, intended to help develop better toxic-speech detection models. We also apply spell-correction algorithms to the dataset to check whether the perturbations can be normalized back to their clean versions. Finally, we evaluate the dataset against state-of-the-art language models, such as BERT and RoBERTa, and black-box APIs, such as Perspective API, and show that adversarial attacks using real human-written perturbations remain effective.
CCS CONCEPTS
• Security and privacy → Social aspects of security and privacy; • Computing methodologies → Language resources; • Information systems → Social tagging.
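The abstract describes probing toxicity classifiers with clean versus human-perturbed text pairs. Below is a minimal illustrative sketch of such a probe, not the authors' evaluation code: the checkpoint name (unitary/toxic-bert) and the example text pairs are assumptions chosen only to make the snippet self-contained.

```python
# Sketch: compare a toxicity classifier's scores on clean vs. perturbed text.
# The model checkpoint and example pairs are illustrative assumptions, not
# the evaluation setup used in the paper.
from transformers import pipeline

clf = pipeline("text-classification", model="unitary/toxic-bert")

pairs = [
    # (clean version, human-written perturbation) -- illustrative examples only
    ("you are an idiot", "you are an 1d1ot"),
    ("shut up, loser", "shut up, l0ser"),
]

for clean, perturbed in pairs:
    clean_pred = clf(clean)[0]
    pert_pred = clf(perturbed)[0]
    # A large score drop on the perturbed text suggests the perturbation
    # evades the classifier, i.e., the adversarial attack is effective.
    print(f"{clean!r}: {clean_pred['label']} ({clean_pred['score']:.2f})  ->  "
          f"{perturbed!r}: {pert_pred['label']} ({pert_pred['score']:.2f})")
```

A spell-correction baseline, as mentioned in the abstract, could be slotted in by normalizing each perturbed string before scoring and checking how much of the score gap is recovered.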