The increase in abusive content on online social media platforms is affecting the social life of online users. Offensive language and hate speech are making social media toxic. Homophobia and transphobia take the form of offensive comments directed against the LGBT+ community. It is imperative to detect and handle these comments so that users indulging in such behaviour can be flagged or warned in a timely manner. However, automated detection of such content is a challenging task, more so in Dravidian languages, which are identified as low-resource languages. Motivated by this, the paper explores the applicability of different deep learning models for classifying social media comments in Malayalam and Tamil as homophobic, transphobic, or non-anti-LGBT+ content. The popularly used deep learning models, Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) using GloVe embedding, and transformer-based learning models (Multilingual BERT and IndicBERT) are applied to the classification problem. The results show that IndicBERT outperforms the other implemented models, with weighted average F1-scores of 0.86 and 0.77 for Malayalam and Tamil, respectively. The present work therefore confirms the higher performance of IndicBERT on the given task in the selected Dravidian languages.
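For readers unfamiliar with the reported metric, the following is a minimal sketch of how a weighted average F1-score can be computed for a three-class task like the one described. The function name `weighted_f1`, the label strings, and the toy data are illustrative assumptions; they are not taken from the paper, and the authors' own evaluation code is not shown in the source.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted average F1: per-class F1 weighted by class support.

    Hypothetical helper for illustration only.
    """
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Each class contributes in proportion to its share of the data.
        score += f1 * support[label] / total
    return score


# Toy labels (assumed names, not the dataset's actual label strings).
gold = ["none", "none", "homophobic", "transphobic"]
pred = ["none", "homophobic", "homophobic", "transphobic"]
print(round(weighted_f1(gold, pred), 2))
```

Weighting by support means that the majority class dominates the score, which is worth keeping in mind when class distributions are imbalanced.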
Over the years, social media has provided a medium for the creation and dissemination of opinions and thoughts through online platforms. While it allows users to express their views, sentiments and emotions, some people use it to generate and share unpleasant and hateful content. Such content is now referred to as hate speech, and it may target an individual, a group, a community, or a country. During the last few years, several techniques have been developed to automatically detect and identify hate speech and offensive and abusive content on social media platforms. However, the majority of these studies have focused on hate speech detection in English-language texts. With social media reaching deeper into different geographies, a significant amount of content is now generated in various languages. Though there have been significant advancements in algorithmic approaches for the task, the non-availability of suitable datasets in other languages hinders research progress in them. Hindi is one such widely spoken language for which such datasets are not available. This work attempts to bridge this research gap by presenting a curated and annotated dataset for target-based hate speech (TABHATE) in the Hindi language. The dataset comprises 2,020 tweets and is annotated by three independent annotators. A multiclass labelling is used where each tweet is labelled as: (i) individual targeting, (ii) community targeting, or (iii) none. Inter-annotator agreement is computed. The suitability of the dataset is then further explored by applying some standard deep learning and transformer-based models to the task of hate speech detection. The experimental results show that the dataset can be used for experimental work on hate speech detection in Hindi-language texts.
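The abstract does not say which inter-annotator agreement measure was used. As one possible illustration for a setting with three annotators and three labels, the sketch below computes Fleiss' kappa, a common choice when more than two raters label every item. The choice of Fleiss' kappa, the function name `fleiss_kappa`, and the toy ratings are all assumptions made for illustration, not details from the paper.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    Hypothetical helper; `ratings` is a list of per-item label lists.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who gave item i the category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        return 1.0  # every rating is the same category; agreement is trivially perfect
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: two tweets, three annotators each (labels are assumed names).
labels = ["individual", "community", "none"]
toy = [
    ["individual", "individual", "community"],
    ["none", "none", "none"],
]
print(round(fleiss_kappa(toy, labels), 3))
```

Values near 1 indicate strong agreement beyond chance, and values near 0 indicate agreement no better than chance.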