Due to the significant increase in Internet activity since the COVID-19 pandemic, a large volume of informal, unstructured, offensive, and even misspelled text has been used for online communication on various social media platforms. Offensive texts in Bengali and Banglish (Bengali words written in English script) have recently been widely used to harass and criticize people on various social media. A thorough review of the literature reveals that limited work has been done to identify offensive Bengali texts. In this study, we have engineered a detection mechanism using natural language processing to identify Bengali and Banglish offensive messages on social media that could abuse other people. First, different classifiers were employed as baseline classifiers to classify offensive text from real-life datasets. Then, we applied boosting algorithms on top of the baseline classifiers. AdaBoost (adaptive boosting) is a highly effective ensemble method that enhances the outcomes of the base classifiers. The long short-term memory (LSTM) model eliminates long-term dependency problems when classifying text, but it is prone to overfitting; AdaBoost, in contrast, has strong predictive ability and does not overfit easily. By combining these two powerful and complementary models, we propose L-Boost, a modified AdaBoost algorithm that uses bidirectional encoder representations from transformers (BERT) with LSTM models. We tested the L-Boost model on three separate datasets, including the BERT pretrained word-embedding vector model. Our proposed L-Boost outperforms all the baseline classification algorithms, reaching an accuracy of 95.11%.

INDEX TERMS Offensive text, social media harassment, natural language processing, ensemble learning, BERT model.
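The adaptive reweighting step that makes AdaBoost "enhance the outcomes of the classifiers" can be sketched as follows. This is a minimal, generic SAMME-style round for binary labels in {-1, +1}; the abstract does not give L-Boost's exact update, so the BERT+LSTM base learner is only imagined here as the source of `predictions`.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One round of AdaBoost for binary labels in {-1, +1}.

    `weights` are the current example weights, `predictions` are the
    base learner's outputs (in L-Boost, a BERT+LSTM classifier would
    supply these), and `labels` are the gold labels. Returns the
    learner's vote weight (alpha) and re-normalized example weights,
    which up-weight the misclassified examples for the next round.
    """
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)    # learner's vote weight
    new_weights = [
        w * math.exp(-alpha * p * y)           # grows when p != y
        for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]
```

With one of four equally weighted examples misclassified (error 0.25), the misclassified example's weight rises to 0.5 after renormalization, so the next base learner concentrates on it.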
Offensive messages on social media have recently been used frequently to harass and criticize people. In recent studies, many promising algorithms have been developed to identify offensive texts. However, most algorithms analyze text in a unidirectional manner, whereas a bidirectional method can capture the semantic and contextual information in sentences and improve performance. In addition, while many separate models exist for identifying monolingual or multilingual offensive texts, few models can detect both. In this study, a detection system has been developed for both monolingual and multilingual offensive texts by combining a deep convolutional neural network with bidirectional encoder representations from transformers (Deep-BERT) to identify offensive posts on social media that are used to harass others. This paper explores a variety of ways to deal with multilingualism, including collaborative multilingual and translation-based approaches. The Deep-BERT model was then tested on Bengali and English datasets with different BERT pre-trained word-embedding techniques, and the proposed Deep-BERT outperformed all existing offensive text classification algorithms, reaching an accuracy of 91.83%. The proposed model is a state-of-the-art model that can classify both monolingual and multilingual offensive texts.
Fake news detection techniques are a topic of interest due to the vast abundance of fake news data accessible via social media. Existing fake news detection systems perform satisfactorily on well-balanced data; however, when the dataset is imbalanced, these models perform poorly. Additionally, manual labeling of fake news data is time-consuming, even though ample fake news already circulates on the internet. We therefore introduce a text augmentation technique with a Bidirectional Encoder Representations from Transformers (BERT) language model to generate an augmented dataset composed of synthetic fake data. The proposed approach overcomes the minority-class issue and performs the classification with the AugFake-BERT model (trained on the augmented dataset). The proposed strategy is evaluated against twelve state-of-the-art models and outperforms them with an accuracy of 92.45%. Moreover, accuracy, precision, recall, and F1-score metrics are used to evaluate the proposed strategy and demonstrate that a balanced dataset significantly affects classification performance.
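The minority-class balancing idea can be sketched as generating synthetic variants of minority examples until the class counts match a target. As a deliberate simplification, a synonym table stands in below for BERT's masked-token predictions; the function names and parameters are illustrative, not from the paper.

```python
import random

def augment_minority(texts, labels, minority_label, synonyms,
                     target_count, seed=0):
    """Balance the minority (fake-news) class by appending synthetic
    variants, in the spirit of BERT-based masked-token augmentation.
    A random token in a random minority example is swapped for a
    synonym (stand-in for a BERT fill-mask prediction)."""
    rng = random.Random(seed)
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    out_texts, out_labels = list(texts), list(labels)
    while sum(1 for y in out_labels if y == minority_label) < target_count:
        tokens = rng.choice(minority).split()
        i = rng.randrange(len(tokens))
        # Fall back to the original token when no synonym is known.
        tokens[i] = rng.choice(synonyms.get(tokens[i].lower(), [tokens[i]]))
        out_texts.append(" ".join(tokens))
        out_labels.append(minority_label)
    return out_texts, out_labels
```

Swapping in an actual fill-mask model at the perturbation step yields the kind of synthetic fake-news samples the abstract describes, while the balancing loop itself stays unchanged.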