MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data

Hasib, Khan Md; Azam, Sami; Karim, Asif; Marouf, Ahmed Al; Shamrat, F M Javed Mehedi; Montaha, Sidratul; Yeo, Kheng Cher; Jonkman, Mirjam; Alhajj, Reda; Rokne, Jon G.

doi:10.1109/access.2023.3309697

Cited by 29 publications

(4 citation statements)

References 52 publications


“…The methods of handling imbalanced data can be divided into algorithmic-level and data-level methods [41]. Algorithmic-level methods focus on designing new classification algorithms or enhancing existing ones (for example, [42][43][44]), while data-level methods attempt to balance the data by reducing the majority class or expanding the minority class.…”

Section: Handling Class Imbalance (mentioning)

confidence: 99%

GreenRu: A Russian Dataset for Detecting Mentions of Green Practices in Social Media Posts

Zakharova, Glazkova (2024), Applied Sciences


Green practices are social practices that aim to harmonize the relations between people and the natural environment. They may involve minimizing the use of resources and the generation of waste and emissions. Detecting green practices in social media posts helps to understand which green practices are currently common and to develop recommendations on the scaling of green practices to reduce environmental problems. This paper describes GreenRu, a novel Russian social media dataset for detecting the mentions of green practices related to waste management. It has a sentence-level markup and consists of 1326 posts collected in Russian online communities. The total number of mentions of green waste practices is 3765. The paper assessed the effectiveness of the multi-label and one-versus-rest BERT-based models for detecting the mentions of green practices in social media posts and compared several data augmentation methods in terms of both classification metrics and human evaluation. To augment the dataset, a backtranslation method and generative language models, such as RuGPT, RuT5, and ChatGPT, were used in this study. The results enable researchers to monitor the green waste practices on social networks and develop environmental policies. Additionally, GreenRu can support machine learning models to analyze social media content, assess the prevalence and effectiveness of green waste practices, and identify ways to expand them.



“…Data balancing is an important task to reduce model skewness and as such several works have made use of oversampling approaches to reduce model overfitting and increase performance [7][8][9]. Sarakit et al [10] employed the SMOTE method to detect emotion in unbalanced YouTube datasets using three machine learning classifiers.…”

Section: Related Work (mentioning)

confidence: 99%

Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering

Mujahid, Kına, Rustam et al. (2024), J Big Data


The classification of imbalanced datasets is a prominent task in text mining and machine learning. The number of samples in each class is not uniformly distributed; one class contains a large number of samples while the other has a small number. Overfitting of the model occurs as a result of imbalanced datasets, resulting in poor performance. In this study, we compare different oversampling techniques like synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), Border-line SMOTE, K-means SMOTE, and adaptive synthetic (ADASYN) oversampling to address the issue of imbalanced datasets and enhance the performance of machine learning models. Preprocessing significantly enhances the quality of input data by reducing noise, redundant data, and unnecessary data. This enables the machines to identify crucial patterns that facilitate the extraction of significant and pertinent information from the preprocessed data. This study preprocesses the data using various top-level preprocessing steps. Furthermore, two imbalanced Twitter datasets are used to compare the performance of oversampling techniques with six machine learning models including random forest (RF), SVM, K-nearest neighbor (KNN), AdaBoost (ADA), logistic regression (LR), and decision tree (DT). In addition, the bag of words (BoW) and term frequency and inverse document frequency (TF-IDF) feature extraction approaches are used to extract features from the tweets. The experiments indicate that SMOTE and ADASYN perform much better than the other techniques, providing higher accuracy. Additionally, overall results show that SVM with ’linear’ kernel tends to attain the highest accuracy and recall scores of 99.67% and 1.00, respectively, on ADASYN oversampled datasets and 99.57% accuracy on SMOTE oversampled dataset with TF-IDF features. The SVM model using 10-fold cross-validation experiments achieved 97.40 mean accuracy with a 0.008 standard deviation. Our approach achieved 2.62% greater accuracy as compared to other current methods.
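The oversampling idea behind SMOTE, named in the abstract above, can be illustrated with a minimal sketch. This is not the cited authors' code: the function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for illustration. It shows only the core step, which is creating a synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours, and it assumes at least two minority samples.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.

    Hypothetical helper for illustration; requires len(minority) >= 2.
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature tuples.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority_points, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies. Variants such as Borderline-SMOTE and ADASYN differ mainly in how they choose the base samples.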


“…Their approach utilizes a range of machine learning and deep learning models, with BERT reaching a maximum accuracy of 99.04% in balanced datasets and 72.23% in imbalanced datasets. Another noteworthy contribution by (Hasib et al, 2023a) introduces MCNN-LSTM, a novel fusion of CNN and LSTM for news text classification. After balancing the dataset using the Tomek-Link algorithm, their model attains remarkable performance, achieving a 98% F1-score and 99.71% accuracy compared to prior research.…”

Section: Handling Class Imbalance (mentioning)

confidence: 99%

Embeddings at BLP-2023 Task 2: Optimizing Fine-Tuned Transformers with Cost-Sensitive Learning for Multiclass Sentiment Analysis

Tonmoy (2023), Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)


In this study, we address the task of Sentiment Analysis for Bangla Social Media Posts, introduced in the first Workshop on Bangla Language Processing (Hasan et al., 2023a). Our research encountered two significant challenges in the context of sentiment analysis. The first challenge involved extensive training times and memory constraints when we chose to employ oversampling techniques for addressing class imbalance in an attempt to enhance model performance. Conversely, when opting for undersampling, the training time was optimal, but this approach resulted in poor model performance. These challenges highlight the complex trade-offs involved in selecting sampling methods to address class imbalances in sentiment analysis tasks. We tackle these challenges through cost-sensitive approaches aimed at enhancing model performance. In our initial submission during the evaluation phase, we ranked 9th out of 30 participants with an F1-micro score of 0.7088. Subsequently, through additional experimentation, we managed to elevate our F1-micro score to 0.7186 by leveraging the BanglaBERT-Large model in combination with the Self-adjusting Dice loss function. Our experiments highlight how modifying the loss function affects model performance. Our experimental data and source code can be found here.
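The citation statement for this entry mentions balancing the news dataset with the Tomek-Link algorithm. As a rough illustration of that idea (not the MCNN-LSTM authors' implementation), the sketch below finds Tomek links: pairs of samples from different classes that are each other's nearest neighbour. The function name `find_tomek_links` and the toy data are hypothetical, and the brute-force search is only suitable for small inputs.

```python
def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links:
    mutual nearest neighbours carrying different class labels.

    Hypothetical helper for illustration; brute force, O(n^2) distances.
    """

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature tuples.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(i):
        # Index of the closest other sample to sample i.
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sq_dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy one-dimensional data: samples 1 and 2 sit on the class boundary.
toy_points = [(0.0,), (1.0,), (1.2,), (5.0,)]
toy_labels = [0, 0, 1, 1]
toy_links = find_tomek_links(toy_points, toy_labels)
```

In Tomek-link undersampling, the majority-class member of each link (or both members) is typically dropped, which removes borderline or noisy samples rather than reducing the majority class at random.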

