Precision, Recall, and F1-score are metrics commonly used to evaluate model performance. Precision and Recall are important to consider when the data is balanced, whereas for imbalanced data the F1-score is the most informative metric. Determining the relevance of these metrics requires a comparative analysis to establish which metric is appropriate for the data under study. This study performs a comparative analysis of several evaluation metrics on imbalanced data in multi-class text classification. It uses an imbalanced multi-class text dataset with four classes: association, negative, cause of disease, and treatment of disease. As the algorithm-level approach, the study involves five classifiers: Multinomial Naive Bayes, K-Nearest Neighbors, Support Vector Machine, Random Forest, and Long Short-Term Memory. As the data-level approach, it applies undersampling, oversampling, and the synthetic minority oversampling technique (SMOTE). The evaluation metrics used to assess model performance are Precision, Recall, and F1-score. The results show that the most suitable evaluation metric for imbalanced data depends on the purpose of use and the desired priority, as does the choice of classifier for handling multi-class tasks on imbalanced data. These findings can assist practitioners in selecting evaluation metrics that match the goals and application needs of multi-class text classification.
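To make concrete how these metrics are computed and why the averaging scheme matters on imbalanced multi-class data, the sketch below uses scikit-learn's standard metric functions; the class names mirror the four-class setting above, but the labels and predictions themselves are hypothetical, chosen only to produce a skewed class distribution.

```python
# A minimal sketch, assuming scikit-learn is available; the labels and
# predictions below are hypothetical, mirroring the four-class setting.
from sklearn.metrics import precision_recall_fscore_support

classes = ["association", "negative", "cause of disease", "treatment of disease"]

# Hypothetical ground truth with a skewed class distribution (8/4/2/1).
y_true = (["association"] * 8 + ["negative"] * 4 +
          ["cause of disease"] * 2 + ["treatment of disease"] * 1)
# Hypothetical predictions that favor the majority class.
y_pred = (["association"] * 8 + ["negative"] * 3 + ["association"] +
          ["cause of disease"] + ["association"] + ["association"])

# Macro averaging weights every class equally; weighted averaging
# weights each class by its support, so majority classes dominate.
for avg in ("macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=classes, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this kind of skew, macro averaging exposes the complete failure on the rarest class, while weighted averaging masks it behind the majority class's strong scores; this is the trade-off between averaging schemes that a comparative analysis of metrics on imbalanced data has to weigh.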