2017
DOI: 10.1017/s1351324917000298
To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Abstract: We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is a common approach to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension o…

Cited by 20 publications (9 citation statements)
References 23 publications
“…We also added three emoji sentiment features, which consist of the positive, negative, and overall sentiment scores based on the Emoji Sentiment Ranking (Novak et al, 2015). We performed feature selection for the n-gram features using a filtering approach with information gain, which has proven to be effective in social media sentiment classification (Kübler et al, 2018).…”
Section: Model Details (mentioning)
confidence: 99%
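
The statement above refers to filter-based feature selection with information gain over n-gram features. As a minimal sketch of that idea, the snippet below ranks word uni- and bigrams by mutual information (equivalent to information gain for feature selection) with the class label and keeps the top k. It uses scikit-learn and invented toy reviews, not the data or exact configuration of the cited work.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy labeled reviews (illustrative only; 1 = positive, 0 = negative).
reviews = [
    "loved this recipe, will definitely make it again",
    "way too salty and bland, a complete disappointment",
    "perfect weeknight dinner, my kids loved it",
    "soggy crust and no flavor, would not recommend",
]
labels = [1, 0, 1, 0]

# Word uni- and bigram counts give a large, sparse feature set.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

# Filter approach: score each n-gram by mutual information (information
# gain) with the label, independently of any classifier, and keep the top k.
k = min(20, X.shape[1])
selector = SelectKBest(score_func=mutual_info_classif, k=k)
X_reduced = selector.fit_transform(X, labels)

kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(X.shape, "->", X_reduced.shape)
print(list(kept)[:10])
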
“…Supervised learning uses labeled data to build a classification model, which is subsequently used to predict class labels for (unlabeled) test data. Supervised learning techniques have extensively been used for sentiment analysis [7], [10], [27]- [30]. The limitation of such techniques, however, is the requirement of labeled data.…”
Section: Have Explored Twitter Data (mentioning)
confidence: 99%
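
The first sentence of this statement describes the basic supervised setup: fit a classification model on labeled examples, then use it to predict class labels for unseen text. Here is a minimal sketch with scikit-learn; the toy data and the choice of a TF-IDF plus logistic-regression pipeline are assumptions for illustration, not the setup of any cited paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled training data (invented examples; 1 = positive, 0 = negative).
train_texts = [
    "great recipe, turned out perfectly",
    "bland and disappointing, would not make again",
    "absolutely delicious, the whole family loved it",
    "burnt and inedible, a waste of ingredients",
]
train_labels = [1, 0, 1, 0]

# Build the classification model from the labeled data ...
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# ... then predict class labels for unlabeled test data.
print(model.predict(["quick, easy and tasty", "far too salty for my taste"]))
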
“…Secondly, it aims to filter out the noise and the less relevant features to avoid overfitting. According to [30], feature selection could be mainly categorized into the filter method and the wrapper method. The filter method would generally evaluate the features by assigning them a ranking score based on the distributional statistics in the data.…”
Section: Feature Selection (mentioning)
confidence: 99%
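
As the statement notes, a filter method scores each feature from its distribution over the classes, without training a downstream classifier. A small sketch using scikit-learn's chi-squared statistic as one such distributional ranking score follows; the data and the choice of chi-squared are assumptions for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = [
    "amazing flavor, highly recommend",
    "terrible texture, threw it away",
    "simple and delicious, a keeper",
    "dry, tasteless and overcooked",
]
y = [1, 0, 1, 0]  # toy sentiment labels

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Rank every feature by a distributional statistic (here: chi-squared
# between term counts and the class label); no classifier is involved.
scores, _ = chi2(X, y)
order = np.argsort(scores)[::-1]
terms = vec.get_feature_names_out()
for i in order[:5]:
    print(f"{terms[i]}\t{scores[i]:.3f}")
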
“…The wrapper method on the other hand, would identify the optimal subset of the features using held-out data. However, since the number of subsets is exponential, the wrapper method is tremendously inefficient when a large feature set is involved, even with greedy algorithms [30]. Besides that, [31] pointed out that the filter method is generally faster compared to the wrapper method.…”
Section: Feature Selection (mentioning)
confidence: 99%
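
To make the contrast concrete, below is a hedged sketch of a greedy forward-selection wrapper: it repeatedly adds the single feature that most improves accuracy on held-out data and stops when no addition helps. The function name, the logistic-regression evaluator, and the stopping rule are assumptions for illustration, not the procedure of the cited works; the retraining inside the candidate loop is exactly what makes wrappers expensive on large feature sets.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def greedy_forward_selection(X_train, y_train, X_held, y_held, max_features=10):
    """Greedily add the feature that most improves held-out accuracy.

    Expects dense 2-D feature arrays; a sketch, not an optimized routine.
    """
    selected, best_score = [], 0.0
    remaining = list(range(X_train.shape[1]))
    while remaining and len(selected) < max_features:
        best_candidate = None
        for f in remaining:  # one retrained model per candidate: the costly part
            cols = selected + [f]
            clf = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
            score = accuracy_score(y_held, clf.predict(X_held[:, cols]))
            if score > best_score:
                best_score, best_candidate = score, f
        if best_candidate is None:
            break  # no single addition improves the held-out score
        selected.append(best_candidate)
        remaining.remove(best_candidate)
    return selected, best_score
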