Opinion Mining (OM) is a field of Natural Language Processing (NLP) that aims to capture human sentiment in a given text. With the continued spread of online shopping websites, micro-blogging sites, and social media platforms, OM on online social media has piqued the interest of thousands of researchers, because the reviews, tweets, and blogs acquired from these networks are a significant source for enhancing the decision-making process. The obtained textual data (reviews, tweets, or blogs) are classified into three class labels (negative, neutral, and positive) in order to analyze and extract relevant information from the given dataset. In this contribution, we introduce an innovative MapReduce improved weighted ID3 decision tree classification approach for OM, which consists mainly of three aspects. Firstly, we have used several feature extractors to efficiently detect and capture the relevant data from the given tweets, including N-grams or character-level features, Bag-Of-Words, word embedding (GloVe, Word2Vec), FastText, and TF-IDF. Secondly, we have applied multiple feature selectors to reduce the high dimensionality of the features, including Chi-square, Gain Ratio, Information Gain, and Gini Index. Finally, we have employed the obtained features to carry out the classification task using an improved ID3 decision tree classifier, which calculates the weighted information gain instead of the information gain used in traditional ID3. To measure the weighted information gain for the current conditioned feature, we follow two steps: first, we compute the weighted correlation function of the current conditioned feature; second, we multiply the obtained weighted correlation function by the information gain of this feature. This work is implemented in a distributed environment using the Hadoop framework, with its programming model MapReduce and its distributed file system HDFS.
Its primary goal is to enhance the performance of the well-known ID3 classifier in terms of accuracy, execution time, and ability to handle massive datasets. We have carried out several experiments that aim to assess the effectiveness of our suggested classifier compared to other contributions chosen from the literature. The experimental results demonstrated that our ID3 classifier works better on the COVID-19_Sentiments dataset than other classifiers in terms of recall (85.72 %), specificity (86.51 %), error rate (11.18 %), false-positive rate (13.49 %), execution time (15.95 s), kappa statistic (87.69 %), F1-score (85.54 %), classification rate (88.82 %), false-negative rate (14.28 %), precision rate (86.67 %), convergence (it converges around iteration 90), stability (it is more stable, with a mean standard deviation of 0.12 %), and complexity (it requires much lower computational time and space complexity).
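The two-step weighted information gain described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the weighted correlation function, so `purity_weight` is a hypothetical stand-in (the fraction of samples that carry the majority label of their feature-value group), and all function names are ours rather than the authors'.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Classic ID3 information gain of one categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def purity_weight(feature_values, labels):
    """HYPOTHETICAL weight in [0, 1]; NOT the paper's correlation function.

    Fraction of samples whose label is the majority label of their
    feature-value group.
    """
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    majority = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return majority / len(labels)


def weighted_information_gain(feature_values, labels, weight_fn=purity_weight):
    """Step 1: compute the feature weight. Step 2: multiply it by the gain."""
    weight = weight_fn(feature_values, labels)
    return weight * information_gain(feature_values, labels)
```

In an ID3-style tree, the split feature at each node would then be the one that maximizes `weighted_information_gain` instead of plain `information_gain`; any other weight with the same signature can be passed as `weight_fn`.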
This contribution proposes a new model for sentiment analysis, which combines the convolutional neural network (CNN), the C4.5 decision tree algorithm, and a Fuzzy Rule-Based System (FRBS). Our suggested method consists of six parts. Firstly, we have applied several pre-processing techniques. Secondly, we have used the fastText method to vectorize the analysed tweets. Thirdly, we have implemented the CNN to extract and select the pertinent features from the tweets. Fourthly, we have fuzzified the CNN output using the Gaussian Fuzzification (GF) method to cope with vague data. Fifthly, we have applied fuzziness C4.5 to create the fuzzy rules. Finally, we have used the General Fuzziness Reasoning (GFR) approach to classify new tweets. In summary, our method integrates the advantages of the CNN and C4.5 techniques and overcomes the shortcomings of ambiguous data in the tweets using the FRBS, which consists of three phases: a fuzzification phase using GF, an inference mechanism using fuzziness C4.5, and a defuzzification phase using GFR. Also, to give our approach the ability to deal with massive data, we have implemented it on a Hadoop cluster of five computers. The experimental findings confirmed that our model performs very well compared to other models chosen from the literature.
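As a rough illustration of the fuzzification phase, a Gaussian membership function maps a crisp value to a degree of membership in each fuzzy set. The sketch below assumes a single CNN output score scaled to [0, 1] and three fuzzy sets whose centers and widths are hypothetical; the abstract gives neither the sets nor their parameters.

```python
import math


def gaussian_membership(x, center, sigma):
    """Gaussian membership degree: exp(-(x - center)^2 / (2 * sigma^2))."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


# HYPOTHETICAL fuzzy sets as (center, sigma) pairs; not taken from the paper.
FUZZY_SETS = {
    "negative": (0.0, 0.2),
    "neutral": (0.5, 0.2),
    "positive": (1.0, 0.2),
}


def fuzzify(score, fuzzy_sets=FUZZY_SETS):
    """Map one crisp score to a membership degree for every fuzzy set."""
    return {label: gaussian_membership(score, center, sigma)
            for label, (center, sigma) in fuzzy_sets.items()}
```

For example, `fuzzify(0.9)` gives the highest degree to `"positive"` while keeping small non-zero degrees for the other sets, and it is this overlap that lets a rule-based system handle vague inputs.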
Sentiment analysis is the process of recognizing and categorizing the emotions expressed in a textual source. Tweets, once analyzed, are commonly used to generate a large amount of sentiment data. These sentiment data help us learn about people's thoughts on a wide range of topics. People are typically interested in studying positive and negative reviews, which contain the likes and dislikes shared by consumers concerning the features of a certain service or product. Therefore, the aspects or features of the product or service play an important role in opinion mining. In addition to the substantial work being carried out in text mining, feature extraction in opinion mining is currently becoming a hot research field. In this paper, we focus on the study of feature extractors because of their importance to classification performance. Feature extraction is the most critical aspect of opinion classification, since classification efficiency can be degraded if features are not properly chosen. Few researchers have addressed the issue of feature extraction, and we found that almost every article in the literature deals with only one or two feature extractors. For that reason, we decided in this paper to cover the most popular feature extractors, which are BOW, N-grams, TF-IDF, Word2vec, GloVe, and FastText. In general, this paper discusses the existing feature extractors in the opinion mining domain and presents the advantages and disadvantages of each extractor. Moreover, a comparative study is performed to determine the most efficient CNN/extractor combination in terms of accuracy, precision, recall, and F1 measure.
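To make the count-based extractors concrete, the following minimal sketch implements BOW, word N-grams, and one common TF-IDF variant (relative term frequency times the natural log of the inverse document frequency) in plain Python. The tokenizer and all function names are illustrative assumptions, and the paper may use a different TF-IDF weighting; the embedding-based extractors (Word2vec, GloVe, FastText) need trained models and are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(tokens):
    """BOW: map each term to its raw count in the document."""
    return Counter(tokens)


def word_ngrams(tokens, n):
    """All contiguous word n-grams of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs_tokens):
    """TF-IDF per document: (count / doc length) * ln(N / document frequency)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors
```

With this weighting, a term that occurs in every document receives a score of zero, which illustrates why TF-IDF tends to suppress uninformative words that raw BOW counts would keep.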