Key word extraction for short text via word2vec, doc2vec, and textrank

Li, Jun; Huang, Guimin; Fan, Chunli; Sun, Zhenglin; Zhu, Hongqing

doi:10.3906/elk-1806-38

Cited by 43 publications

(34 citation statements)

References 13 publications


“…In the same fashion, the use of various feature extraction techniques has proven to improve classification accuracy. Text mining has many feature extraction methods but term frequency (TF), inverse document frequency (IDF), TF-IDF, word2vec and doc2vec are among the most commonly used feature extraction techniques [24]. The authors of [25] investigated the use of TF, IDF, and TF-IDF with linear classifiers including SVM, LR, and perceptron with a native language identification system.…”

Section: Literature Review (mentioning)

confidence: 99%

Tweets Classification on the Base of Sentiments for US Airline Companies

Rustam, Ashraf, Mehmood, et al. 2019

Entropy

151


The use of data from social networks such as Twitter has been increased during the last few years to improve political campaigns, quality of products and services, sentiment analysis, etc. Tweets classification based on user sentiments is a collaborative and important task for many organizations. This paper proposes a voting classifier (VC) to help sentiment analysis for such organizations. The VC is based on logistic regression (LR) and stochastic gradient descent classifier (SGDC) and uses a soft voting mechanism to make the final prediction. Tweets were classified into positive, negative and neutral classes based on the sentiments they contain. In addition, a variety of machine learning classifiers were evaluated using accuracy, precision, recall and F1 score as the performance metrics. The impact of feature extraction techniques, including term frequency (TF), term frequency-inverse document frequency (TF-IDF), and word2vec, on classification accuracy was investigated as well. Moreover, the performance of a deep long short-term memory (LSTM) network was analyzed on the selected dataset. The results show that the proposed VC performs better than that of other classifiers. The VC is able to achieve an accuracy of 0.789, and 0.791 with TF and TF-IDF feature extraction, respectively. The results demonstrate that ensemble classifiers achieve higher accuracy than non-ensemble classifiers. Experiments further proved that the performance of machine learning classifiers is better when TF-IDF is used as the feature extraction method. Word2vec feature extraction performs worse than TF and TF-IDF feature extraction. The LSTM achieves a lower accuracy than machine learning classifiers.
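The soft voting mechanism mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each base classifier returns one probability per class and that the ensemble averages them; the function names and the probability values are hypothetical and are not taken from the cited paper.

```python
from typing import List, Optional, Sequence

# Class labels as described in the abstract (order is an assumption).
CLASSES = ["negative", "neutral", "positive"]


def soft_vote(prob_rows: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Return the (weighted) mean probability per class across classifiers."""
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    n_classes = len(prob_rows[0])
    return [
        sum(w * row[k] for w, row in zip(weights, prob_rows)) / total
        for k in range(n_classes)
    ]


def predict(prob_rows: Sequence[Sequence[float]],
            classes: Sequence[str] = CLASSES) -> str:
    """Pick the class with the highest averaged probability."""
    avg = soft_vote(prob_rows)
    best = max(range(len(avg)), key=avg.__getitem__)
    return classes[best]


# Made-up probabilities for a single tweet from two base classifiers.
lr_probs = [0.50, 0.30, 0.20]   # hypothetical logistic regression output
sgd_probs = [0.20, 0.25, 0.55]  # hypothetical SGD classifier output

label = predict([lr_probs, sgd_probs])
```

In this made-up case the two classifiers disagree on their top class, so a majority (hard) vote would tie, while averaging the probabilities still yields a single label.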


“…The data is scraped directly from the news portal website; hence, it has punctuation symbols and many HTML related tags. There are various text pre-processing methods, including but not limited to converting capital letters to lowercase letters (case folding), clearing symbols, and punctuation marks [12]. We used a 70-30 split scheme for training and validation data sets after the data preprocessing operations.…”

Section: Data Preprocessing (mentioning)

confidence: 99%

Deep Feature Generation for Author Identification

Ozan

2021

Celal Bayar Üniversitesi Fen Bilimleri Dergisi


Identifying the authors of a given set of text is a well addressed and complicated task. It requires thorough knowledge of different authors' writing styles and discriminating them. As the main contribution of this paper, we propose to perform this task using machine learning and deep learning methods, state-of-the-art algorithms, and methods used in numerous complex Natural Language Processing (NLP) problems. We used a text corpus of daily newspaper columns written by thirty authors to perform our experiments. The experimental results proved that document embeddings trained via neural network architecture achieve cutting edge accuracy in learning writing styles and identifying authors of given writings even though the dataset has a considerably unbalanced distribution. We represent our experimental results and outsource our codes for interested readers and natural language processing (NLP) enthusiasts as a GitHub repository. They can reproduce and confirm the results and modify them according to their own needs.
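The preprocessing steps named in the citation statement for this entry (removing HTML tags, case folding, clearing symbols and punctuation, then a 70-30 train/validation split) can be sketched as follows. This is an illustrative sketch using only the Python standard library; the helper names, the regular expression, and the fixed shuffle seed are assumptions, not the cited authors' code.

```python
import random
import re
import string
from html import unescape
from typing import List, Sequence, Tuple

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag matcher (assumption)
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean(text: str) -> str:
    """Strip tags, decode entities, lowercase, and drop ASCII punctuation."""
    text = TAG_RE.sub(" ", text)         # remove HTML tags
    text = unescape(text)                # decode entities such as &amp;
    text = text.lower()                  # case folding
    text = text.translate(PUNCT_TABLE)   # replace punctuation with spaces
    return " ".join(text.split())        # collapse repeated whitespace


def split_70_30(items: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of the items and split it 70% / 30%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10        # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


docs = [clean(f"<p>Column No. {i}: Hello, World!</p>") for i in range(10)]
train, validation = split_70_30(docs)
```

Shuffling before the split is a design choice made here for illustration; the source does not say whether the split was random or stratified by author.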


“…As opinion texts are short and sparse, this sparse representation and their high dimensions have posed a major challenge to the clustering of such texts [20]. Using Word2Vec and Doc2Vec as text representation models solves the problem of sparse display of short texts, and to some extent, it also solves the problems regarding the representation of high-dimensional texts [21], but these methods represent the text with 200-500 dimensions where there is still the problem of high dimensions.…”

Section: Introduction (mentioning)

confidence: 99%

Opinion Texts Clustering Using Manifold Learning Based on Sentiment and Semantics Analysis

Gudakahriz, Moghadam, Mahmoudi 2021

Scientific Programming


Nowadays, opinion texts are quickly published on websites and social networks by various users in the form of short texts and also in high volumes and various fields. Because these texts reflect the opinions of many users, their processing and analysis, such as clustering, can be very useful in a variety of applications including politics, industry, commerce, and economics. High dimensions of the text representation decrease efficiency of clustering, and an effective solution for this challenge is reducing dimensions of texts. Manifold learning is a powerful tool for nonlinear dimension reduction of high-dimensional data. Therefore, in this paper, for increasing efficiency of opinion texts clustering, by manifold learning, dimensions of the represented opinion texts are reduced based on sentiment and semantics, and their intrinsic dimensions are extracted. Then, the clustering algorithm is applied to dimension-reduced opinion texts. The proposed approach helps us to cluster opinion texts with simultaneous consideration of sentiment and semantics, which has received very little attention in the previous works. This type of clustering helps users of opinion texts to obtain more useful information from texts and also provides more accurate summaries in applications, such as the summarization of opinion texts. Experimental results on three datasets show better performance of the proposed approach on opinion texts in terms of important measures for evaluating clustering efficiency. An improvement of about 9% is observed in terms of accuracy on the third dataset and clustering based on sentiment and semantics.
