2019
DOI: 10.3906/elk-1806-38

Key word extraction for short text via word2vec, doc2vec, and textrank

Abstract: The rapid development of social media encourages people to share their opinions and feelings on the Internet. Every day, a large number of short text comments are generated through Twitter, microblogging, WeChat, etc., and there is high commercial and social value in extracting useful information from these short texts. At present, most studies have focused on extracting text key words. For example, the LDA topic model has good performance with long texts, but it loses effectiveness with short texts because of …

Citations: cited by 43 publications (34 citation statements)
References: 13 publications
“…In the same fashion, the use of various feature extraction techniques has proven to improve classification accuracy. Text mining has many feature extraction methods but term frequency (TF), inverse document frequency (IDF), TF-IDF, word2vec and doc2vec are among the most commonly used feature extraction techniques [ 24 ]. The authors of [ 25 ] investigated the use of TF, IDF, and TF-IDF with linear classifiers including SVM, LR, and perceptron with a native language identification system.…”
Section: Literature Review (mentioning)
confidence: 99%
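
The TF, IDF, and TF-IDF weights named in the statement above can be illustrated with a minimal sketch. It assumes whitespace tokenization and the classic formulation tf × log(N / df); the cited works may use a smoothed or normalized variant, and the function name `tf_idf` and the sample comments are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization is good enough here.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency (relative count) times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Invented short comments, for illustration only.
docs = [
    "great phone great battery",
    "battery drains fast",
    "great camera",
]
weights = tf_idf(docs)
```

Under this formulation, a term that occurs in only one document ("phone") is weighted above an equally frequent term shared with another document ("battery"), which is why TF-IDF tends to surface distinguishing words.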
“…The data is scraped directly from the news portal website; hence, it has punctuation symbols and many HTML related tags. There are various text pre-processing methods, including but not limited to converting capital letters to lowercase letters (case folding), clearing symbols, and punctuation marks [12]. We used a 70-30 split scheme for training and validation data sets after the data preprocessing operations.…”
Section: Data Preprocessing (mentioning)
confidence: 99%
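
The preprocessing steps and the 70-30 split described in the statement above could look roughly like the following sketch. It is not the citing paper's pipeline: the regular expressions, the helper names `clean` and `split_70_30`, and the fixed shuffle seed are all assumptions, and HTML entities such as `&amp;` are not decoded.

```python
import random
import re

# Assumption: tags are simple enough to strip with a regular expression.
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is neither a word character nor whitespace counts as a symbol.
SYMBOL_RE = re.compile(r"[^\w\s]")


def clean(text):
    """Strip HTML tags, fold case, and remove symbols and punctuation."""
    text = TAG_RE.sub(" ", text)
    text = text.lower()                # case folding
    text = SYMBOL_RE.sub(" ", text)    # clear symbols and punctuation marks
    return " ".join(text.split())      # collapse repeated whitespace


def split_70_30(items, seed=0):
    """Shuffle a copy of items and return (training, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10      # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


cleaned = clean("<p>Breaking NEWS: Prices rise!</p>")
train, valid = split_70_30(range(10))
```

Seeding the shuffle keeps the split reproducible between runs; a stratified split would be the safer choice when class labels are imbalanced.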
“…As opinion texts are short and sparse, this sparse representation and their high dimensions have posed a major challenge to the clustering of such texts [20]. Using Word2Vec and Doc2Vec as text representation models solves the problem of sparse display of short texts, and to some extent, it also solves the problems regarding the representation of high-dimensional texts [21], but these methods represent the text with 200-500 dimensions where there is still the problem of high dimensions.…”
Section: Introduction (mentioning)
confidence: 99%
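
The contrast drawn above between sparse and dense text representations can be made concrete with a toy sketch. The 3-dimensional embeddings below are invented for illustration, whereas trained models use the 200-500 dimensions mentioned in the statement. Averaging word vectors is used here as a simple stand-in for a document vector; it is not the Doc2Vec algorithm.

```python
def bag_of_words(tokens, vocabulary):
    """Sparse representation: one count per vocabulary term."""
    return [tokens.count(term) for term in vocabulary]


def mean_vector(tokens, embeddings):
    """Dense representation: average the vectors of the known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical embeddings; real ones would come from a trained model.
embeddings = {
    "battery": [0.9, 0.1, 0.0],
    "camera": [0.8, 0.2, 0.1],
    "drains": [0.7, 0.3, 0.1],
    "fast": [0.2, 0.8, 0.4],
    "great": [0.1, 0.9, 0.7],
}
vocabulary = sorted(embeddings)

tokens = ["battery", "drains", "fast"]
sparse = bag_of_words(tokens, vocabulary)   # length grows with the vocabulary
dense = mean_vector(tokens, embeddings)     # length fixed by the embedding size
```

The sparse vector has one slot per vocabulary term and is mostly zeros for a short text, while the dense vector keeps a fixed length regardless of vocabulary size.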