2020
DOI: 10.1145/3405843

The Impact of Weighting Schemes and Stemming Process on Topic Modeling of Arabic Long and Short Texts

Abstract: In this article, first a comprehensive study of the impact of term weighting schemes on the topic modeling performance (i.e., LDA and DMM) on Arabic long and short texts is presented. We investigate six term weighting methods including Word count method (standard topic models), TFIDF, PMI, BDC, CLPB, and CEW. Moreover, we propose a novel combination term weighting scheme, namely, CmTLB. We utilize the mTFIDF that takes into account the missing terms and the number of the documents in which the term appears whe…

Cited by 17 publications (6 citation statements)
References 27 publications
“…Data preprocessing: Data preprocessing is considered as an essential step in machine learning and data mining ([25]; [26]; [27]; [28]). The reviews usually contain incomplete sentences, much noise, and weak wording such as words without application with high repetition, imperfect words, and incorrect grammar.…”
Section: Data Pre-processing
Citation type: mentioning
confidence: 99%
“…In most natural language processing applications, words are used as features. The most popular word vector representations are distributed representation and one-hot representation [27,47]. However, the one-hot representation has various problems, such as the too-large vector dimension, the sparsity of the word vector, and ignoring the word semantic association.…”
Section: Embedding (Word Representation)
Citation type: mentioning
confidence: 99%
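
The limitations of one-hot representation named in the statement above can be illustrated with a minimal sketch. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for illustration only; they do not come from the citing paper.

```python
def one_hot(word, vocab):
    """Return a one-hot vector (list of 0/1) for `word` over `vocab`."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy vocabulary, purely illustrative (an assumption, not from the source).
vocab = ["book", "novel", "river"]

book = one_hot("book", vocab)
novel = one_hot("novel", vocab)

# Dimension: each vector is as long as the whole vocabulary.
dimension = len(book)

# Sparsity: exactly one entry is non-zero.
non_zero = sum(book)

# No semantic association: distinct words are always orthogonal,
# even when they are related in meaning ("book" vs. "novel").
similarity = dot(book, novel)
```

With a realistic vocabulary of tens of thousands of words, every vector would have that many entries and still only one non-zero value, which is the dimensionality and sparsity problem the statement describes.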
“…The text clustering techniques mostly objective to create text papers clusters related to the papers with the basis of intrinsic contents. Once start clustering method, the text documents should be processed with the pre-processing methods such as tokenization [21], removal of stop words and stemming [22] process. Hence, the text documents are changed into a required format.…”
Section: Pre-processing
Citation type: mentioning
confidence: 99%
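
The pre-processing steps named in the statement above (tokenization, stop-word removal, stemming) can be sketched roughly as follows. This is not the citing paper's pipeline: the stop-word list, the English example, and the crude suffix-stripping `toy_stem` are all hypothetical stand-ins, and a real system for Arabic text would need a proper Arabic stemmer.

```python
import re

# Toy stop-word list (an assumption for illustration only).
STOP_WORDS = {"the", "a", "of", "and", "are", "into"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("The documents are clustered")
```

The output of `preprocess` is the "required format" in the sense of the statement: a list of normalized terms that a clustering or topic-modeling step could consume.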