2019
DOI: 10.22190/fumi1905973d

The Influence of Text Preprocessing Methods and Tools on Calculating Text Similarity

Abstract: Text mining to a great extent depends on the various text preprocessing techniques. The preprocessing methods and tools which are used to prepare texts for further mining can be divided into those which are and those which are not language-dependent. The subject matter of this research was the analysis of the influence of these methods and tools on further text mining. We first focused on the analysis of the influence on the reduction of the vector space model for the multidimensional representation of text docu…

Citations: cited by 5 publications (8 citation statements)
References: 9 publications
“…Familiar text preprocessing includes tokenization, case folding, stop word removal, stemming, and transformation. Input data for this research needed to be prepared as a suitable input for the base selected ML classifiers and the ensemble [31]. The steps for pre-processing are mostly the same as those for all base classifiers.…”
Section: Data Pre-processing
mentioning
confidence: 99%
“…A token is a group of letters joined with a semantic meaning with no need for further processing. Different tokenization methods can be applied to a text, so it is important to use the same technique for all texts used in an experiment [31]…”
mentioning
confidence: 99%
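
The preprocessing steps named in the citation statements above (tokenization, case folding, stop word removal, stemming) can be illustrated with a minimal Python sketch. It is not the method of the cited paper or of the citing works. The stop word list, the suffix list, and the function names `tokenize`, `crude_stem`, and `preprocess` are all hypothetical, and the suffix stripper is only a crude stand-in for a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for"}

# Hypothetical suffix list; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply the four steps in a fixed order so every text is treated alike."""
    tokens = tokenize(text)                               # tokenization
    tokens = [t.lower() for t in tokens]                  # case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [crude_stem(t) for t in tokens]                # stemming (crude)


if __name__ == "__main__":
    print(preprocess("The cats are chasing the mice"))
```

Running every document in an experiment through the same `preprocess` function is one simple way to follow the advice quoted above about using the same tokenization technique for all texts.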