The Influence of Text Preprocessing Methods and Tools on Calculating Text Similarity

Petrović, Đorđe; Stanković, Milena

doi:10.22190/fumi1905973d

Cited by 5 publications

(8 citation statements)

References 9 publications

“…Familiar text preprocessing includes tokenization, case folding, stop word removal, stemming, and transformation. Input data for this research needed to be prepared as a suitable input for the base selected ML classifiers and the ensemble [31]. The steps for pre-processing are mostly the same as those for all base classifiers.…”

Section: Data Pre-processing

mentioning

confidence: 99%

“…A token is a group of letters joined with a semantic meaning with no need for further processing. Different tokenization methods can be applied to a text, so it is important to use the same technique for all texts used in an experiment [31]…”

mentioning

confidence: 99%

“…Case folding is the process of unifying the cases of the letters in the entire text, but there can be some ambiguity if uppercase letters are used to distinguish different abbreviations [31].…”

mentioning

confidence: 99%

“…Stop words are the parts of sentences with negative effects on multiclassification problems. Stop words include prepositions, pronouns, adverbs, and conjunctions [31]…”

mentioning

confidence: 99%

“…Stemming refers to extracting the morphological root of a word. Several different techniques are used for this process, including lemmatization, the use of semi-automatic lookup tables, and suffix stripping [31].…”

mentioning

confidence: 99%
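The four steps named in the statements above (tokenization, case folding, stop word removal, and stemming by suffix stripping) can be sketched in a few lines of Python. This is a minimal illustration only, not the pipeline used in the cited papers: the stop-word list, the suffix list, and all function names are hypothetical and chosen for brevity.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "is", "are", "it", "for"}

# Hypothetical suffix list for a naive suffix-stripping stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into runs of word characters (Unicode letters, digits, underscore)."""
    return re.findall(r"\w+", text)


def case_fold(tokens):
    """Unify letter case; note that this also merges e.g. 'US' and 'us'."""
    return [t.casefold() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply the four steps in the order listed in the statements above."""
    tokens = tokenize(text)
    tokens = case_fold(tokens)
    tokens = remove_stop_words(tokens)
    return [strip_suffix(t) for t in tokens]
```

Calling `preprocess("The classifiers are tested on the requirements")` yields `['classifier', 'test', 'requirement']`. The same `tokenize` function is applied to every input, which reflects the point made above about using one tokenization technique for all texts in an experiment.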

An Ensemble Machine Learning Technique for Functional Requirement Classification

2020

In Requirement Engineering, software requirements are classified into two main categories: Functional Requirement (FR) and Non-Functional Requirement (NFR). FR describes user and system goals. NFR includes all constraints on services and functions. Deeper classification of those two categories facilitates the software development process. There are many techniques for classifying FR; some of them are Machine Learning (ML) techniques, and others are traditional. To date, the classification accuracy has not been satisfactory. In this paper, we introduce a new ensemble ML technique for classifying FR statements to improve their accuracy and availability. This technique combines different ML models and uses enhanced accuracy as a weight in the weighted ensemble voting approach. The five combined models are Naïve Bayes, Support Vector Machine (SVM), Decision Tree, Logistic Regression, and Support Vector Classification (SVC). The technique was implemented, trained, and tested using a collected dataset. The accuracy of classifying FR was 99.45%, and the required time was 0.7 s.
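The weighted ensemble voting mentioned in this abstract can be illustrated with a short Python sketch. The abstract does not say how the "enhanced accuracy" weights are computed, so the weights below are placeholder values, and the function and model names are hypothetical. Each model casts a vote for its predicted label, votes are summed per label using the model's weight, and the label with the largest total wins.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: dict mapping model name -> predicted label
    weights:     dict mapping model name -> non-negative weight
    Ties are resolved in favour of the label that was seen first.
    """
    scores = defaultdict(float)
    for model, label in predictions.items():
        scores[label] += weights[model]
    return max(scores, key=scores.get)


# Placeholder predictions and weights (not results from the paper).
predictions = {
    "naive_bayes": "A",
    "svm": "B",
    "decision_tree": "B",
    "logistic_regression": "A",
    "svc": "A",
}
weights = {
    "naive_bayes": 0.90,
    "svm": 0.95,
    "decision_tree": 0.85,
    "logistic_regression": 0.92,
    "svc": 0.93,
}
winner = weighted_vote(predictions, weights)
```

With these placeholder values, label "A" collects a total weight of 2.75 against 1.80 for "B", so `winner` is `"A"`. Unlike plain majority voting, a single heavily weighted model can outvote several lightly weighted ones.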

Development of the information system for the Kazakh language preprocessing

et al. 2021

The aim of this work is the design and development of linguistic resources and preprocessing tools for the Kazakh language. The media-corpus of the Kazakh language is presented as a linguistic resource, which is available on Al-Farabi Kazakh National University platform. The media-corpus of the Kazakh language consists of texts of news content and is implemented as an information system. The general architecture of an information system for the automatic and reliable collection, storage and analysis of texts in the Kazakh language is described. Three automatic text preprocessing tools for the Kazakh language (word forms generator, morphological analyzer, and morphological disambiguation tool) are presented in the article. The proposed tools can also be applied in the systems of automatic analysis of texts, in creation of other linguistic resources such as thesauri and ontologies.

ABOUT THE AUTHORS

Darkhan Akhmed-Zaki is the Doctor of Technical Sciences, Professor, Rector of the Astana IT University. He is the author of over 200 scientific papers. His scientific research relates to the organization of distributed and parallel computing, program verification and data mining. He is the scientific supervisor of more than 20 international and Kazakhstan scientific projects.

A Survey of Resources and Methods for Natural Language Processing of Serbian Language

Marovac, Avdić, Milošević

2023

Preprint

The Serbian language is a Slavic language spoken by over 12 million speakers and well understood by over 15 million people. In the area of natural language processing, it can be considered a low-resourced language. Also, Serbian is considered a high-inflectional language. The combination of many word inflections and low availability of language resources makes natural language processing of Serbian challenging. Nevertheless, over the past three decades, there have been a number of initiatives to develop resources and methods for natural language processing of Serbian, ranging from developing a corpus of free text from books and the internet, annotated corpora for classification and named entity recognition tasks to various methods and models performing these tasks. In this paper, we review the initiatives, resources, methods, and their availability.
