2019
DOI: 10.5334/dsj-2019-037

Application of Natural Language Processing Algorithms to the Task of Automatic Classification of Russian Scientific Texts

Abstract: This work studies the applicability of modern machine learning methods to the automatic classification of scientific articles and abstracts. To this end, artificial neural networks, random forest, logistic regression, and support vector machine models were studied, taking into account a distinctive feature of scientific texts: a large number of terms specific to particular categories. Separately, the stages of data collection and extractio…

Cited by 23 publications (9 citation statements)
References 20 publications
“…Comparing the performance of our results with that of similar research on automated classification of scientific literature is not straightforward but some observations can be made. For example, in [ 21 ] we see F-scores of around 0.50 which is in the same area as our experiment 2, which had the largest number of classes. This study had a much larger training set but it is difficult to compare the complexity of the tasks.…”
Section: Discussion; citation type: supporting
confidence: 73%
“…classification of mathematical research [ 19 ] and on general research literature with the purpose of applying the correct Dewey Decimal Classification code [ 20 ]. While much work focuses on classification of English-language literature, examples of using machine learning methods for automated coding of scientific literature in the Russian language [ 21 ]. Most approaches appear to be based on supervised learning but use of unsupervised learning also exists [ 20 ].…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…Among the ML algorithms, there are the Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms which, in addition to being the most traditional algorithms, continue to provide good results. In Romanov et al [4] 99% accuracy was obtained regarding the classification of scientific texts based on their abstracts. However, this high acuity value reveals low precision and recall values, 61% and 36% respectively, which is not ideal.…”
Section: Data Classification Results - ML; citation type: mentioning
confidence: 99%
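The gap between 99% accuracy and much lower precision and recall in the statement above is typical of imbalanced data. The sketch below is only an illustration: it assumes a binary (one-vs-rest) setting, and the confusion-matrix counts are hypothetical, chosen so that the resulting metrics match the quoted figures. They are not taken from the cited paper, which may report metrics averaged over many classes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 positive documents among 10,000 (1% of the data).
# True negatives dominate, so accuracy stays high even though most
# positive documents are missed.
metrics = classification_metrics(tp=36, fp=23, fn=64, tn=9877)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these assumed counts, the large number of true negatives alone keeps accuracy near 99%, which is why precision, recall, and F-score are usually more informative for rare categories.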
“…The most common approaches to the use of NLP techniques usually use a set of steps, in which the data obtained is processed. In the work of Romanov et al [4], in which a classification system for scientific texts in Russian was developed, an approach consisting of 5 steps was presented, namely: the removal of formulas that are frequent in scientific texts; the aggregation of metadata, which includes the title, keywords, and summary; transformation of data to lowercase; the removal of stop words that reduces the amount of existing information to just useful information; and the stemming of words, which consists of deflecting words to determine their lemma.…”
Section: Data Processing - NLP; citation type: mentioning
confidence: 99%
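The five steps described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: the `$...$` formula delimiter, the tiny `STOP_WORDS` set, and the suffix-stripping `toy_stem` function are all hypothetical stand-ins. A real pipeline would more likely use a full stop-word list and an established Russian stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"и", "в", "на", "для", "по"}

# Hypothetical suffix list for a toy stemmer; longer suffixes come first.
SUFFIXES = ("ами", "ями", "ов", "ев", "ая", "ый", "ий", "ые", "ы", "и", "а", "я")


def remove_formulas(text):
    """Step 1: drop formulas, assumed here to be delimited by $...$."""
    return re.sub(r"\$[^$]*\$", " ", text)


def toy_stem(word):
    """Step 5: strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title, keywords, abstract):
    # Step 1: remove formulas from each metadata field.
    fields = [remove_formulas(field) for field in (title, keywords, abstract)]
    # Step 2: aggregate title, keywords and abstract into one text.
    text = " ".join(fields)
    # Step 3: convert to lowercase.
    text = text.lower()
    # Tokenize into runs of Cyrillic or Latin letters.
    tokens = re.findall(r"[а-яёa-z]+", text)
    # Step 4: remove stop words.
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Step 5: stem the remaining words.
    return [toy_stem(token) for token in tokens]


result = preprocess("Классификация текстов", "тексты", "Метод для $x^2$ текстов")
print(result)
```

Stemming maps different inflected forms such as "текстов" and "тексты" to a shared stem, which reduces the vocabulary the classifier has to learn.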
“…Real-world raw data is usually unsuitable for direct use in classifier training, so some cleaning and preprocessing steps are generally applied before the classification task. Thus, scientific contents must go through a Natural Language Processing (NLP) techniques for the data to be ready for classification [2].…”
Section: Introduction; citation type: mentioning
confidence: 99%