The task of identifying plagiarism between texts in different languages is an important variation of the general problem of identifying plagiarism. To solve this problem it is productive to calculate the degree of certain similarity of two texts which is called parallelism. In the article, the method of parallelism estimation based on Zipfian frequency distribution is studied. The key idea of the method is the construction of a linear regression model that compares the areas under the linearized Zipf curve for the corresponding documents. A computational procedure has been implemented to find the optimal classification parameters for such a model. To obtain a model more relevant to specific application conditions, computational experiments were performed to determine the optimal parameters corresponding to two classification metrics: the proportion of correct answers (accuracy) and F1-measure. The determination of the best classification parameters performed on the basis of the training subset of the corporal. To reliably estimate the model, classification metrics are recalculated on a test subset. The performed computational experiments using this approach showed limited applicability to language pairs composed of English, Russian and Ukrainian texts. To improve the filtering performance of parallel texts, a filter based on word frequencies in texts is proposed and implemented. To improve the quality of classification two directions have been formulated: an extension of the text corpora used in the model training, as well as methods for mutual using several classification filters.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.