Abstract: Despite the rapid growth of Internet technologies and applications, text remains the most common Internet medium; social networking and web applications, for example, are mostly text based. We developed a framework to determine an anonymous author's native language from short, multi-genre texts such as those found in many Internet applications. In this framework, four types of feature sets (lexical, syntactic, structural, and content-specific features) are extracted, and three machine learning algorithms (C4.5 decision tree, support vector machine, and Naïve Bayes) are applied to identify the author's native language from the proposed features. To evaluate the framework, we used English, Persian, Turkish, and German online news texts. The experimental results showed that the proposed approach identified the author's native language in web-based texts with a satisfactory accuracy of 70% to 80%. Support vector machines outperformed the other two classification techniques in our experiments.
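The four feature families named in the abstract can be illustrated with a minimal sketch. It is not the authors' feature set: the function name `extract_features`, the word lists `FUNCTION_WORDS` and `CONTENT_KEYWORDS`, and the specific measures are all assumptions chosen for illustration, since the abstract does not list the features actually used.

```python
import re

# Hypothetical word lists; the abstract does not give the real ones.
FUNCTION_WORDS = ("the", "of", "and", "in", "to")
CONTENT_KEYWORDS = ("election", "market")


def extract_features(text):
    """Return a small dict of illustrative stylometric features for one text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input

    features = {
        # Lexical: word-level statistics
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        # Syntactic: punctuation usage (function words are added below)
        "punct_per_word": sum(text.count(p) for p in ",;:") / n_words,
        # Structural: how the text is laid out
        "n_sentences": len(sentences),
        "words_per_sentence": len(words) / max(len(sentences), 1),
    }
    # Syntactic: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    # Content-specific: raw counts of topic keywords
    for kw in CONTENT_KEYWORDS:
        features["kw_" + kw] = words.count(kw)
    return features
```

A feature dictionary of this kind would be turned into a fixed-order vector and passed to a classifier such as a decision tree, a support vector machine, or Naïve Bayes.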
In this paper, the effect of machine translators on textual data classification is examined using supervised classification methods. The developed system first analyzes and classifies an input text in one language, and then analyzes and classifies the same text in another language, as generated from the input text by machine translators. The results are compared to measure the effect of the translators on textual data classification. The performance of each classification method used in this study is also measured and compared. The classification process consists of training data preparation, feature selection, and classification of the input texts with and without translation. The results show that the Multinomial Naïve Bayes method is the most successful, and that translation has only a small effect on the attained classification accuracy.
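The comparison described above can be sketched with a small Multinomial Naïve Bayes classifier written in plain Python. This is an illustration under stated assumptions, not the paper's system: the class `MultinomialNB`, the helpers `tokenize` and `accuracy`, whitespace tokenization, and add-one smoothing are all choices made here, and the machine translation step is assumed to happen outside the code.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain whitespace tokenization; the paper's preprocessing is not given.
    return text.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)
```

To measure the effect of translation, one would train one model per language, compute `accuracy` on the original test texts with the source-language model and on their machine-translated versions with the target-language model, and compare the two numbers.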