Preprint submitted to Knowledge Based Systems, June 14, 2013. * Corresponding author. Email addresses: albarron@lsi.upc.edu (Alberto Barrón-Cedeño), pgupta@dsic.upv.es (Parth Gupta), prosso@dsic.upv.es (Paolo Rosso). The authors appear in alphabetical order. A. Barrón and P. Gupta contributed equally to this work and should both be considered as first authors.
Three factors are driving the rise of plagiarism across languages: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language different from their native one. Most efforts for automatically detecting cross-language plagiarism depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T+MA); three models that differ inherently in nature and in the resources they require. The three models are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T+MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher values of precision, an important factor in plagiarism detection when less user intervention is desired.
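To make the character n-gram idea behind CL-CNG concrete, here is a minimal sketch. The abstract does not give the model's details, so everything below is an assumption: the choice of 3-grams, the normalisation (lowercasing, dropping punctuation), the cosine measure, and the function names `char_ngrams`, `cosine`, and `cl_cng_similarity` are illustrative and are not taken from the paper.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a normalised text.

    Normalisation here (an assumption): lowercase, drop punctuation,
    collapse runs of whitespace into a single space.
    """
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cl_cng_similarity(x, y, n=3):
    """Similarity of two texts (possibly in different languages)
    based only on shared character n-grams."""
    return cosine(char_ngrams(x, n), char_ngrams(y, n))
```

A sketch like this needs no translation resources at all, which is why it can only work for language pairs that share a script and some surface vocabulary: `cl_cng_similarity("plagiarism detection", "detección de plagio")` is positive because of shared fragments such as `det` and `pla`, while two texts with no n-grams in common score 0.0.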
For many languages that use non-Roman indigenous scripts (e.g., Arabic, Greek and Indic languages) one can often find a large amount of user-generated transliterated content on the Web in the Roman script. Such content creates a monolingual or multilingual space with more than one script, which we refer to as the Mixed-Script space. IR in the mixed-script space is challenging because queries written in either the native or the Roman script need to be matched to documents written in both scripts. Moreover, transliterated content features extensive spelling variations. In this paper, we formally introduce the concept of Mixed-Script IR and, through analysis of the query logs of the Bing search engine, estimate the prevalence and thereby establish the importance of this problem. We also give a principled solution to handle mixed-script term matching and spelling variation, where the terms across the scripts are modelled jointly in a deep-learning architecture and can be compared in a low-dimensional abstract space. We present an extensive empirical analysis of the proposed method along with the evaluation results in an ad-hoc retrieval setting of mixed-script IR, where the proposed method achieves significantly better results (12% increase in MRR and 29% increase in MAP) compared to other state-of-the-art baselines.
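The MRR and MAP figures quoted in the abstract above are standard ranked-retrieval metrics. The following is a minimal sketch of how they are usually computed; the function names and the toy document identifiers are hypothetical, and the abstract does not describe the authors' actual evaluation code.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Toy example with made-up document ids (not data from the paper).
runs = [
    (["d1", "d2", "d3"], {"d2"}),        # first relevant hit at rank 2
    (["d4", "d5"], {"d4", "d5"}),        # both relevant, perfectly ranked
]
mrr = mean_over_queries(reciprocal_rank, runs)
map_score = mean_over_queries(average_precision, runs)
```

MRR rewards only the position of the first relevant result, while MAP takes every relevant document into account, so the two can move by different amounts for the same system change.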
Gupta, PA.; Rosso, P. (2013). Cross-language plagiarism detection using multilingual semantic network. In Advances in Information Retrieval. Springer Verlag (Germany). 7814:710-713. doi:10.1007/978-3-642-36973-5_66.
Cross-Language Plagiarism Detection using a Multilingual Semantic Network
Marc Franco-Salvador, Parth Gupta, and Paolo Rosso. Natural Language Engineering Lab - ELiRF, DSIC, Universitat Politècnica de València, Valencia, Spain. {mfranco,pgupta,prosso}@dsic.upv.es
Abstract. Cross-language plagiarism refers to the type of plagiarism where the source and suspicious documents are in different languages. Plagiarism detection across languages is still in its infancy. In this article, we propose a new graph-based approach that uses a multilingual semantic network to compare document paragraphs in different languages. To investigate the proposed approach, we used the German-English and Spanish-English cross-language plagiarism cases of the PAN-PC'11 corpus. We compare the obtained results with two state-of-the-art models. Experimental results indicate that our graph-based approach is a good alternative for cross-language plagiarism detection.
Abstract. This work addresses the issue of cross-language high similarity and near-duplicates search, where, for a given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem that is very light in computation and memory. We evaluate the model on three corpora of different natures and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models, and we find that, although the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.
Cross-language (CL) plagiarism detection aims at detecting plagiarised fragments of text among documents in different languages. In this work we compare different methods that use continuous-space representations of language to perform the task of CL plagiarism detection. We also present continuous word alignment-based similarity analysis, a new model to estimate similarity between text fragments. In addition, we study the combination of our continuous representations with the knowledge-based similarity analysis model. We compare the aforementioned approaches with several state-of-the-art models and study their performance in detecting plagiarism cases of different lengths and obfuscation types. We conduct experiments over Spanish-English and German-English datasets. Experimental results show that continuous representations allow the continuous word alignment-based similarity analysis model to obtain competitive results and the knowledge-based similarity analysis model to outperform the existing state of the art in CL plagiarism detection.