Source code plagiarism can be identified by analyzing a pair of source code files from several diverse views. In this paper we present three representations drawn from lexical and structural views of a given source code file. We attempt to show that different representations provide diverse information that can be useful for identifying plagiarism. In particular, we present representations based on character 3-grams, the data types in function signatures, and the identifier names in function signatures. While we used only three representations, more can be added. We conducted our analysis over a collection of 79 source code files written in C. Our results show that the n-gram representation is very informative, but also that the representations taken from function signatures are, to some extent, complementary.
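As an illustration of the character 3-gram representation mentioned above, the following minimal sketch builds 3-gram count vectors for two C fragments and compares them. The abstract does not say how representations are compared, so the use of cosine similarity, the whitespace normalization, and the names `char_ngrams` and `cosine_similarity` are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after collapsing whitespace.

    Collapsing whitespace is an assumption of this sketch; it keeps
    reformatting alone from changing the representation.
    """
    normalized = " ".join(text.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty fragment shares nothing with anything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical C fragments used only for illustration.
    original = "int sum(int a, int b) { return a + b; }"
    renamed = "int add(int x, int y) {\n    return x + y;\n}"
    unrelated = "void log_msg(const char *m) { puts(m); }"

    v_original = char_ngrams(original)
    print("original vs renamed:  ", cosine_similarity(v_original, char_ngrams(renamed)))
    print("original vs unrelated:", cosine_similarity(v_original, char_ngrams(unrelated)))
```

A lexical view of this kind is sensitive to identifier renaming, which is one reason a structural view, such as the data types in function signatures, may contribute complementary evidence.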
Abstract. Readers often need access to the information in documents whose vocabulary goes beyond the general lexicon (for example, regulations or laws, contracts, and care instructions) when no domain-specific dictionaries, glossaries, terminologies, or thesauri are available. This work explores the automatic generation of a thesaurus from the text of a semi-structured document, in order to take a first look at the components of this process and to begin an analysis of the variables that influence the extraction of part of the semantics of domain-specific texts. An adaptation of the SEXTANT method was applied to several texts from different special domains to automatically build a specific thesaurus for each one. A review of the related term pairs and the source text leads us to conclude that there is a relation between the characteristics of the text and the productivity of the method. Keywords: thesaurus, semantic extraction, text style.