Towards the Detection of Cross-Language Source Code Reuse

Flores, Enrique; Barrón-Cedeño, Alberto; Rosso, Paolo; Moreno, Lidia

doi:10.1007/978-3-642-22327-3_31

Cited by 29 publications

(16 citation statements)

References 5 publications


“…Cross-language plagiarism detection is discussed in paper "Towards the detection of cross-language source code reuse" (Flores et al., 2011) whose authors found that methods applied for natural text (specifically n-gram comparison) work for Java, C and Python too. The other method might be comparison of an intermediate code produced by a special compiler suite.…”

Section: Theoretical Framework (mentioning)

confidence: 99%

“…Even as the scope has become broader, plagiarism remains one of the most important academic integrity issues appearing in student assignments undertaken individually or in groups, without direct supervision. In the digital era, massive amounts of information are available to reuse for anyone struggling with an assignment who is tempted to plagiarise (Flores et al, 2011).…”

Section: Introduction (mentioning)

confidence: 99%


Source Code Plagiarism Detection for PHP Language

Všianský, Dlabolová, Foltýnek

2017

EJOBSAT


This paper introduces a system for detection of plagiarism in source codes written in the PHP computer language, part of the plagiarism detection tool Anton. We used the greedy string tiling algorithm together with tokenization and hash calculation. The efficiency of the system was tested on both an artificial dataset and on real data coming from a course taught at our university. Our results are compared with other similar systems and solutions, concluding that Anton can detect all examined types of plagiarism with higher accuracy than other systems.
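The combination of tokenization and greedy string tiling named in this abstract can be sketched as follows. This is a minimal illustration, not Anton's implementation: the tokenizer, the `KEYWORDS` set, the `min_match` default and the coverage-based similarity are all assumptions made for the example, and the hash-based speed-up mentioned in the abstract is omitted for brevity.

```python
import re

# Hypothetical, deliberately crude tokenizer: identifiers and numbers are
# abstracted so that renaming variables does not hide a copy.
KEYWORDS = {"if", "else", "while", "for", "return", "function", "echo"}
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_]*|\d+|\S")


def tokenize(code):
    tokens = []
    for lexeme in TOKEN_RE.findall(code):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif lexeme[0].isalpha() or lexeme[0] in "_$":
            tokens.append("ID")
        elif lexeme[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def greedy_string_tiling(a, b, min_match=3):
    """Return non-overlapping tiles (start_a, start_b, length), longest first."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = min_match
        matches = []
        # Find the longest common runs made only of unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        if not matches:
            break
        for i, j, k in matches:
            # Skip matches that overlap a tile created earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def similarity(a, b, min_match=3):
    """Share of tokens covered by tiles, in the range 0..1."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2.0 * covered / (len(a) + len(b))


original = tokenize("$total = $price + 1;")
renamed = tokenize("$sum = $cost + 2;")
print(similarity(original, renamed))
```

Because identifiers and literals collapse to `ID` and `NUM` in this sketch, the two statements produce identical token sequences and are covered by a single tile.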


“…Such methods, as stated in [21], involve more complex as well as robust approaches. Normally, source code files are treated as text files, hence, common methods such as the traditional Bag-of-Words, character n-grams [12,22], and longest common sub-sequence [2,15] are among the most popular techniques. One such work takes into account the "whitespace" indentation patterns of a source code file [2], where a source code document is converted to a pattern, namely whitespace format, replacing any visible character by X and any whitespace by S, and leaving newlines as they appear.…”

Section: Related Work (mentioning)

confidence: 99%
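The whitespace-format conversion described in the statement above is simple enough to sketch directly. The function name `whitespace_format` is hypothetical; the mapping (visible character to X, whitespace to S, newlines kept) follows the quoted description, and treating tabs and carriage returns as ordinary whitespace is an assumption.

```python
def whitespace_format(source: str) -> str:
    """Replace visible characters by 'X' and whitespace by 'S'; keep newlines."""
    out = []
    for ch in source:
        if ch == "\n":
            out.append("\n")   # newlines are left as they appear
        elif ch.isspace():
            out.append("S")    # spaces, tabs and other whitespace
        else:
            out.append("X")    # any visible character
    return "".join(out)


sample = "if x:\n    y = 1\n"
print(whitespace_format(sample))
```

Two files with the same layout then yield the same pattern even when every identifier has been renamed, which is what makes the pattern usable as a comparison key.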

“…Consists of the character 3-gram based model proposed in [12]. In this model, the source code is considered as a text and represented as character 3-grams, where these n-grams are weighted using term frequency scheme.…”

Section: not specified (mentioning)

confidence: 99%
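A character 3-gram model with term-frequency weights, as described in the statement above, might look like the following sketch. The normalization step (lowercasing and removing whitespace) and the use of cosine similarity to compare two profiles are assumptions for illustration; the quoted text only specifies the 3-gram representation and the term-frequency weighting.

```python
import math
from collections import Counter


def normalize(code: str) -> str:
    # Assumed preprocessing: lowercase and drop all whitespace.
    return "".join(code.lower().split())


def tf_profile(code: str, n: int = 3) -> Counter:
    """Count every character n-gram of the normalized source text."""
    text = normalize(code)
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


java_code = "for (int i = 0; i < n; i++) { total += values[i]; }"
python_code = "for i in range(n):\n    total += values[i]"
print(cosine(tf_profile(java_code), tf_profile(python_code)))
```

Since the profile is built from raw characters, the same functions apply unchanged to programs in different languages, which is the property the cross-language setting relies on.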

On the Detection of SOurce COde Re-use

Flores, Rosso, Moreno et al. 2015

Proceedings of the Forum for Information Retrieval Evaluation on - FIRE '14

Self Cite


This paper summarizes the goals, organization and results of the first SOCO competitive evaluation campaign for systems that automatically detect the source code re-use phenomenon. The detection of source code re-use is an important research field for both software industry and academia fields. Accordingly, PAN@FIRE track, named SOurce COde Re-use (SOCO) focused on the detection of re-used source codes in C/C++ and Java programming languages. Participant systems were asked to annotate several source codes whether or not they represent cases of source code re-use. In total five teams submitted 17 runs. The training set consisted of annotations made by several experts, a feature which turns the SOCO 2014 collection in a useful data set for future evaluations and, at the same time, it establishes a standard evaluation framework for future research works on the posed shared task.


“…On the other hand, extrinsic systems have a collection of trusted source codes against which the suspicious code is compared. In this way, they try to detect whether any of the trusted source codes has been reused, or even whether the complete code of one or several of them has been reused [6,7].…”

Section: Background and State of the Art (unclassified)

Herramienta de apoyo en la detección de reutilización de código fuente

Picazo-Alvarez, Villatoro-Tello, Luna-Ramírez et al. 2014

RCS


Abstract. The act of taking content generated by other people, in part or in full, and presenting it as one's own without giving due credit to the authors is an improper form of content reuse, regarded as plagiarism. Unfortunately, given the wide availability of content on the Internet, this practice has become more common. The vast majority of the content available on the Web consists of multimedia materials, applications and, above all, texts, and all of them are susceptible to plagiarism. This article focuses on one particular class of texts: programs written in a programming language, known as source code. Given the ease of access and the practice of reusing content without citing sources (abuse of "copy and paste", arising from methodological shortcomings or as a deliberate act), tools are needed to combat plagiarism, especially of source code. This work proposes a tool aimed at detecting source code reuse in programs written in the same programming language. The techniques applied are based on detecting the similarity between two programs using their term frequency (TF) and inverse frequency (TF-IDF), taking as terms the sets of character n-grams present in each of them.

Keywords: character n-grams, vector representation, document similarity, source code reuse, natural language processing.

Introduction

The availability of large amounts of information through Internet access allows millions of users to consult very diverse information and materials. The amount of accessible information is constantly growing, and this growth has accelerated with the so-called Web 2.0, which allows users to produce and publish materials of different kinds. This has been possible, among other things, because of the ease of reproducing and reusing existing content in digital format. However, many of these reproductions
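As an illustration of the TF-IDF weighting over character n-grams that this abstract describes, the sketch below ranks a collection of known programs against one suspicious program. It is not the tool's implementation: the function names are hypothetical, and the smoothed formula `log((1 + N) / (1 + df)) + 1` is one common IDF variant chosen here so that n-grams shared by every document keep a non-zero weight.

```python
import math
from collections import Counter


def ngram_counts(text, n=3):
    """Raw counts of the character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf_vectors(docs, n=3):
    """Weight each document's n-gram counts by a smoothed IDF (assumed variant)."""
    counts = [ngram_counts(doc, n) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each n-gram counted once per document
    total = len(docs)
    idf = {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(u, v):
    dot = sum(u[g] * v[g] for g in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sources(suspicious, collection, n=3):
    """Return (index, score) pairs for the collection, most similar first."""
    vectors = tfidf_vectors([suspicious] + list(collection), n)
    scores = [(i, cosine(vectors[0], vectors[i + 1])) for i in range(len(collection))]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


collection = [
    "int sum(int a,int b){return a+b;}",
    "print('hello world')",
]
suspicious = "int add(int x,int y){return x+y;}"
for index, score in rank_sources(suspicious, collection):
    print(index, round(score, 3))
```

A threshold on the top score, or manual inspection of the highest-ranked pairs, would then decide which candidates to flag; choosing that threshold is outside the scope of this sketch.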
