Scalable Source Code Plagiarism Detection Using Source Code Vectors Clustering

Duracik, Michal; Kršák, Emil; Hrkut, Patrik

doi:10.1109/icsess.2018.8663708

Cited by 9 publications

(6 citation statements)

References 6 publications

“…The last phase deals with obtaining individual matches from the database and their evaluation. In this phase, before the actual generation of the report, a filter of non-significant matches can be included which serves to clarify the reports [30]. In the filter, it is possible to define patterns that will not be taken into account in the evaluation, for example, commands for importing packages, generated commands and others.…”

Section: Figure 1 Structure Of the Designed System (mentioning)

confidence: 99%
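The pattern-based filter described in this statement can be illustrated with a minimal sketch. The ignore patterns, the `fragment` field, and the function names below are hypothetical; the quoted paper does not list its actual rules or data model.

```python
import re

# Hypothetical ignore patterns (assumptions, not the authors' rules).
IGNORE_PATTERNS = [
    re.compile(r"^\s*import\s+[\w.*]+\s*;?\s*$"),  # package imports
    re.compile(r"^\s*package\s+[\w.]+\s*;\s*$"),   # package declarations
    re.compile(r"^\s*//\s*<auto-generated>"),      # marker for generated code
]


def is_significant(fragment: str) -> bool:
    """Return True if at least one non-blank line matches no ignore pattern."""
    for line in fragment.splitlines():
        if not line.strip():
            continue
        if not any(p.search(line) for p in IGNORE_PATTERNS):
            return True
    return False


def filter_matches(matches):
    """Keep only the matches whose matched fragment is significant."""
    return [m for m in matches if is_significant(m["fragment"])]


# Illustrative input: one match made only of imports, one with real logic.
matches = [
    {"pair": ("a.java", "b.java"),
     "fragment": "import java.util.List;\nimport java.util.Map;"},
    {"pair": ("a.java", "c.java"),
     "fragment": "int total = 0;\nfor (int x : xs) total += x;"},
]
kept = filter_matches(matches)
```

In this sketch a match is dropped only when every non-blank line is covered by an ignore pattern, so a fragment that mixes imports with real logic is still reported.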

“…Finally, we designed a method of data persistence [30] based on clustering using a relational database, see FIGURE 9. When dealing with data persistence, we also dealt with the efficiency of the search in this data structure.…”

Section: Figure 8 Incremental Clustering Scheme (mentioning)

confidence: 99%

“…The last part of the designed system for searching for plagiarism in the source code is to search for similar parts of the source code. The basis of our plagiarism search algorithm is the data structure (see FIGURE 9) designed in our previous work [30]. The disadvantage of such a design is that we are able to generate a report only for works (students' assignments) that we have already added to the database before.…”

Section: Search For the Matching Parts Of The Source Code (mentioning)

confidence: 99%

“…The plagiarism search algorithm consists of three parts. The first one is to obtain similarities from the database, the second one is to match and filter these similarities, and in the third part, the degree of similarity is calculated for the detected pairs of works [30].…”

Section: Figure 9 Data Structure For Filing Vectors (mentioning)

confidence: 99%

Abstract Syntax Tree Based Source Code Antiplagiarism System for Large Projects Set

et al. 2020

Self Cite

The paper deals with the issue of detecting plagiarism in source code, which we unfortunately encounter when teaching subjects dealing with programming and software development. Many students want to simplify the completion of the course and therefore submit modified source codes of their classmates or even those found on the Internet. Some try to modify the source code e.g. by changing the identifiers of classes, methods and variables to different ones, by changing the corresponding loops, by introducing new methods or by changing the order of methods in the source code or in other ways. We focused directly on this problem and designed our own anti-plagiarism system that we describe in this paper. The designed system consists of three parts during which the source code is processed using six designed algorithms. The basis is the processing of the source code and its transformation into an abstract syntax tree, consisting of two types of nodes, which is then vectorized using our modified DECKARD algorithm. The vectors are then clustered and stored in a database from which similar parts of the source code can be searched. The output of the system is then the final report containing a list of matches with similarities of all works that have been added to the database until then. The designed anti-plagiarism system is finally compared with the success of plagiarism detection performed by the two most used anti-plagiarism tools, namely JPlag and MOSS. It is evaluated on assignments elaborated by students from the courses dealing with object-oriented programming at our faculty.
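The vectorization step mentioned in the abstract can be sketched in the spirit of DECKARD-style characteristic vectors, which count how often each kind of syntax-tree node occurs. The sketch below is not the authors' modified algorithm: it uses Python's built-in `ast` module instead of their two-node-type tree, builds one vector per file rather than per subtree, and the cosine similarity measure is an assumption made for illustration.

```python
import ast
from collections import Counter


def characteristic_vector(source: str) -> Counter:
    """Count AST node types in a code fragment, ignoring identifier names."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


original = (
    "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
)
# Same structure with every identifier renamed.
renamed = (
    "def suma(values):\n    acc = 0\n    for v in values:\n"
    "        acc += v\n    return acc\n"
)
different = "def greet(name):\n    print('hello ' + name)\n"

v1, v2, v3 = map(characteristic_vector, (original, renamed, different))
```

Because the vector records only node types, renaming classes, methods, and variables leaves it unchanged, which is why this family of representations resists the identifier changes the abstract describes.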

“…In code plagiarism detection, various methods and algorithms are used, including Token [12], Graph [13], Attribute [14], and Structure-based Detection [15]. Furthermore, structure or Parse-based Detection, used to generate Abstract Syntax Trees (AST), is suitable for identifying code plagiarism because it accurately represents the structure [16]. AST is also considered effective in identifying attempts to avoid detection systems, such as variable renaming, adding comments, and function rearrangement.…”

Section: Introduction (mentioning)

confidence: 99%

Damerau-Levenshtein Distance Algorithm Based on Abstract Syntax Tree to Detect Code Plagiarism

Nuraminah, Ammar 2023, SJI

Purpose: This research aimed to detect source code plagiarism based on Abstract Syntax Tree using Damerau-Levenshtein Distance algorithm, which is expected to streamline the inaccuracies and time-consumption associated with the manual process. Methods: Damerau-Levenshtein Distance algorithm was used to determine the similarity between source code files and calculate F-Measure. The dataset, which consisted of 178 source code files from 20 coursework assignments, was obtained from GitHub by Lawton Nichols in 2019. Damerau-Levenshtein Distance algorithm was used to compute the minimum cost required to transform one line of code into another. Furthermore, ANTLR detected AST, which was processed through preprocessing, including node pruning, function and variable sorting, and log output removal. Result: The result showed that the two methods took 5.704 seconds and 0.996 seconds to complete. The lowest and highest values obtained using F-Measure were 0.16 and 0.8, respectively. Therefore, the system performed detection processes quickly and effectively detected common forms of code plagiarism with difficulty in the more complex forms. Novelty: In conclusion, this research used AST and Damerau-Levenshtein Distance algorithm to calculate the 5 levels of similarity in Java programming language source code. For further development, preprocessing steps were needed to prune unnecessary nodes and detect equivalent but differently syntaxed code.
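For readers unfamiliar with the distance named in this abstract, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and swaps of adjacent symbols. Which variant the paper uses is not stated here, and the `similarity` normalization is an assumption added for illustration.

```python
def osa_distance(a, b) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent symbols.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a, b) -> float:
    """Hypothetical normalization of the distance to the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest
```

The function indexes its arguments generically, so it accepts lists of tokens or AST node labels as well as strings; for example, `osa_distance("ab", "ba")` counts a single transposition.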

Smart Clustering of HPC Applications Using Similar Job Detection Methods

Shaikhislamov, Voevodin 2023, Lecture Notes in Computer Science
