Abstract: The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. The latter can be a challenging problem, since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining and outlining of functions, etc. In this paper, we propose a new similarity detection technique not only based on toke…
“…Applying the duplicate-code model of this study, we first construct the similarity matrix shown in Figure 1. After constructing the matrix, we convert the similarity distances into a transaction set and extract the set for file number 1, shown below: (2,86), (3,27), (4,40), (5,45), (6,35), (7,28), (8,40), (9,45), (10,122), (11,62), (12,53), (13,56), (14,141), (15,149), (16,56), (17,54), (18,84), (19,69), (20,83). From this set we can see that the extracted item set still contains too much information. The corresponding file data set details are shown in Figure 2.…”
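As a hedged illustration of the step described in this excerpt, the sketch below converts a pairwise similarity-distance matrix into (file, distance) transaction pairs for one file; the matrix values and the transactions_for_file helper are hypothetical, not taken from the paper.

```python
# Minimal sketch (not the authors' implementation): converting a pairwise
# similarity-distance matrix into a transaction set for one file, as the
# excerpt describes. The matrix values and file numbering are illustrative.

def transactions_for_file(distance_matrix, file_id):
    """Return (other_file, distance) pairs for the given file (1-indexed)."""
    row = distance_matrix[file_id - 1]
    return [(other + 1, d) for other, d in enumerate(row) if other + 1 != file_id]

# Toy 3x3 matrix; a real matrix would hold distances between every pair of files.
matrix = [
    [0, 86, 27],
    [86, 0, 40],
    [27, 40, 0],
]
print(transactions_for_file(matrix, 1))  # [(2, 86), (3, 27)]
```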
“…Li Siyu proposed an intermediate-representation code similarity detection method [14]. Michel Chilowicz et al. combined function call graphs with word-sequence matching to detect code similarity [15,16]. Tokenization-based analysis has also been reported in the literature [17].…”
To improve the efficiency and accuracy of program source code similarity detection, this work improves on existing detection methods by addressing some deficiencies in current research. A similar-code detection model based on frequent item sets is proposed. The model builds frequent item set data to discover collections of repetitive code and to automatically assign file similarity attribution. The model does not need to consider the type of code during detection and therefore has wide applicability: it can detect code files written in different programming languages and grammars, and it can also mark similar code and report statistics on the results. Experimental comparison shows that the model achieves high accuracy and processing efficiency.
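A minimal sketch of the frequent item set idea follows, assuming an Apriori-style support count over transactions of files flagged as similar together; the transaction contents and the min_support threshold are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch, not the paper's algorithm: counting frequent itemsets of
# files that repeatedly appear together in similarity transactions, in the
# spirit of Apriori. Transactions and the support threshold are illustrative.
from itertools import combinations
from collections import Counter

transactions = [
    {"f1", "f2", "f5"},   # files flagged as mutually similar in one pass
    {"f1", "f2"},
    {"f2", "f5"},
    {"f1", "f2", "f5"},
]
min_support = 2

counts = Counter()
for t in transactions:
    for size in (1, 2, 3):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

# Itemsets of files that co-occur at least min_support times.
frequent = {iset: c for iset, c in counts.items() if c >= min_support}
print(frequent)
```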
“…Zhuo Li et al. [9] combined a dynamic text-matching algorithm with a suffix-tree algorithm to find similar code within source files, implementing a similar-code detection tool that also incorporates the abstract syntax tree method. Michel Chilowicz et al. [10] detected source code similarity at the function level through factorization of function call graphs. Sharma A et al. [11,12] determined the similarity of two functions from the similarity of their internal operating instructions, and from that derived the similarity of the two applications.…”
The main purpose of this study is to find code that is likely to be duplicated, so as to avoid the adverse effects of code duplication. Document feature information is first clustered as a pretreatment step to extract the relevant features of each document. These basic features are then used to cluster the documents and find the best number of clusters. With a reasonable cluster count determined, vectors generated by the TF-IDF method are combined with the K-means clustering algorithm to distinguish file contents, and cosine similarity is introduced to measure the similarity of two texts and classify parallel documents. On the test data set, the method accurately finds code that is likely to be duplicated and works quite well.
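A minimal sketch of the described pipeline, assuming scikit-learn as the tooling (the excerpt does not name a library): TF-IDF vectors, K-means clustering, and cosine similarity between two documents; the sample snippets and cluster count are placeholders.

```python
# Minimal sketch of the described pipeline using scikit-learn (an assumed
# tool choice, not necessarily what the authors used): TF-IDF vectors,
# K-means clustering, and cosine similarity between two documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder "documents"; in practice these would be source code files.
documents = [
    "int add(int a, int b) { return a + b; }",
    "int sum(int x, int y) { return x + y; }",
    "void print_hello() { printf(\"hello\"); }",
]

vectors = TfidfVectorizer().fit_transform(documents)

# Cluster the documents; the cluster count would normally come from the
# pretreatment step described above.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Cosine similarity between the first two documents.
sim = cosine_similarity(vectors[0], vectors[1])[0][0]
print(labels, sim)
```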