Comparison and evaluation of code clone detection techniques and tools: A qualitative approach

Roy, Chanchal K.; Cordy, James R.; Koschke, Rainer

doi:10.1016/j.scico.2009.02.007

Cited by 809 publications

(521 citation statements)

References 82 publications

Citation types: Supporting, Mentioning (507), Contrasting, Unclassified


“…SourcererCC has perfect recall for first three clone types, including the most difficult Type-3 clones, for Java, C and C#. This tells us that it's clone detection algorithm is capable of handling all the types of edits developers make on copy and pasted code for these languages, as outlined in the editing taxonomy for cloning [27].…”

Section: Recall Measured By the Mutation Framework (mentioning)

confidence: 95%


SourcererCC

Sajnani, Saini, Svajlenko, et al. 2016

Proceedings of the 38th International Conference on Software Engineering

Self Cite

354

Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.
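The abstract above describes querying an inverted index for the potential clones of a code block and then comparing tokens. A minimal Python sketch of that general idea follows. It is not SourcererCC's implementation: the tokenizer, the function names, the 0.8 threshold, and the overlap criterion (shared tokens at least `threshold` times the size of the larger block) are all assumptions made for illustration, and the filtering heuristics based on token ordering are omitted.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    """Crude, hypothetical tokenizer: identifiers, numbers, single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def build_index(bags):
    """Inverted index: token -> set of ids of the blocks that contain it."""
    index = defaultdict(set)
    for block_id, bag in bags.items():
        for token in bag:
            index[token].add(block_id)
    return index


def find_clones(sources, threshold=0.8):
    """Return sorted (id, id) pairs whose token bags overlap enough.

    Assumed criterion: shared tokens >= ceil(threshold * size of larger block).
    """
    bags = {block_id: Counter(tokenize(src)) for block_id, src in sources.items()}
    index = build_index(bags)
    pairs = set()
    for block_id, bag in bags.items():
        # Candidates are the blocks sharing at least one token with this block.
        candidates = set()
        for token in bag:
            candidates |= index[token]
        for other in candidates:
            if other <= block_id:
                continue  # skip self and report each pair only once
            larger = max(sum(bag.values()), sum(bags[other].values()))
            shared = sum((bag & bags[other]).values())  # multiset intersection
            if shared >= math.ceil(threshold * larger):
                pairs.add((block_id, other))
    return sorted(pairs)


if __name__ == "__main__":
    example = {
        "a": "int sum(int a, int b) { return a + b; }",
        "b": "int sum(int x, int b) { return x + b; }",
        "c": "void log(String m) { System.out.println(m); }",
    }
    print(find_clones(example))
```

In this sketch the index only narrows the set of blocks to compare; a scalable detector would also need to shrink the index itself, which is the role the abstract assigns to the filtering heuristics.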


“…As per mutation-analysis, this is repeated thousands of times. Further details, including a list of the mutation operators, is available in our earlier studies [23,27,35]. Procedure.…”

Section: Recall Measured By the Mutation Framework (mentioning)

confidence: 99%


“…The most established term is type-4 clone. Yet, the definition in Roy, Cordy & Koschke (2009) emphasises that the code fragments have to perform the same computation. We want to emphasise the similarity, however.…”

Section: Terminology (mentioning)

confidence: 99%

“…(Roy, Cordy & Koschke, 2009) Functionally similar clone (FSC) Code fragments that provide a similar functionality w.r.t a given definition of similarity but can be implemented quite differently…”

Section: Related Work (mentioning)

confidence: 99%


How are functionally similar code clones syntactically different? An empirical study and a benchmark

Wagner, Abdulkhaleq, Bogicevic, et al. 2016

PeerJ Computer Science

Background. Today, redundancy in source code, so-called "clones" caused by copy&paste, can be found reliably using clone detection tools. Redundancy can arise also independently, however, not caused by copy&paste. At present, it is not clear how only functionally similar clones (FSC) differ from clones created by copy&paste. Our aim is to understand and categorise the syntactical differences in FSCs that distinguish them from copy&paste clones in a way that helps clone detection research. Methods. We conducted an experiment using known functionally similar programs in Java and C from coding contests. We analysed syntactic similarity with traditional detection tools and explored whether concolic clone detection can go beyond syntax. We ran all tools on 2,800 programs and manually categorised the differences in a random sample of 70 program pairs. Results. We found no FSCs where complete files were syntactically similar. We could detect a syntactic similarity in a part of the files in <16% of the program pairs. Concolic detection found 1 of the FSCs. The differences between program pairs were in the categories algorithm, data structure, OO design, I/O and libraries. We selected 58 pairs for an openly accessible benchmark representing these categories. Discussion. The majority of differences between functionally similar clones are beyond the capabilities of current clone detection approaches. Yet, our benchmark can help to drive further clone detection research.


Layered similarity detection for programming plagiarism and collusion on weekly assessments

Karnalim, Simón, Chivers 2022

Comp Applic In Engineering

When weekly programming assessments are used, it is often the case that some of them are either trivial or strongly directed. Common code similarity detectors are not particularly helpful with such assessments: some potential instances of misconduct are not selected for manual investigation as all submissions are expected to be similar and it is not feasible to check them all. Several dedicated similarity detectors have been developed to work with such assessments, but the experience is required to determine when to use them. This paper presents a similarity detector that works on many kinds of weekly assessments. It combines three‐layered types of similarity so that even within a set of highly similar submissions, program pairs are still sorted according to their levels of similarity. Our similarity detector is more effective than JPlag in distinguishing similar programs and helping to identify plagiarism and collusion. The similarity detector is slower than JPlag, but the longer execution time is partly offset by some optimization that has no negative impact on the effectiveness. As weekly assessments seldom entail large submissions, the execution time does not appear to be a barrier to use.

