A Technique for Just-in-Time Clone Detection in Large Scale Systems

Barbour, Liliane; Yuan, Haitao; Zou, Ying

doi:10.1109/icpc.2010.13

Cited by 9 publications

(5 citation statements)

References 12 publications


“…A multidimensional token-level indexing approach has been introduced by Lee et al [5] using an ‫כ‬ on DECKARD's [10] approximate vector matching. Optimization on the repository size using sampling techniques is another approach to achieve scalable real-time clone search (e.g., Barbour et al [2]). A more diverse approach to tree-based real-time clone search is hash table-based indexing.…”

Section: Definitions and Basic Terms (mentioning)

confidence: 99%


Internet-scale Real-time Code Clone Search Via Multi-level Indexing

Keivanloo

Rilling

Charland

2011

2011 18th Working Conference on Reverse Engineering


Finding lines of code similar to a code fragment across large knowledge bases in fractions of a second is a new branch of code clone research also known as real-time code clone search. Among the requirements real-time code clone search has to meet are scalability, short response time, scalable incremental corpus updates, and support for type-1, type-2, and type-3 clones. We conducted a set of empirical studies on a large open source code corpus to gain insight about its characteristics. We used these results to design and optimize a multi-level indexing approach using hash table-based and binary search to improve Internet-scale real-time code clone search response time. Finally, we performed an evaluation on an Internet-scale corpus (1.5 million Java files and 266 MLOC). Our approach maintains a response time for 99.9% of clone searches in the microseconds range, while supporting the aforementioned requirements.



“…Although there exists a large body of research on code clone detection [1], the one on real-time clone search is limited. It is still a rather new research area, also known as just-in-time [2], real-time [3,4], instant [5], or online clone search. It aims at finding all the fragments matching the input code fragment.…”

Section: Introduction (mentioning)

confidence: 99%

Internet-scale Real-time Code Clone Search Via Multi-level Indexing

Keivanloo

Rilling

Charland

2011

2011 18th Working Conference on Reverse Engineering


“…Text-based approaches: Text based approaches compare two code fragments based on the input text or string. The tools like Duploc [4], simian [18], EqMiner [19], NICAD [20], DuDe [21]. Except for NICAD none of the tools address detecting even small instances of Type 3.…”

Section: Software Clone Detection (mentioning)

confidence: 99%

“…Tree based approaches work by parsing the source code to parse tree. tools like Deckard [35], CloneDR [36], simScan [37] , Asta [38], CloneDigger [39], sim [40], ClemanX [41], JCCD API [42], CloneDetection [43], cpdetector [34]. These techniques did not detect type 4 clones.…”

Section: Token-based Techniques (mentioning)

confidence: 99%

A Methodology for Reliable Code Plagiarism Detection Using Complete and Language Agnostic Code Clone Classification

Ankali

Parthiban

2021

IJMECS


Code clone detection plays a vital role in both industry and academia. Last three decades have seen more than 250 clone detection techniques with lack of single framework that can detect and classify all 4 basic types of code clones with high precision. This serious lack of clone classification impacts largely on the universities and online learning platforms that fail to validate the projects or coding assignments submitted online. In this paper, we propose a complete and language agnostic technique to detect and classify all 4 clone types of C, C++, and Java programs. The method first generates the parse tree then extracts the functional tree to eliminate the need for the preprocessing stage employed by previous clone detection techniques. The generated parse tree contains all the necessary information for detecting code clones. We employ TF-IDF cosine similarity for the proper classification of clone types. The proposed technique achieves incredible precision rate of 100% in detecting the first two types of clones and 98% precision in detecting type-3 and type-4 clones for small codes of C, C++, and Java containing an average line count of 5. The proposed technique outperforms the existing tree-based clone detection tools by providing the average precision of 98.07% on the C, C++, and Java programs crawled from Github with an average line count of 15 which signifies that cosine similarity measure on ANTLR functional tree accurately detects all 4 types of small clones and act as proper validation tools for identifying the learning level in the submitted programming assignment.
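The abstract above names TF-IDF cosine similarity as its similarity measure. As a rough illustration only, the following sketch scores code fragments that way. It assumes a flat token stream produced by a hypothetical regex lexer (`tokenize`) instead of the ANTLR functional tree the paper describes, and a smoothed IDF formula chosen for this example; none of the function names come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Hypothetical lexer (an assumption): identifiers, numbers,
    # and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def tfidf_vectors(docs):
    # docs is a list of token lists; returns one {term: weight} dict per document.
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency counts each term once per doc
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps terms that occur in every
        # document at a non-zero weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two textually identical fragments score (up to rounding) 1.0 under this measure, while fragments that share only punctuation score much lower; where to place the thresholds that separate clone types is a design decision this sketch does not address.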


“…Fragment search aims at finding clones of one particular code fragment . This type of code search is used, for instance, to localize code fragments where similar defects must be corrected, to find similar reusable code or examples for working solutions for a given problem, or to avoid cloning or update anomalies within an integrated development environment while a programmer is working on a particular piece of code .…”

Section: Introduction (mentioning)

confidence: 99%

Large‐scale inter‐system clone detection using suffix trees and hashing

Koschke

2013

J Software Evolu Process


Detecting a similar code between two systems has various applications such as comparing two software variants or versions or finding potential license violations. Techniques detecting suspiciously similar code must scale in terms of resources needed to very large code corpora and need to have high precision because a human needs to inspect the results. This paper demonstrates how suffix trees can be used to obtain a scalable comparison. The evaluation is carried out for very large code corpora. Our evaluation shows that our approach is faster than index-based techniques when the analysis is run only once. If the analysis is to be conducted multiple times, creating an index pays off. We report how much code can be filtered out from the analysis using an index-based filter. In addition to that, this paper proposes a method to improve precision through user feedback. A user validates a sample of the found clone candidates. An automated data mining technique learns a decision tree on the basis of the user decisions and different code metrics. We investigate the relevance of several metrics and whether criteria learned from one application domain can be generalized to other domains.

All of the aforementioned variants of clone detection are facing challenges with respect to detection quality and scalability. Detection quality requires high recall and high precision in finding the relevant code. Relevance depends on the use case. In particular, inter-system and intra-system clone detections need to deal with re-occurring similar code that is similar from a lexical or syntactical point of view, but that is not interesting for the given task. Frequent examples of such irrelevant similar code are import statement lists, array initializers, setter/getter sequences, or sequences of pure declarations or simple assignments.

Another challenge is scalability. Whereas intra-system clone detection searches only within one system, inter-system clone search may face a much larger code base, often larger by orders of magnitude. Also, fragment search may face this problem, when the code is searched in very large software repositories [3,4].

Several researchers have recently proposed to use an index-based code search to address scalability for the search in very large code bases [13,3,4,17,18].

The index-based techniques first create an index against which code of a subject system is compared later. The purpose of the index is to identify the code that has a chance of being similar. The code filtered out by the index is not compared. The index is a first seed of a similar code fragment. This seed is then extended by merging with neighboring similar code fragments [13,3,4].

Creating the index can be expensive. The idea is to invest upfront in an index that is created only once but whose cost is amortized in multiple subsequent searches.

Contributions. Our conference paper introduced a way to extend traditional suffix-tree-based clone detection for inter-system clone search that scales for very large programs [19]. This approach avoids the nee...
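The index-based scheme summarized above (build an index once, use hits as seeds, extend seeds by merging neighbors) can be sketched roughly as follows. This is a minimal illustration, not the method of any cited tool: the window length `WINDOW`, the whitespace-stripping `normalize`, and all function names are assumptions, and positions count non-blank lines only.

```python
from collections import defaultdict

WINDOW = 3  # hypothetical seed length, in non-blank lines


def normalize(line):
    # Assumed normalization: drop all whitespace so layout changes do not matter.
    return "".join(line.split())


def windows(lines, size=WINDOW):
    # Yield (position, key) for every run of `size` consecutive non-blank lines.
    norm = [normalize(line) for line in lines if line.strip()]
    for i in range(len(norm) - size + 1):
        yield i, tuple(norm[i:i + size])


def build_index(corpus, size=WINDOW):
    # corpus maps a file name to its list of source lines. Built once, reused per query.
    index = defaultdict(list)
    for name, lines in corpus.items():
        for start, key in windows(lines, size):
            index[key].append((name, start))
    return index


def find_clones(index, query_lines, size=WINDOW):
    # Seeds: query windows found in the index. Code with no hit is never compared.
    seeds = sorted(
        (name, start - q, q)
        for q, key in windows(query_lines, size)
        for name, start in index.get(key, ())
    )
    runs = []  # each run: [file, offset between corpus and query, first window, count]
    for name, diag, q in seeds:
        if runs and runs[-1][:2] == [name, diag] and runs[-1][2] + runs[-1][3] == q:
            runs[-1][3] += 1  # neighboring seed on the same alignment: extend the run
        else:
            runs.append([name, diag, q, 1])
    # Report (file, corpus start, query start, length in lines).
    return [(name, q + diag, q, count + size - 1) for name, diag, q, count in runs]
```

For example, querying a four-line fragment that occurs verbatim starting at the second line of an indexed file yields two overlapping seeds, which merge into a single reported clone of length four. The up-front cost of `build_index` is what the text describes as amortized over repeated searches.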
