We propose a new token-based approach for large-scale code clone detection, based on a filtering heuristic that reduces the number of token comparisons made when two code blocks are compared. We also present a MapReduce-based parallel algorithm that uses the filtering heuristic and scales to thousands of projects. The filtering heuristic is generic and can be used in conjunction with other token-based approaches. In that context, we demonstrate how it can increase the retrieval speed and decrease the memory usage of index-based approaches. In our experiments on 36 open source Java projects, we found that: (i) filtering reduces token comparisons by a factor of 10, thereby increasing the speed of clone detection by a factor of 1.5; (ii) the speed-up and scale-up of the parallel approach using filtering is near-linear on a cluster of 2-32 nodes for 150-2800 projects; and (iii) filtering decreases the memory usage of the index-based approach by half and the search time by a factor of 5. The presented approach is general and can be used with other similarity functions such as Jaccard and Cosine.
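As a rough illustration of how filtering can shrink an index, the Python sketch below assumes the filter is the standard prefix-filtering bound from set-similarity search. With each block treated as a set of tokens sorted in one global order, two blocks that share at least ceil(theta * max(|B1|, |B2|)) tokens must also share a token among the first |B| - ceil(theta * |B|) + 1 tokens of each. That bound, the overlap-based clone test, the integer percentage threshold, and all names (`prefix_length`, `build_indexes`, `detect_clones`) are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def min_overlap(size, theta_pct):
    """ceil(theta * size), computed with integers to avoid float rounding."""
    return (size * theta_pct + 99) // 100


def prefix_length(size, theta_pct):
    """Number of leading tokens that must be indexed or probed for a block."""
    return size - min_overlap(size, theta_pct) + 1


def build_indexes(blocks, theta_pct):
    """blocks: dict mapping block id -> iterable of tokens (hypothetical input)."""
    # Global order: rarest tokens first, ties broken alphabetically.
    freq = Counter(t for terms in blocks.values() for t in set(terms))
    ordered_terms = sorted(freq, key=lambda t: (freq[t], t))
    position = {t: i for i, t in enumerate(ordered_terms)}

    partial_index = defaultdict(set)  # token -> ids of blocks with it in their prefix
    forward_index = {}                # block id -> all tokens, in global order
    for block_id, terms in blocks.items():
        ordered = sorted(set(terms), key=position.__getitem__)
        forward_index[block_id] = ordered
        for term in ordered[:prefix_length(len(ordered), theta_pct)]:
            partial_index[term].add(block_id)
    return partial_index, forward_index


def detect_clones(query_id, theta_pct, partial_index, forward_index):
    query = forward_index[query_id]

    # Step 1: fetch candidates by probing only the query's prefix tokens.
    candidates = set()
    for term in query[:prefix_length(len(query), theta_pct)]:
        candidates |= partial_index.get(term, set())
    candidates.discard(query_id)

    # Step 2: verify each candidate against its full token list.
    query_terms = set(query)
    clones = []
    for cand in sorted(candidates):
        cand_terms = set(forward_index[cand])
        needed = min_overlap(max(len(query_terms), len(cand_terms)), theta_pct)
        if len(query_terms & cand_terms) >= needed:
            clones.append(cand)
    return clones


if __name__ == "__main__":
    blocks = {
        "a": ["int", "i", "for", "sum", "return"],
        "b": ["int", "i", "for", "sum", "print"],
        "c": ["while", "x", "y", "z", "break"],
    }
    partial, forward = build_indexes(blocks, 80)
    print(detect_clones("a", 80, partial, forward))
    print(sum(len(ids) for ids in partial.values()), "postings in the partial index")
```

With an 80% threshold, each five-token block in this toy input contributes only two tokens to the partial index, so it holds 6 postings where a full inverted index would hold 15. The price is a second lookup structure holding every block's full token list, which the verification step needs.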
404 HITESH SAJNANI, VAIBHAV SAINI AND CRISTINA LOPES

Algorithm 5: Clone detection using efficient index-based

Index search. As in the naive approach, given a query block b1, detectClones() here also logically consists of the following two steps: (i) Fetch the candidate blocks: the terms in b1 are first ordered using the globalTermPositionMap. Next, each term in the prefix of the ordered b1 is searched in the partial index to retrieve the block ids of the candidate blocks. These block ids are added to candidatesList (lines 21-30, Algorithm 5). It is important to note that, unlike the naive approach, no similarity score is calculated here. This is because the partial index does not index all the terms of the blocks, and similarity calculation requires all the terms of the candidate code block. To address this issue, we create another index that stores all the terms of a block id. We call this index a forward index because its purpose is exactly opposite to that of an inverted index. A forward index, when queried with a code block id, returns all the terms in that code block, whereas an

heuristic to improve index-based approaches. We demonstrated that filtering reduces the index size by half and decreases the search time by a factor of 5.5. Our parallel algorithm using the filtering technique scales efficiently to thousands of projects and demonstrates near-linear speed-up and scale-up. Moreover, its MapReduce-based implementation has inherent advantages such as load balancing, data replication, and fault tolerance over in-house distributed solutions, where these concerns must be handled explicitly.

Support for replicating the study. We have made available the input dataset, tools, generated output, and the detailed steps to replicate the study at http://mondego.ics.uci.edu/projects/clonedetection. The web page has all the 36 subj...