Similar code may exist in large software projects due to some common software engineering practices, such as copying and pasting code and n-version programming. Although previous work has studied syntactic equivalence and small-scale, coarse-grained program-level and function-level semantic equivalence, it is not known whether significant fine-grained, code-level semantic duplications exist. Detecting such semantic equivalence is also desirable because it can enable many applications such as code understanding, maintenance, and optimization. In this paper, we introduce the first algorithm to automatically mine functionally equivalent code fragments of arbitrary size, down to an executable statement. Our notion of functional equivalence is based on input and output behavior. Inspired by Schwartz's randomized polynomial identity testing, we develop our core algorithm using automated random testing: (1) candidate code fragments are automatically extracted from the input program; and (2) random inputs are generated to partition the code fragments based on their output values on the generated inputs. We implemented the algorithm and conducted a large-scale empirical evaluation of it on the Linux kernel 2.6.24. Our results show that there exist many functionally equivalent code fragments that are syntactically different (i.e., they are unlikely to be due to copying and pasting code). The algorithm also scales to million-line programs; it was able to analyze the Linux kernel with several days of parallel processing.
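The partitioning step described above (step 2) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the candidate fragments have already been extracted and wrapped as Python callables taking two integers, whereas the paper works on code fragments extracted from C programs. The function name `partition_by_behavior`, the input range, and the toy fragments are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def partition_by_behavior(fragments, num_trials=20, seed=0):
    """Group fragments whose outputs agree on every generated random input.

    `fragments` maps a (hypothetical) fragment name to a callable taking two
    ints. Agreement on random inputs is only evidence of equivalence, not a
    proof; more trials lower the chance of a false match.
    """
    rng = random.Random(seed)
    inputs = [(rng.randint(-1000, 1000), rng.randint(-1000, 1000))
              for _ in range(num_trials)]

    groups = defaultdict(list)
    for name, fragment in fragments.items():
        signature = []
        for a, b in inputs:
            try:
                signature.append(("ok", fragment(a, b)))
            except Exception as exc:
                # Treat a raised exception as part of observable behavior.
                signature.append(("error", type(exc).__name__))
        # Fragments with identical output signatures land in the same group.
        groups[tuple(signature)].append(name)

    return sorted(sorted(names) for names in groups.values())


# Toy fragments: three syntactically different ways to compute |a - b|,
# plus one fragment that differs on every input.
fragments = {
    "abs_diff_branch": lambda a, b: a - b if a > b else b - a,
    "abs_diff_builtin": lambda a, b: abs(a - b),
    "max_minus_min": lambda a, b: max(a, b) - min(a, b),
    "off_by_one": lambda a, b: abs(a - b) + 1,
}

equivalence_classes = partition_by_behavior(fragments)
print(equivalence_classes)
```

In this sketch the three absolute-difference variants end up in one group even though their syntax differs, which is the kind of duplication a purely syntactic clone detector would be expected to miss. A group produced this way is a set of candidates for equivalence rather than a guarantee.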
Studies show that programs contain much similar code, commonly known as clones. One of the main reasons for introducing clones is programmers' tendency to copy and paste code to quickly duplicate functionality. It is commonly believed that clones can make programs difficult to maintain and introduce subtle bugs. Although much research has proposed techniques for detecting and removing clones to improve software maintainability, little has considered how to detect latent bugs introduced by clones. In this paper, we introduce a general notion of context-based inconsistencies among clones and develop an efficient algorithm to detect such inconsistencies for locating bugs. We have implemented our algorithm and evaluated it on large open source projects including the latest versions of the Linux kernel and Eclipse. We have discovered many previously unknown bugs and programming style issues in both projects (57 for the Linux kernel and 38 for Eclipse). We have also categorized the bugs and style issues and noticed that they exhibit diverse characteristics and are difficult to detect with any single existing bug detection technique. We believe that our approach complements these existing techniques well.