Code duplication is a well-documented problem in industrial software systems. There has been considerable research into techniques for detecting duplication in software, and there are several effective tools to perform this task. However, there have been few detailed qualitative studies into how cloning actually manifests itself within software systems. This is primarily due to the large result sets that many clone-detection tools return; these result sets are very difficult to manage without complementary tool support that can scale to the size of the problem, and this kind of support does not currently exist. In this paper we present an in-depth case study of cloning in a large software system that is in wide use, the Apache Web server; we provide insights into cloning as it exists in this system, and we demonstrate techniques to manage and make effective use of the large result sets of clone-detection tools. In our case study, we found several interesting types of cloning occurrences, such as 'cloning hotspots', where a single subsystem comprising only 17% of the system code contained 38.8% of the clones. We also found several examples of cloning behavior that were beneficial to the development of the system, in particular cloning as a way to add experimental functionality.

(1) facilities to evaluate the overall cloning situation; (2) mechanisms to guide users toward clones that are most relevant to their task; and (3) methods for filtering and refining the analysis of the clones. Each of these criteria is described in more detail below.
Overall system evaluation

As a first step in understanding cloning within a software system, regardless of the end goal, maintainers must understand the cloning at a high level of abstraction. This understanding allows the user to evaluate the extent and severity of the duplication in order to estimate the cost and/or necessity of the task.

Several mechanisms can be used to evaluate cloning from a high level. Visualization methods, such as scatterplots [1,3,4,12,15], are useful for discovering highly related subsystems and high levels of cloning within a subsystem. They are also useful for detecting unusual types of cloning, such as cloning from system libraries to other parts of the software system. Metric-oriented reports, which give figures such as the percentage of lines cloned and the average length of a clone, are useful for directing users to points in the system where the most cloning is occurring, or where cloning activity is unusually high in relation to subsystem size.
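A metric-oriented report of the kind described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CloneFragment` record, the `subsystem_report` function, the subsystem names, and the toy numbers are all hypothetical and are not taken from any clone-detection tool or from the case study. The sketch computes, per subsystem, the percentage of lines cloned and the average clone length, and it compares each subsystem's share of the clones with its share of the code, which is one simple way to surface subsystems where cloning is high relative to size.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneFragment:
    """One cloned code fragment (hypothetical record format)."""
    subsystem: str
    file: str
    start_line: int  # inclusive
    end_line: int    # inclusive

    @property
    def length(self) -> int:
        return self.end_line - self.start_line + 1


def subsystem_report(fragments, subsystem_loc):
    """Return per-subsystem cloning metrics as a dict of dicts.

    fragments:     list of CloneFragment
    subsystem_loc: dict mapping subsystem name -> total lines of code
    """
    cloned_lines = defaultdict(set)  # (file, line) pairs, so overlaps count once
    lengths = defaultdict(list)      # fragment lengths per subsystem
    for frag in fragments:
        lengths[frag.subsystem].append(frag.length)
        for line in range(frag.start_line, frag.end_line + 1):
            cloned_lines[frag.subsystem].add((frag.file, line))

    total_loc = sum(subsystem_loc.values())
    total_frags = len(fragments)
    report = {}
    for name, loc in subsystem_loc.items():
        frag_lengths = lengths.get(name, [])
        report[name] = {
            "pct_lines_cloned": 100.0 * len(cloned_lines.get(name, ())) / loc,
            "avg_clone_length": (sum(frag_lengths) / len(frag_lengths)
                                 if frag_lengths else 0.0),
            "code_share_pct": 100.0 * loc / total_loc,
            "clone_share_pct": (100.0 * len(frag_lengths) / total_frags
                                if total_frags else 0.0),
        }
    return report


# Toy data for illustration only; these are not measurements of any real system.
fragments = [
    CloneFragment("subsys_a", "a.c", 10, 19),
    CloneFragment("subsys_a", "a.c", 15, 24),  # overlaps the fragment above
    CloneFragment("subsys_a", "b.c", 1, 5),
    CloneFragment("subsys_b", "c.c", 1, 10),
]
subsystem_loc = {"subsys_a": 100, "subsys_b": 300}

report = subsystem_report(fragments, subsystem_loc)
for name, metrics in report.items():
    print(name, {key: round(value, 1) for key, value in metrics.items()})
```

Counting cloned lines as a set of (file, line) pairs is a design choice made here so that overlapping fragments do not inflate the percentage; a real tool might instead count clone pairs or clone classes, which would change the share figures.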
Guide and empower the user

The possibly large sets of clones returned by the clone-detection methods make it infeasible to look at each individual clone. There are several ways to direct users toward the clones they seek. Metrics can be used to query the dataset [16]. Some examples of metrics that might be used are the size of the clone, the types of changes made to the clone, and types of external dependencies a code segment has. Such a method can direct users to promising refactoring ...