Proceedings of the 5th International Conference on Predictor Models in Software Engineering 2009
DOI: 10.1145/1540438.1540448

Revisiting the evaluation of defect prediction models

Abstract: Defect prediction models aim at identifying error-prone parts of a software system as early as possible. Many such models have been proposed; their evaluation, however, is still an open question, as recent publications show. An important aspect often ignored during evaluation is the effort reduction gained by using such models. Models are usually evaluated per module by performance measures used in information retrieval, such as recall, precision, or the area under the ROC curve (AUC). These measures assume tha…

Cited by 146 publications (110 citation statements) · References 30 publications
“…Traditional performance metrics used in most previous work are precision, recall, f-measure, AUC [23], error sum, median error, error variance, and correlation [3]. Mende and Koschke [6], Arisholm et al. [25], and Rahman et al. [26] suggested that traditional performance metrics are not well-suited for evaluating defect prediction approaches in a practical scenario. Indeed, under traditional metrics, all defect-prone software artifacts have the same priority, while software engineers would benefit from identifying the software components containing more defects earlier.…”
Section: Background and Problem Description
Confidence: 99%
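A minimal sketch of how the traditional module-level metrics named in this statement are typically computed, assuming scikit-learn is available; the y_true and y_score arrays are illustrative placeholders, not data from the paper:

```python
# Hypothetical example: traditional defect-prediction metrics
# (precision, recall, F-measure, AUC) for a module-level classifier.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]                     # 1 = module is defect-prone
y_score = [0.2, 0.8, 0.6, 0.3, 0.4, 0.1, 0.7, 0.9]    # predicted defect probability
y_pred = [1 if p >= 0.5 else 0 for p in y_score]      # binarize at a 0.5 threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f-measure:", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))
```

Note that these measures treat every predicted defect-prone module alike, which is exactly the limitation the cited works point out.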
“…Indeed, under traditional metrics, all defect-prone software artifacts have the same priority, while software engineers would benefit from identifying the software components containing more defects earlier. As pointed out by Mende and Koschke [6] and D'Ambros et al. [3], the scenario that is more useful in practice is to rank the classes by the predicted number of defects they will exhibit. In the context of defect prediction, prediction models assign a defect probability to each class, according to which the classes can be ranked.…”
Section: Background and Problem Description
Confidence: 99%
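A minimal sketch of the ranking scenario described above: order classes by their predicted defect probability and tally the defects found as a reviewer works down the list. The class names, predicted probabilities, and defect counts here are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical example: rank classes by predicted defect probability and
# report cumulative defects found while inspecting them in that order.
classes = ["A", "B", "C", "D", "E"]
predicted_prob = [0.9, 0.2, 0.7, 0.4, 0.1]   # model output per class
actual_defects = [3, 0, 1, 2, 0]             # ground-truth defect counts

ranking = sorted(zip(classes, predicted_prob, actual_defects),
                 key=lambda t: t[1], reverse=True)

found = 0
total = sum(actual_defects)
for name, prob, defects in ranking:
    found += defects
    print(f"inspect {name}: predicted {prob:.1f}, "
          f"cumulative defects found {found}/{total}")
```

Evaluating how quickly defects accumulate along such a ranking is the effort-aware perspective these citation statements contrast with the traditional per-module metrics.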