Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis 2022
DOI: 10.1145/3533767.3534405

On the use of evaluation measures for defect prediction studies

Abstract: Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we further stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially given that most defect prediction tasks suffer from data imbalance. Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluatio…


Cited by 18 publications (9 citation statements)
References 41 publications
“…With that in mind, the greater the MCC, the better the solution. We opted to use MCC to assess and compare the accuracy of models as this measure has been strongly recommended as an alternative to other previously popular measures, such as F-measure, which have been shown to be biased [51,63,73] when the data is imbalanced (as is frequently the case in DP). MCC is a more balanced measure which, unlike the other measures, takes into account all the values of the confusion matrix [51,63].…”
Section: Fitness Functions
confidence: 99%
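The MCC described in this statement can be computed directly from the four confusion-matrix cells. The sketch below is illustrative only (not taken from the cited paper); it uses the standard MCC formula, with a conventional value of 0 when the denominator vanishes:

```python
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from the four confusion-matrix cells.

    Unlike F-measure, MCC uses all four cells (TP, TN, FP, FN), which is
    why it is recommended for imbalanced defect-prediction data.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal is empty (undefined ratio).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Imbalanced example: 90 non-defective, 10 defective modules.
# A trivial model that predicts everything non-defective:
print(mcc(tp=0, tn=90, fp=0, fn=10))  # 0.0 — MCC exposes the trivial classifier
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which is why "the greater the MCC, the better the solution."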
“…We use MCC to evaluate the prediction performance of the models given that we do not target a specific business context [45,51], and, as explained in Section 3.2, MCC is a comprehensive measure, which provides a full picture of the confusion matrix by assessing all its aspects equally. It is also not sensitive to highly imbalanced data and is widely used in the defect prediction and machine learning literature [51,63,73].…”
Section: Evaluation Criteria
confidence: 99%
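The F-measure bias this statement alludes to is easy to demonstrate: because F-measure ignores true negatives, a degenerate classifier can score highly on imbalanced data while MCC correctly reports zero. A small sketch with hypothetical counts (not data from the paper):

```python
from math import sqrt

def f1(tp: int, fp: int, fn: int) -> float:
    # F-measure: harmonic mean of precision and recall; TN never appears.
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    # MCC uses all four confusion-matrix cells.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical imbalanced set: 95 defective, 5 clean modules.
# A model that labels *everything* defective:
tp, tn, fp, fn = 95, 0, 5, 0
print(round(f1(tp, fp, fn), 3))  # 0.974 — F-measure looks excellent
print(mcc(tp, tn, fp, fn))       # 0.0   — MCC shows no discriminative power
```

This contrast is one reason the cited studies [51,63,73] argue against F-measure as the sole evaluation measure under class imbalance.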