Many studies have explored methods for deriving thresholds of object-oriented (OO) metrics. Unsupervised methods are mainly based on the distributions of metric values, while supervised methods rest principally on the relationships between metric values and the defect-proneness of classes. The objective of this study is to empirically examine whether effective threshold values of OO metrics exist, by analyzing existing threshold derivation methods in a large-scale meta-analysis. Based on five representative threshold derivation methods (VARL, ROC, BPP, MFM, and MGM) and 3268 releases from 65 Java projects, we first employ statistical meta-analysis and sensitivity analysis techniques to derive thresholds for 62 OO metrics on the training data. Then, we investigate the predictive performance of the five candidate thresholds for each metric on the validation data to determine which candidate can serve as the threshold. Finally, we evaluate the predictive performance of the selected thresholds on the test data. The experimental results show that 26 of the 62 metrics exhibit a threshold effect, and that the thresholds derived by meta-analysis achieve promising GM values and significantly outperform almost all of the five representative (baseline) thresholds.
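As a rough illustration of the validation step described above, the following sketch scores candidate thresholds for a single metric and keeps the best one. It assumes that GM denotes the geometric mean of recall and specificity (the abstract does not define it) and that a class is predicted defect-prone when its metric value exceeds the threshold. The function names `g_mean` and `best_candidate` and the toy data are hypothetical.

```python
from math import sqrt


def g_mean(metric_values, labels, threshold):
    """Return sqrt(recall * specificity) for a single-metric threshold rule.

    Assumption: a class is predicted defect-prone when value > threshold.
    """
    tp = fn = tn = fp = 0
    for value, defective in zip(metric_values, labels):
        predicted = value > threshold
        if defective and predicted:
            tp += 1
        elif defective:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes so the ratio is always defined.
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(recall * specificity)


def best_candidate(metric_values, labels, candidates):
    """Pick the candidate threshold name with the highest GM on the given data."""
    return max(candidates, key=lambda name: g_mean(metric_values, labels, candidates[name]))


# Toy validation data for one metric (hypothetical values).
values = [1, 2, 3, 10, 12, 15]
labels = [0, 0, 0, 1, 1, 1]
# Hypothetical candidate thresholds, keyed by the method that produced them.
candidates = {"ROC": 5, "VARL": 11, "BPP": 0}
print(best_candidate(values, labels, candidates))
```

In this toy setup the threshold of 5 separates the two groups perfectly, so it is the one selected. Real validation data would rarely separate this cleanly.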
Because defects in software modules (e.g., classes) might lead to product failure and financial loss, software defect prediction enables us to better understand and control software quality. Software development is a dynamic evolutionary process, so data distributions (e.g., defect characteristics) may vary from version to version. In this case, effective cross-version defect prediction (CVDP) is not easy to achieve. In this paper, we investigate whether a defect prediction method based on threshold-based active learning (TAL) can tackle the problem of differing data distributions between successive versions. Our TAL method includes two stages. At the active learning stage, a committee of the investigated metrics is constructed to vote on the unlabeled modules of the current version. We pass the unlabeled module with the median voting score to domain experts, who test and label it. Then, we merge the newly labeled module and the remaining modules of the current version, with their pseudo-labels, into the labeled modules of the prior version to form enhanced training data. Based on this training data, we derive the metric thresholds used for the next iteration. At the defect prediction stage, once the iterations stop on reaching a predefined threshold, we use the cutoff threshold of the voting scores, that is, 50%, to predict the defect-proneness of the remaining unlabeled modules. We evaluate the TAL method on 31 versions of 10 projects with three prevalent performance indicators. The results show that TAL outperforms the baseline methods, including three variant methods, two common supervised methods, and the state-of-the-art method Hybrid Active Learning and Kernel PCA (HALKP). The results indicate that TAL can effectively address the differing data distributions between successive versions. Furthermore, to keep the cost of extensive testing low in practice, selecting 5% of candidate modules from the current version is sufficient for TAL to achieve good defect prediction performance.
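The voting and selection steps of TAL described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each metric is assumed to cast one vote for "defect-prone" when a module's value exceeds that metric's threshold, the voting score is assumed to be the fraction of such votes, and a score of at least 50% is assumed to count as defect-prone. The function names, metric names, and threshold values are all hypothetical.

```python
def voting_scores(modules, thresholds):
    """Map each module name to the fraction of committee metrics voting 'defect-prone'.

    Assumption: a metric votes 'defect-prone' when the module's value exceeds
    that metric's threshold.
    """
    scores = {}
    for name, metrics in modules.items():
        votes = sum(1 for metric, limit in thresholds.items() if metrics[metric] > limit)
        scores[name] = votes / len(thresholds)
    return scores


def pick_median_module(scores):
    """Return the module whose voting score sits in the middle of the ranking.

    Ties are broken by module name; for an even count the upper middle is used.
    """
    ranked = sorted(scores, key=lambda name: (scores[name], name))
    return ranked[len(ranked) // 2]


def predict(scores, cutoff=0.5):
    """Label a module defect-prone when its voting score reaches the cutoff (assumed >=)."""
    return {name: score >= cutoff for name, score in scores.items()}


# Hypothetical unlabeled modules of the current version and committee thresholds.
modules = {
    "A": {"loc": 50, "cbo": 2, "wmc": 5},
    "B": {"loc": 300, "cbo": 9, "wmc": 40},
    "C": {"loc": 200, "cbo": 3, "wmc": 8},
}
thresholds = {"loc": 100, "cbo": 5, "wmc": 20}

scores = voting_scores(modules, thresholds)
to_label = pick_median_module(scores)  # module handed to the domain experts
predictions = predict(scores)
print(to_label, predictions)
```

In a full loop, the expert's label for the selected module would be merged with the prior version's labeled data, the thresholds would be re-derived, and the voting repeated until the stopping condition is met.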