Abstract-Machine learning classifiers have recently emerged as a way to predict the introduction of bugs in changes made to source code files. The classifier is first trained on software history, and then used to predict if an impending change causes a bug. Drawbacks of existing classifier-based bug prediction techniques are insufficient performance for practical use and slow prediction times due to a large number of machine learned features. This paper investigates multiple feature selection techniques that are generally applicable to classification-based bug prediction methods. The techniques discard less important features until optimal classification performance is reached. The total number of features used for training is substantially reduced, often to less than 10 percent of the original. The performance of Naive Bayes and Support Vector Machine (SVM) classifiers when using this technique is characterized on 11 software projects. Naive Bayes using feature selection provides significant improvement in buggy F-measure (21 percent improvement) over prior change classification bug prediction results (by the second and fourth authors [28]). The SVM's improvement in buggy F-measure is 9 percent. Interestingly, an analysis of performance for varying numbers of features shows that strong performance is achieved at even 1 percent of the original number of features.
Abstract-Recently, machine learning classifiers have emerged as a way to predict the existence of a bug in a change made to a source code file. The classifier is first trained on software history data, and then used to predict bugs. Two drawbacks of existing classifier-based bug prediction are potentially insufficient accuracy for practical use, and use of a large number of features. These large numbers of features adversely impact scalability and accuracy of the approach. This paper proposes a feature selection technique applicable to classification-based bug prediction. This technique is applied to predict bugs in software changes, and performance of Naïve Bayes and Support Vector Machine (SVM) classifiers is characterized.
Software monitoring systems have high performance overhead because they typically monitor all processes of the running program. For example, to capture and replay crashes, most current systems monitor all methods; thus yielding a significant performance overhead. Lowering the number of methods being monitored to a smaller subset can dramatically reduce this overhead. We present an approach that can help arrive at such a subset by reliably identifying methods that are the most likely candidates to crash in a future execution of the software. Our approach involves learning patterns from features of methods that previously crashed to classify new methods as crash-prone or non-crash-prone. An evaluation of our approach on two large open source projects, ASPECTJ and ECLIPSE, shows that we can correctly classify crash-prone methods with an accuracy of 80-86%. Notably, we found that the classification models can also be used for cross-project prediction with virtually no loss in classification accuracy. In a further experiment, we demonstrate how a monitoring tool, RECRASH could take advantage of only monitoring crash-prone methods and thereby, reduce its performance overhead and maintain its ability to perform its intended tasks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.