Software engineers are able to measure the quality of their code using a variety of metrics that can be derived directly from analyzing the source code. These internal quality metrics are valuable to engineers, but the organizations funding the software development effort find external quality metrics such as defect rates and time to develop features more valuable. Unfortunately, external quality metrics can only be calculated after costly software has been developed and deployed for end-users to utilize. Here, we present a method for mining data from freely available open source codebases written in Java to train a Random Forest classifier to predict which files are likely to be external quality hotspots based on their internal quality metrics with over 75% accuracy. We also used the trained model to predict hotspots for a Java project whose data was not used to train the classifier and achieved over 75% accuracy again, demonstrating the method’s general applicability to different projects.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.