To deliver high-quality software, its artifacts must function properly, which is usually verified by performing appropriate tests with limited resources. It is therefore desirable to identify defective artifacts so that they can be corrected before the testing process begins. So far, researchers have proposed various predictive models for this purpose. Such models are typically trained on data representing previous versions of a software project and then used to predict which of the software artifacts in the new version are likely to be defective. However, the data representing a software project usually consists of measurable properties of the project or its modules and leaves out information about the timeline of the software development process. To fill this gap, we propose a new set of metrics, namely aggregated change metrics, which are created by aggregating the data of all changes made to the software between two versions, taking into account the chronological order of the changes. In experiments conducted on open-source projects written in Java, we show that the stability and performance of commonly used classification models are improved by extending a feature set to include both measurable properties of the analyzed software and the aggregated change metrics.

INDEX TERMS: Classification, feature engineering, process metrics, change metrics, software defect prediction.
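The abstract does not spell out how the aggregation is performed, but the idea can be illustrated with a small sketch: per-commit change data is folded into per-file features, with a hypothetical recency weighting standing in for "taking into account the chronological order of the changes". The `Change` class, the metric names, and the linear weighting below are assumptions for illustration, not the authors' definitions.

```python
# Illustrative sketch only: the weighting scheme and metric names are assumptions,
# not the paper's actual aggregated change metrics.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Change:
    """A single commit touching one file; the list is assumed chronological."""
    file: str
    lines_added: int
    lines_deleted: int


def aggregated_change_metrics(changes: List[Change]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-commit churn into per-file features between two versions.

    Later changes receive a higher weight (hypothetical linear recency weighting)
    so that the chronological order of changes influences the resulting features.
    """
    n = len(changes)
    metrics: Dict[str, Dict[str, float]] = {}
    for i, ch in enumerate(changes):
        weight = (i + 1) / n  # recency weight in (0, 1]
        m = metrics.setdefault(
            ch.file,
            {"num_changes": 0.0, "total_churn": 0.0, "weighted_churn": 0.0},
        )
        churn = ch.lines_added + ch.lines_deleted
        m["num_changes"] += 1
        m["total_churn"] += churn
        m["weighted_churn"] += weight * churn
    return metrics


if __name__ == "__main__":
    history = [
        Change("Parser.java", 40, 5),
        Change("Lexer.java", 10, 2),
        Change("Parser.java", 3, 1),
    ]
    print(aggregated_change_metrics(history))
```

Features of this kind could then be concatenated with the usual measurable properties (e.g., size and complexity metrics) before training a classifier.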
Software defect prediction (SDP) aims to identify potentially defective software modules so that limited quality assurance resources can be allocated more effectively. Practitioners often do this with supervised models trained on historical data gathered by mining version control and issue tracking systems. Version control commits are linked to the issues they address, and if the linked issue is classified as a bug report, the change is considered bug-fixing. The problem is that issues are often incorrectly classified within issue tracking systems, which introduces noise into the gathered datasets. In this paper, we investigate the influence issue classification has on SDP dataset quality and the resulting model performance. To do this, we mine data from seven popular open-source repositories and create issue classification and SDP datasets for each of them. We investigate issue classification using four different methods: a simple keyword heuristic, an improved keyword heuristic, a FastText model, and a RoBERTa model. Our results show that using the RoBERTa model for issue classification produces the best SDP datasets, containing on average 14.3641% mislabeled instances. SDP models trained on such datasets achieve superior performance to those trained on datasets created using the other issue classification methods in 65 out of 84 experiments, with 55 of these differences being statistically significant. Furthermore, in 17 out of 28 experiments we could not show a statistically significant performance difference between SDP models trained on RoBERTa-derived SDP datasets and those trained on datasets created from manually labeled issues.
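The pipeline described above hinges on deciding whether a linked issue is a bug report. As a minimal sketch of the simplest of the four methods, the keyword heuristic, one might check the issue text against a list of bug-related terms; the keyword list, function names, and matching rule below are assumptions for illustration and are not taken from the paper (the improved heuristic and the FastText/RoBERTa classifiers are not reproduced here).

```python
# Illustrative sketch of a simple keyword heuristic for issue classification.
# The keyword list and matching rule are assumptions, not the paper's heuristic.
import re

BUG_KEYWORDS = {"bug", "fix", "fixes", "defect", "error", "crash", "fail", "failure"}


def is_bug_report(issue_title: str, issue_body: str = "") -> bool:
    """Return True if the issue text contains any bug-related keyword."""
    tokens = re.findall(r"[a-z]+", (issue_title + " " + issue_body).lower())
    return any(tok in BUG_KEYWORDS for tok in tokens)


def label_commit(linked_issue_title: str) -> str:
    """Label a commit as bug-fixing if its linked issue looks like a bug report."""
    return "bug-fixing" if is_bug_report(linked_issue_title) else "other"


if __name__ == "__main__":
    print(label_commit("Fix NullPointerException in Parser"))   # bug-fixing
    print(label_commit("Add support for custom config files"))  # other
```

Misclassifications made at this step propagate directly into the SDP dataset: a commit wrongly labeled bug-fixing marks the modules it touches as defective, which is exactly the noise the paper quantifies when comparing heuristic and model-based issue classification.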