BACKGROUND - Self-evidently, empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and upon using the same, rather than merely similar, versions of data sets. In recent years there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories. The publicly available NASA datasets have been extensively used as part of this research.
OBJECTIVE - This short note investigates the extent to which published analyses based on the NASA defect data sets are meaningful and comparable.
METHOD - We analyse the five studies published in IEEE Transactions on Software Engineering since 2007 that have utilised these data sets, and we compare the two versions of the data sets currently in use.
RESULTS - We find important differences between the two versions of the data sets, implausible values in one data set, and generally insufficient detail documented on data set pre-processing.
CONCLUSIONS - We recommend that researchers (i) indicate the provenance of the data sets they use, (ii) report any pre-processing in sufficient detail to enable meaningful replication, and (iii) invest effort in understanding the data prior to applying machine learners.
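The third recommendation is concrete enough to illustrate. Below is a minimal sketch, not taken from the paper, of the kind of pre-modelling screening it suggests: counting duplicate rows, missing values, and implausible metric combinations. The column name "loc" and the file name "kc1.csv" are assumptions for illustration; the actual NASA MDP files use metric names such as LOC_TOTAL, which vary between the two data set versions.

```python
import pandas as pd

def sanity_check(path: str) -> None:
    """Screen a defect data set for quality problems before any modelling."""
    df = pd.read_csv(path)

    # Exact duplicate rows: identical instances that straddle a
    # train/test split can inflate apparent classifier performance.
    print(f"duplicate rows: {df.duplicated().sum()}")

    # Rows with missing values, which studies sometimes drop or impute
    # without documenting the choice.
    print(f"rows with missing values: {df.isna().any(axis=1).sum()}")

    # Implausible values: a module with zero lines of code should not
    # report non-zero counts for other size-based metrics.
    num = df.select_dtypes("number")
    if "loc" in num.columns:
        others = num.drop(columns=["loc"]).sum(axis=1)
        n_bad = ((num["loc"] == 0) & (others > 0)).sum()
        print(f"zero-LOC modules with non-zero metrics: {n_bad}")

sanity_check("kc1.csv")  # hypothetical local copy of one NASA data set
```

Reporting the output of checks like these alongside any rows dropped or values imputed would satisfy recommendations (ii) and (iii) above.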