Software defect prediction suffers from class imbalance, and addressing this imbalance is important for improving prediction performance. SMOTE is a useful over-sampling method for addressing class imbalance. In this paper, we study several problems that arise when the SMOTE algorithm is used in software defect prediction. We perform experiments to investigate how two parameters, the percentage of appended minority-class samples and the number of nearest neighbors, influence prediction performance, and we compare the performance of classifiers. We use a paired t-test to assess the statistical significance of the results. We also introduce the notions of effectiveness and ineffectiveness of over-sampling, together with criteria for evaluating whether an over-sampling is effective, and we evaluate our results against these criteria. The results show that both the percentage of appended minority-class samples and the number of nearest neighbors influence prediction performance, and that over-sampling by SMOTE is effective for several classifiers.
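To make the two parameters studied above concrete, the following is a minimal, standard-library sketch of SMOTE-style over-sampling. It is an illustration under stated assumptions, not the implementation used in the paper: the function name `smote`, the parameters `percentage`, `k`, and `seed`, and the toy data are all hypothetical, and Euclidean distance is assumed for the neighbor search.

```python
import random
from math import dist


def smote(minority, percentage=100, k=5, seed=0):
    """Return synthetic minority-class samples (illustrative sketch).

    minority:   list of numeric feature tuples from the minority class
    percentage: number of synthetic samples, as a percentage of len(minority)
    k:          number of nearest minority neighbors considered per sample
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    n_new = round(len(minority) * percentage / 100)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic = []
    for i in range(n_new):
        idx = i % len(minority)  # cycle through the minority samples
        base = minority[idx]
        # k nearest minority neighbors of the base sample, excluding itself
        neighbors = sorted(
            (p for j, p in enumerate(minority) if j != idx),
            key=lambda p: dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical toy data: four defective modules described by two metrics.
defective = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(defective, percentage=200, k=2)
print(len(new_points))  # 200% of 4 samples gives 8 synthetic points
```

Raising `percentage` appends more synthetic minority samples, while raising `k` lets each synthetic point be interpolated toward more distant neighbors. These are the two settings whose influence the abstract reports.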
All papers cite references, but not all citations are equal, because references differ in citation-mention frequency: some are mentioned only once, while others are mentioned several times. Papers are cited by others when they are relevant to the citing paper, so a high citation-mention frequency may mean that a reference's content is more closely related to the citing paper. From this point of view, we examine the relevancy of a cited paper to a citing paper on the basis of citation-mention frequency. Two aspects of relevancy are considered: citation linkage and content. We construct a highly mentioned class of references and a rarely mentioned class of references, and we introduce the concepts of "reference-similarity" and "content-similarity." First, we count the number of co-cited references and calculate the reference-similarity. Second, we extract the abstracts of papers and calculate the content-similarity using the bag-of-words model. The results show that references from the highly mentioned class are more relevant to the citing papers than those from the rarely mentioned class.
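The two similarity measures described above can be sketched as follows. The abstract does not give the exact formulas, so this sketch makes its own assumptions: content-similarity is taken as the cosine similarity of bag-of-words count vectors, and reference-similarity as the Jaccard overlap of the two reference lists. The function names and sample inputs are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Map lowercase word tokens to their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def content_similarity(abstract_a, abstract_b):
    """Cosine similarity of bag-of-words vectors (assumed formula)."""
    a, b = bag_of_words(abstract_a), bag_of_words(abstract_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def reference_similarity(refs_a, refs_b):
    """Shared references divided by all distinct references (assumed formula)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical reference identifiers for a citing paper and a cited paper.
citing_refs = ["r1", "r2", "r3"]
cited_refs = ["r2", "r3", "r4"]
print(reference_similarity(citing_refs, cited_refs))  # 2 shared of 4 distinct -> 0.5
print(content_similarity("citation mention frequency", "mention frequency of citations"))
```

Under these assumptions, both scores fall between 0 and 1, so the highly mentioned and rarely mentioned classes of references can be compared on the same scale.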