Cross-project defect prediction (CPDP) methods can be used when the target project is a new project or lacks enough labeled program modules. In these new target projects, we can easily extract and then measure these modules with software measurement tools. However, labeling these program modules is time-consuming, error-prone and requires professional domain knowledge. Moreover, directly using labeled modules in the other projects (i.e., the source projects) can not achieve satisfactory performance due to the large data distribution difference in most cases. In this article, to our best knowledge, we are the first to propose a novel method ALTRA, which can utilize both active learning and TrAdaBoost to alleviate this issue. In particular, we firstly use Burak filter to select similar labeled modules from the source project after analyzing the unlabeled modules in the target project. Then we use active learning to choose representative unlabeled modules from the target project and ask experts to label the type (i.e., defective or non-defective) of these modules. Later, we use TrAdaBoost to determine the weights of labeled modules in the source project and the target project, and then construct the model via weighted support vector machine. After selecting a small number of modules (i.e., only 5% modules) in the target project, we terminate the method ALTRA and return the final constructed model. To show the effectiveness of our proposed method ALTRA, we choose 10 large-scale open-source projects from different application domains. In terms of both F1 and AUC performance indicators, we find ALTRA can perform significantly better than seven state-of-the-art CPDP baselines. Moreover, we also show that the usage of Burak filter, the uncertainty active learning strategy, the class imbalanced learning method and TrAdaBoost are competitive in our proposed method ALTRA.
Security vulnerability prediction (SVP) can construct models to identify potentially vulnerable program modules via machine learning. Two kinds of features from different points of view are used to measure the extracted modules in previous studies. One kind considers traditional software metrics as features, and the other kind uses text mining to extract term vectors as features. Therefore, gathered SVP data sets often have numerous features and result in the curse of dimensionality. In this article, we mainly investigate the impact of filter-based ranking feature selection (FRFS) methods on SVP, since other types of feature selection methods have too much computational cost. In empirical studies, we first consider three real-world large-scale web applications. Then we consider seven methods from three FRFS categories for FRFS and use a random forest classifier to construct SVP models. Final results show that given the similar code inspection cost, using FRFS can improve the performance of SVP when compared with state-of-the-art baselines. Moreover, we use McNemar's test to perform diversity analysis on identified vulnerable modules by using different FRFS methods, and we are surprised to find that almost all the FRFS methods can identify similar vulnerable modules via diversity analysis.This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.