A feature matching and transfer approach for cross-company defect prediction

Yu, Qiao; Jiang, Shujuan; Zhang, Yanmei

doi:10.1016/j.jss.2017.06.070

Cited by 62 publications

(38 citation statements)

References 25 publications

“…Yu et al reported that Naïve Bayes performs better than Logistic regression and KNN when using a 50% training set. The same study also reported that when using 10-fold cross validation, Logistic regression and KNN outperform Naïve Bayes [12]. One cause of such confounded results is that the number of candidate models is limited.…”

Section: Comparison With Other Studies

mentioning

confidence: 88%

Evaluating Software Defect Prediction Performance: An Updated Benchmarking Study

Lessmann

Baesens

2019

SSRN Journal


Accurately predicting faulty software units helps practitioners target faulty units and prioritize their efforts to maintain software quality. Prior studies use machine-learning models to detect faulty software code. We revisit past studies and point out potential improvements. Our new study proposes a revised benchmarking configuration. The configuration considers many new dimensions, such as class distribution sampling, evaluation metrics, and testing procedures. The new study also includes new datasets and models. Our findings suggest that predictive accuracy is generally good. However, predictive power is heavily influenced by the evaluation metrics and testing procedure (frequentist or Bayesian approach). The classifier results depend on the software project. While it is difficult to choose the best classifier, researchers should consider different dimensions to overcome potential bias.

“…target project by designing a feature matching algorithm to convert the heterogeneous features into the matched features according to the 'distance' of different distributing curves [12]. Ma et al proposed Kernel Canonical Correlation Analysis based transfer learning algorithm to improve the adaptive ability of prediction model [13].…”

Section: Proposed Framework

mentioning

confidence: 99%

“…The researchers focused on data processing before transfer learning. Yu et al achieve feature transfer from the source project to the target project by designing a feature matching algorithm to convert the heterogeneous features into the matched features according to the 'distance' of different distributing curves [12]. Ma et al proposed Kernel…”

mentioning

confidence: 99%

Heterogeneous Defect Prediction Based on Transfer Learning to Handle Extreme Imbalance

Jiang

Zhang

et al. 2020

Applied Sciences


Software systems are now ubiquitous and are used every day for automation purposes in personal and enterprise applications; they are also essential to many safety-critical and mission-critical systems, e.g., air traffic control systems, autonomous cars, and Supervisory Control And Data Acquisition (SCADA) systems. With the availability of massive storage capabilities, high speed Internet, and the advent of Internet of Things devices, modern software systems are growing in both size and complexity. Maintaining a high quality of such complex systems while manually keeping the error rate at a minimum is a challenge. This paper proposed a heterogeneous defect prediction method considering class extreme imbalance problem in real software datasets. In the first stage, Sampling with the Majority method (SWIM) based on Mahalanobis Distance is used to balance the dataset to reduce the influence of minority samples in defect data. Due to the negative impact of uncorrelated features on the classification algorithm, the second stage uses ensemble learning and joint similarity measurement to select the most relevant and representative features between the source project and the target project. The third phase realizes the transfer learning from the source project to the target project in the Grassmann manifold space. Our experiments, conducted using nine projects of three public domain software defect libraries and compared with four existing advanced methods to verify the effectiveness of the proposed method in this paper. The experimental results indicate that the proposed method is more accurate in terms of Area under curve (AUC).

…identification of the defective samples. Although the misclassification of defective samples does not significantly reduce the global classification accuracy, the accuracy of defective samples will decline, which is inconsistent with the goal of software defect prediction. Zhou et al. proposed a model which combined attribute selection, sampling technologies and ensemble algorithm to solve the class imbalance problem [4]. Huda et al. introduced a new mixed sampling strategy to generate more pseudo samples from defective classes, and combined random oversampling, Majority Weighted Minority Oversampling Technique, and Fuzzy-Based Feature-Instance Recovery to construct an integrated classifier [5]. It was proven that the prediction performance of Heterogeneous Defect Prediction (HDP) can be improved by balancing defect dataset.

At present, the research on SDP is mainly based on the defect prediction of homogeneous projects, which uses historical data of other projects to construct prediction model. The historical data have the same metrics as the target project, but they are distributed differently. Sufficient historical data are provided for the project to be predicted. However, the programming languages and application fields of different projects are often different, and the corresponding features and distribution are various. It is very difficult to construct a mod...
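The citation statement in this record describes Yu et al. as matching heterogeneous features according to the 'distance' between their distribution curves. The statement does not say which distance is used, so the following is only a minimal sketch under stated assumptions: it uses the two-sample Kolmogorov-Smirnov statistic as a stand-in distance and a greedy one-to-one pairing. The function names `ks_distance` and `match_features`, and the metric names in the example, are hypothetical and not taken from the cited paper.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (assumed stand-in distance):
    the largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def match_features(source, target):
    """Greedily pair each source feature with the closest unused target feature.

    source, target: dict mapping a feature name to its list of observed values.
    Returns a dict: source feature -> (target feature, distance).
    """
    # All candidate pairs, smallest distribution distance first.
    candidates = sorted(
        (ks_distance(s_vals, t_vals), s_name, t_name)
        for s_name, s_vals in source.items()
        for t_name, t_vals in target.items()
    )
    matched, used_source, used_target = {}, set(), set()
    for dist, s_name, t_name in candidates:
        if s_name not in used_source and t_name not in used_target:
            matched[s_name] = (t_name, dist)
            used_source.add(s_name)
            used_target.add(t_name)
    return matched


# Hypothetical metric values from two projects with different metric sets.
source = {"loc": [10, 20, 30, 40], "complexity": [1, 2, 3, 4]}
target = {"lines": [12, 22, 28, 41], "cyclomatic": [1, 2, 2, 5]}
matched = match_features(source, target)
```

In this toy example `loc` pairs with `lines` and `complexity` with `cyclomatic`, because those pairs have the most similar empirical distributions. Real metric sets would likely need normalization before comparison, since raw scales differ between projects.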


“…This method includes the metric selection phase and metric matching phase. Then Yu et al presented a feature matching method to convert the heterogeneous features into the matched features and presented a feature transfer method to transfer the matched features from the source project to the target project. Jing et al proposed unified metric representation (UMR) for the data of the source project and the target project, then they used canonical correlation analysis (CCA) to make the data distribution similar.…”

Section: Background and Related Work

mentioning

confidence: 99%

Do different cross‐project defect prediction methods identify the same defective modules?

Chen

Qu

et al. 2019

J Software Evolu Process


Cross‐project defect prediction (CPDP) is needed when the target projects are new projects or the projects have less training data, since these projects do not have sufficient historical data to build high‐quality prediction models. The researchers have proposed many CPDP methods, and previous studies have conducted extensive comparisons on the performance of different CPDP methods. However, to the best of our knowledge, it remains unclear whether different CPDP methods can identify the same defective modules, and this issue has not been thoroughly explored. In this article, we select 12 state‐of‐the‐art CPDP methods, including eight supervised methods and four unsupervised methods. We first compare the performance of these methods in the same experiment settings on five widely used datasets (ie, NASA, SOFTLAB, PROMISE, AEEEM, and ReLink) and rank these methods via the Scott‐Knott test. Final results confirm the competitiveness of unsupervised methods. Then we perform diversity analysis on defective modules for these methods by using the McNemar test. Empirical results verify that different CPDP methods may lead to difference in the modules predicted as defective, especially when the comparison is performed between the supervised methods and unsupervised methods. Finally, we also find there exist a certain number of defective modules, which cannot be correctly identified by any of the CPDP methods or can be correctly identified by only one CPDP method. These findings can be utilized to design more effective methods to further improve the performance of CPDP.
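The abstract above mentions using the McNemar test to check whether two CPDP methods differ in which modules they identify correctly. The abstract does not give the exact test configuration, so the sketch below assumes the common continuity-corrected chi-square form with one degree of freedom. The function name `mcnemar` and the example data are hypothetical.

```python
import math


def mcnemar(correct_a, correct_b):
    """Continuity-corrected McNemar test on paired per-module outcomes.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    module, True when that method classified the module correctly.
    Returns (chi-square statistic, p-value).
    """
    # Only discordant pairs carry information about a difference.
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if only_a + only_b == 0:
        return 0.0, 1.0  # the two methods never disagree
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / (only_a + only_b)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical outcomes: method A alone is right on 10 modules,
# method B alone on 2, and both agree on the remaining 20.
a = [True] * 10 + [False] * 2 + [True] * 20
b = [False] * 10 + [True] * 2 + [True] * 20
stat, p_value = mcnemar(a, b)
```

A small p-value suggests the two methods succeed on different modules more often than chance alone would explain. With very few discordant pairs, an exact binomial version of the test is usually preferred over this approximation.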
