An Improved Method for Cross-Project Defect Prediction by Simplifying Training Data

He, Peng; He, Yao; Yu, Lvjun; Li, Bing

doi:10.1155/2018/2650415

Cited by 16 publications

(19 citation statements)

References 44 publications


“…Peng He et al [19] developed TD selector using defects and similarity as a weighted function. He utilized logistic regression as a classifier model and analyzed the effects of several combinations of normalization and similarity of defects on the performance of prediction.…”

Section: Literature Survey (mentioning)

confidence: 99%

Cross-projects software defect prediction using spotted hyena optimizer algorithm

2020


Cross-project software defect prediction improves the quality of new software projects or projects with a shortage of historical data. Therefore, various data mining techniques are recommended in this field. Classification accuracy is considered one of the most significant problems because historical data are scarce and heterogeneous. To address this challenge, this research uses a spotted hyena optimizer algorithm as a classifier to predict defects across projects. Confidence and Support are used as a multi-objective fitness function to search for the best classification rules. These classification rules are used to predict defects for new projects or other projects with insufficient data. The NASA datasets JM1, KC1, and KC2 are used. By applying the spotted hyena optimizer algorithm as a classifier on one dataset and predicting defects in the other two datasets, accuracy is reported as 84.6, 92.0, 82.4, 90.7, 86.6, and 81.8 for JM1, KC1, and KC2, respectively. These accuracy values are better than those of the most significant data mining techniques in the field, such as Support Vector Machine, Naïve Bayes, Boosting, C4.5, and Bagging. The research also discusses other performance measures such as precision, recall, and f-measure. The conclusion is that many McCabe and Halstead features have a strong impact on generating highly accurate defect predictors, such as McCabe's line count of code, McCabe's cyclomatic complexity, McCabe's essential complexity, McCabe's design complexity iv, Halstead's effort, Halstead's time estimator, Halstead's line count, Halstead's count of line of comments, and total operators.
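The abstract above names Confidence and Support as the two fitness objectives for classification rules but does not define them. A minimal sketch, assuming the standard association-rule definitions and a rule of the form "IF condition THEN defective", is given below. The function name `support_and_confidence`, the metric keys `loc` and `cc`, and the threshold are hypothetical; how the paper combines the two objectives inside the optimizer is not shown here.

```python
def support_and_confidence(rows, labels, antecedent):
    """Score the rule "IF antecedent(row) THEN defective".

    rows: list of dicts mapping metric name to value (hypothetical layout).
    labels: list of bools, True meaning the module is defective.
    antecedent: function taking a row and returning True if the rule fires.
    """
    total = len(rows)
    # Instances on which the rule fires at all.
    covered = sum(1 for row in rows if antecedent(row))
    # Instances on which the rule fires and the module really is defective.
    correct = sum(1 for row, y in zip(rows, labels) if antecedent(row) and y)
    support = correct / total if total else 0.0
    confidence = correct / covered if covered else 0.0
    return support, confidence


# Toy data, invented for illustration only.
rows = [
    {"loc": 120, "cc": 15},
    {"loc": 30, "cc": 2},
    {"loc": 200, "cc": 22},
    {"loc": 90, "cc": 12},
]
labels = [True, False, True, False]
support, confidence = support_and_confidence(rows, labels, lambda r: r["cc"] > 10)
```

A rule with high confidence but very low support fits only a handful of modules, which is one plausible reason to optimise both measures together rather than either alone.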


“…Therefore, we drop the duplicate instances to keep them unique in source data. Different software metrics are usually with different magnitude and several studies 49,50 indicated that simple normalization can improve prediction performance. Therefore, we use Z‐score normalization 56 to scale each metric of the unique source data and the target data to have mean 0 and standard deviation 1 as the previous papers 19,60 did.…”

Section: Research Approach: WIFLF (mentioning)

confidence: 99%

WIFLF: An approach independent of the target project for cross‐project defect prediction

Cui, Liu, Wang

2022

J Software Evolu Process


Cross-project defect prediction (CPDP) is used to build defect prediction models when data from the target project are not enough. There have been several approaches to improve the performance of CPDP, such as feature transformation and instance selection methods. However, existing techniques depend strongly on the target data to reduce the distribution discrepancy between source and target projects. That is, the performance of these methods is determined by the effectiveness of feature transformation or the similarity between two projects. Additionally, when a large amount of source data needs to be matched with target data, matching takes much time and reduces the efficiency of model construction. Therefore, it is vital to explore a target project-agnostic approach to build CPDP models. This paper presents a Weighted Isolation Forest with class Label information Filter (WIFLF) to relieve the issues above. Four groups of datasets from AEEEM, Relink and PROMISE Data Repository are used to build CPDP models, and WIFLF is compared with 12 approaches. The experimental results indicate that WIFLF significantly outperforms all the baselines. Specifically, WIFLF with random forest significantly improves the performance over the baselines on average by at least 14.64% and 4.90% with respect to Skewed F-Measure and G-Measure, respectively.
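The citation statement quoted for this entry describes two preprocessing steps: dropping duplicate source instances and Z-score normalizing each metric to mean 0 and standard deviation 1. A minimal NumPy sketch of those two steps follows. It assumes that rows are metric vectors without labels, that source and target are scaled separately using their own statistics, and that constant metrics are left at zero; the quoted text does not settle any of these points, and the function name is hypothetical.

```python
import numpy as np


def drop_duplicates_and_zscore(source, target):
    """Remove duplicate source rows, then z-score every metric column.

    Assumption: source and target are each scaled with their own
    per-column mean and (population) standard deviation.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)

    # np.unique with axis=0 keeps one copy of each distinct row (sorted).
    unique_source = np.unique(source, axis=0)

    def zscore(x):
        mean = x.mean(axis=0)
        std = x.std(axis=0)
        std[std == 0] = 1.0  # constant metric: avoid division by zero
        return (x - mean) / std

    return zscore(unique_source), zscore(target)


# Toy metric matrices, invented for illustration only.
source = [[1, 10], [1, 10], [3, 30], [5, 50]]
target = [[2, 4], [4, 8]]
src_scaled, tgt_scaled = drop_duplicates_and_zscore(source, target)
```

Scaling the target with its own statistics needs no target labels, but an approach meant to be independent of the target project might instead reuse the source statistics; that choice is left open here.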


“…Finally, Logistic Regression was used for prediction. He et al [ 24 ] simplified the training set by TDSelector method and then classified it by Logistic Regression. Sun et al [ 25 ] proposed a near-some source project selection by collaborative filtering (CFPS) method to filter source items, which has good results using SMO and Random Forest as classifiers.…”

Section: Related Work (mentioning)

confidence: 99%

Cross-Project Defect Prediction Based on Two-Phase Feature Importance Amplification

Xing, Lin, et al. 2022

Computational Intelligence and Neuroscience


As the typical application of computational intelligence in software engineering, cross-project defect prediction (CPDP) uses labeled data from other projects (source projects) for building models to predict the defects in the current projects (target projects), helping testers quickly locate the defective modules. But class imbalance and different data distribution among projects make CPDP a challenging topic. To address the above two problems, we propose a two-phase feature importance amplification (TFIA) CPDP model in this paper which can solve these two problems from domain adaptation phase and classification phase. In the domain adaptation phase, the differences in data distribution among projects are reduced by filtering both source and target projects, and the correlation-based feature selection with greedy best-first search amplifies the importance of features with strong feature-class correlation. In the classification phase, Random Forest works as the classifier to further amplify the importance of highly correlated features and establish a model which is sensitive to highly correlated features. We conducted both ablation experiments and comparison experiments on the widely used AEEEM database. Experimental results show that TFIA can yield significant improvement on CPDP. And the performance of TFIA CPDP model in all experiments is stable and efficient, which lays a solid foundation for its further application in practical engineering.
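Several citation statements above describe the cited paper's TDSelector as choosing training data through a weighted function of defects and similarity before classifying with logistic regression. The sketch below illustrates that general idea only and is not the authors' implementation. The cosine similarity, the min-max normalization of defect counts, the weight `alpha`, the top-`k` cutoff, and all names are assumptions, and the classifier step is omitted.

```python
import numpy as np


def select_training_instances(candidates, defect_counts, target, alpha=0.5, k=2):
    """Rank candidate source instances by a weighted similarity/defect score.

    Assumed score: alpha * similarity + (1 - alpha) * normalised defects.
    Rows are assumed to be non-zero metric vectors.
    """
    candidates = np.asarray(candidates, dtype=float)
    target = np.asarray(target, dtype=float)
    defects = np.asarray(defect_counts, dtype=float)

    # Cosine similarity of each candidate to its closest target instance.
    cand_unit = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    targ_unit = target / np.linalg.norm(target, axis=1, keepdims=True)
    similarity = (cand_unit @ targ_unit.T).max(axis=1)

    # Min-max normalise defect counts to [0, 1].
    span = defects.max() - defects.min()
    if span > 0:
        defect_score = (defects - defects.min()) / span
    else:
        defect_score = np.zeros_like(defects)

    score = alpha * similarity + (1.0 - alpha) * defect_score
    top = np.argsort(-score)[:k]  # indices of the k highest scores
    return top, score


# Toy data, invented for illustration only.
candidates = [[1, 0], [0, 1], [1, 1]]
defect_counts = [0, 0, 4]
target = [[1, 0]]
top_mixed, _ = select_training_instances(candidates, defect_counts, target, alpha=0.5, k=1)
top_sim_only, _ = select_training_instances(candidates, defect_counts, target, alpha=1.0, k=1)
```

With `alpha=1.0` the selection reduces to pure similarity filtering, while smaller values shift the ranking toward instances with more recorded defects.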
