2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)
DOI: 10.1109/seaa.2018.00048
An Exploratory Study of Search Based Training Data Selection for Cross Project Defect Prediction

Abstract: Context: Search based approaches are gaining attention in cross project defect prediction (CPDP). The complexity of such approaches and existence of various design decisions are important issues to consider. Objective: We aim at investigating factors that can affect the performance of search based selection (SBS) approaches. We study a genetic instance selection approach (GIS) and present an evaluation of design options for search based CPDP. Method: Using an exploratory approach, data from different options o…
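The abstract describes search based selection of cross-project training data via a genetic instance selection (GIS) approach. As a rough illustration only, the sketch below shows what search based instance selection can look like: a genetic algorithm evolves binary masks over the candidate source instances and scores each mask by the F-measure of a classifier trained on the selected instances. This is not the authors' GIS implementation; the classifier, operators, population size, and fitness definition are illustrative assumptions.

```python
# Minimal sketch of search-based training-data (instance) selection for CPDP.
# NOT the paper's GIS implementation: classifier, operators and parameters are
# illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def fitness(mask, X_src, y_src, X_val, y_val):
    # F-measure of a model trained only on the selected source instances
    if mask.sum() == 0 or len(np.unique(y_src[mask])) < 2:
        return 0.0
    clf = GaussianNB().fit(X_src[mask], y_src[mask])
    return f1_score(y_val, clf.predict(X_val))

def genetic_instance_selection(X_src, y_src, X_val, y_val,
                               pop_size=30, generations=50, p_mut=0.02):
    n = len(X_src)
    pop = rng.random((pop_size, n)) < 0.5                  # random binary masks
    for _ in range(generations):
        scores = np.array([fitness(m, X_src, y_src, X_val, y_val) for m in pop])
        parents = pop[np.argsort(scores)[::-1][:pop_size // 2]]  # keep fittest half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                       # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut                 # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(m, X_src, y_src, X_val, y_val) for m in pop])
    return pop[scores.argmax()]                            # boolean mask of selected instances
```

In a cross-project setting the validation data used inside the fitness function would typically come from, or be made similar to, the target project; choices such as the fitness definition, the genetic operators, and the validation strategy are exactly the kind of design options the study sets out to evaluate.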

Cited by 5 publications (4 citation statements)
References 26 publications (38 reference statements)
“…• recall [86]
• accuracy [129]
• probability of false positive [65]
• true negative rate [121]
• balance [95]
• ROC-AUC [76]
• F-Measure [71]
• G-Measure [56]
• H-Measure [57]
• G-Mean [82]
• Win/Draw/Loss [59]
• false negative rate [121]
• Matthews correlation coefficient [121]
FIGURE 13. Evaluation measures frequency
Figure 13 shows that precision, recall, ROC-AUC, and F-measure are the preferred evaluation measures.…”
Section: E RQ5: Which Are the Evaluation Measures Applied to CPDP Mod... (mentioning)
confidence: 99%
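The measures enumerated in this citation statement are standard defect-prediction metrics derived from the binary confusion matrix. The snippet below computes several of them under their commonly used definitions; the formulas are assumptions based on standard usage, not taken from the cited survey itself.

```python
# Illustrative definitions of several measures listed above, computed from a
# binary confusion matrix. Standard formulations are assumed; zero-division
# guards are omitted for brevity.
import math

def cpdp_measures(tp, fp, tn, fn):
    recall    = tp / (tp + fn)                     # a.k.a. pd, true positive rate
    precision = tp / (tp + fp)
    pf        = fp / (fp + tn)                     # probability of false positive
    f_measure = 2 * precision * recall / (precision + recall)
    g_measure = 2 * recall * (1 - pf) / (recall + (1 - pf))   # harmonic mean of pd and 1-pf
    balance   = 1 - math.sqrt(pf ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))        # Matthews correlation coefficient
    return {"recall": recall, "precision": precision, "pf": pf,
            "F-measure": f_measure, "G-measure": g_measure,
            "balance": balance, "MCC": mcc}

print(cpdp_measures(tp=40, fp=10, tn=80, fn=20))
```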
“…F-Measure is selected in this study as the basis of our selection of the best approach. Additionally, a combination of F-measure and GMean, i.e., F×GMean, is used further as the basis for fitness assignment for LSH parameter tuning, learning technique hyper-parameter tuning [5,6,25].…”
Section: Performance Measures and Tools (mentioning)
confidence: 99%
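The citing work combines F-measure with G-Mean (the geometric mean of recall and specificity) into a single fitness value. A minimal sketch of such a combined score is shown below; the exact formulation used for LSH parameter tuning and hyper-parameter tuning in that study is assumed here, not quoted from it.

```python
# Sketch of an F x GMean score for binary labels {0, 1}; an illustrative
# assumption, not the citing paper's exact implementation.
import math
from sklearn.metrics import f1_score, recall_score

def f_times_gmean(y_true, y_pred):
    f = f1_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)                    # sensitivity
    specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of the negative class
    return f * math.sqrt(recall * specificity)               # F-measure x G-Mean
```

Multiplying the two terms penalizes models that score well on the defective class while misclassifying most non-defective modules, which makes a product of this kind attractive as a single tuning objective.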
“…Researchers build their prediction models based on software metrics derived from source code repositories (e.g., Change metrics [7], CK metrics [20], Object-oriented metrics [9]) using machine learning classifiers (e.g., Naive Bayes [21], Support Vector Machine [22], Decision Tree [23], Random Forest [24]) to classify faulty and non-faulty modules. The main challenge of CPDP is to reduce data divergence between source and target project data sets.…”
Section: A Statistical and Machine Learning Based CPDP (mentioning)
confidence: 99%
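The cross-project setup described here (train on metrics mined from one project, classify modules of another) can be sketched in a few lines. The file names, the "defective" label column, and the choice of Random Forest below are hypothetical placeholders, not details from the cited work.

```python
# Minimal cross-project defect prediction baseline: fit on the source project's
# metric data, evaluate on the target project. Inputs are hypothetical CSVs
# with one row per module and a binary 'defective' column.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

source = pd.read_csv("source_project.csv")   # e.g. CK / OO / change metrics + label
target = pd.read_csv("target_project.csv")

features = [c for c in source.columns if c != "defective"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(source[features], source["defective"])

pred = clf.predict(target[features])
print("cross-project F-measure:", f1_score(target["defective"], pred))
```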