Evaluation of Sampling-Based Ensembles of Classifiers on Imbalanced Data for Software Defect Prediction Problems

Khuat, Thanh Tung; Le, My Hanh

doi:10.1007/s42979-020-0119-4

Cited by 25 publications

(19 citation statements)

References 54 publications

(60 reference statements)

“…In recent years, JIT-SDP has become a research hotspot in the field of defect prediction because of its fine-grained and instant traceability. In the software defect prediction problem, Khuat et al [5] empirically evaluated the importance of sampling various classifier sets of imbalanced data by combining sampling technology and ensemble learning model and predicted positive effects for data with category imbalance problem. Zhu et al [6] proposed a just-in-time defect prediction model DAECNN-JDP based on a denoising autoencoder and convolutional neural network.…”

Section: Just-in-time Software Defect Prediction (mentioning)

confidence: 99%

Software defect prediction based on nested-stacking and heterogeneous feature selection

Chen; Wang; Song (2022), Complex Intell. Syst.

Software testing guarantees the delivery of high-quality software products, and software defect prediction (SDP) has become an important part of software testing. Software defect prediction is divided into traditional software defect prediction and just-in-time software defect prediction (JIT-SDP). However, most of the existing software defect prediction frameworks are relatively simplified, which makes it extremely difficult to provide developers with more detailed reference information. To improve the effectiveness of software defect prediction and realize effective software testing resource allocation, this paper proposes a software defect prediction framework based on Nested-Stacking and heterogeneous feature selection. The framework includes three stages: data set preprocessing and feature selection, Nested-Stacking classifier, and model classification performance evaluation. The novel heterogeneous feature selection and nested custom classifiers in the framework can effectively improve the accuracy of software defect prediction. This paper conducts experiments on two software defect data sets (Kamei, PROMISE), and demonstrates the classification performance of the model through two comprehensive evaluation indicators, AUC, and F1-score. The experiment carried out large-scale within-project defect prediction (WPDP) and cross-project defect prediction (CPDP). The results show that the framework proposed in this paper has an excellent classification performance on the two types of software defect data sets, and has been greatly improved compared with the baseline models.

“…The prediction of defects in software systems is very important and there is great interest in the development of novel high-performance software defect predictors. The purpose of SDP models is to improve the quality of software application systems [15]. Many models have been constructed to recognize the defects in software modules using artificial intelligence and statistical methods [1,18,19,20,21,22].…”

Section: Related Work (mentioning)

confidence: 99%

“…This study selects imbalanced datasets from the public PROMISE repository for experimental purposes [12,13,14], so this motivates a solution such as applying the sampling methods and there is great interest in building unbiased classifiers that start from imbalanced software defect data. Although several experiments in the previous studies [12,15,16,17] are conducted based on these datasets using many ML models, very few of them are based on CNN and GRU. Even there is no experiment using CNN and GRU combined with oversampling techniques in the literature.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Approach for Software Defect Prediction using CNN and GRU Based on SMOTE Tomek Method

Khleel; Nehéz (2022), Preprint

Software defect prediction (SDP) plays an important role in enhancing the quality of software projects and reducing maintenance-based risks through the ability to detect defective software components. SDP refers to the methods that use historical defect data to build the relationship between software metrics and software defects. Several prediction models such as machine learning (ML), deep learning (DL) have been developed and adopted to recognize defect in software modules and many methodologies and frameworks have been presented. One of the most difficult problems that these models face in binary classification is the classes imbalance. When the distribution of classes is unbalanced, the accuracy may be high, but the model cannot recognize data instances in the minority class, this will lead to weak classifications. So far, few research have been done in the previous studies that address the problem of class imbalance in SDP. To address the class imbalance problem, we propose a novel SDP approach based on convolutional neural network (CNN) and gated recurrent unit (GRU) combined with synthetic minority oversampling technique plus Tomek link (SMOTE Tomek). To establish the efficiency of the proposed models, the experiments have been conducted on benchmark datasets which obtained from the PROMISE repository and the experimental results have been compared and evaluated in terms of accuracy, precision, recall, f-measure, the area under the ROC curve (AUC), the area under the precision-recall curve (AUCPR), mean square error (MSE). The average accuracy of the proposed models on the original datasets were 89% for CNN and 87% for GRU, while the average accuracy of the proposed models on the balanced datasets were 94% for CNN and 92% for GRU. The results showed that the proposed models on the balanced datasets improves the average accuracy by 5% for both models compared to original datasets. 
This indicates the positive effects of combining ML techniques with data balancing methods on the performance of defect prediction regarding datasets with imbalanced class distributions.
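The abstract's point that accuracy can be high while the minority class goes unrecognized can be illustrated with a minimal sketch. The 90/10 class split and the always-predict-clean classifier below are hypothetical, chosen only for illustration; they are not taken from the cited paper or its datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical imbalanced data: 90 clean modules (0), 10 defective modules (1).
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that labels every module as clean.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f} recall={rec:.2f}")  # accuracy=0.90 recall=0.00
```

Here the classifier scores 90% accuracy yet finds none of the defective modules, which is why the abstract also reports recall, f-measure, AUC, and AUCPR.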

“…The longitudes and latitudes are combined. In this study, the outliers were oversampled using the SMOTE [38] algorithm based on the original 50 outliers and expanded it to a total data percentage of 50% with a difference of 5% to test the robustness and efficacy of the LOKI technique. Table 3 shows the proportions and volumes of data added.…”

Section: A Dataset (mentioning)

confidence: 99%

Unsupervised Outlier Detection Mechanism for Tea Traceability Data

Yang et al. (2022), IEEE Access

The presence of outliers in tea traceability data can mislead customers and have a significant impact on the reputation and profits of tea companies. To solve this problem, an unsupervised outlier detection mechanism for tea traceability data is proposed. Firstly, tea traceability data is uploaded to the MySQL database, and then the data is preprocessed to aggregate features based on relevance, which makes it easier to identify abnormal features. Secondly, the LOKI algorithm based on Local Outlier Factor (LOF), Isolation Forest (IForest), and K-Nearest Neighbors (KNN) algorithms is used to achieve unsupervised outlier detection of tea traceability data. In addition, a Density-Based Spatial Clustering of Applications with Noise (DBSCAN-based) tuning method for unsupervised outlier detection algorithms is also provided. Finally, the types of anomalies among the identified outliers are identified to investigate the causes of the anomalies in order to develop remedial procedures to eliminate the anomalies, and the analysis results are fed back to the tea companies. Experiments on real datasets show that the DBSCAN-based tuning method can effectively help the unsupervised outlier detection algorithm optimize the parameters, and that the LOF-KNN-IForest (LOKI) algorithm can effectively identify the outliers in tea traceability data. This proves that the unsupervised outlier detection mechanism for tea traceability data can effectively guarantee the quality of tea traceability data.
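The citation statement for this entry mentions oversampling outliers with SMOTE. As a rough illustration of SMOTE's core interpolation step only, the sketch below generates synthetic points on the line segment between a minority point and one of its nearest minority neighbours. The function name `smote_like_samples`, the parameters `k` and `seed`, and the toy points are assumptions made for this example; this is not the cited authors' implementation.

```python
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position along the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority points (for example, outliers to be oversampled).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority_points, n_new=5)
print(len(new_points))  # 5
```

Because each synthetic point is a convex combination of two existing minority points, it always falls within the region spanned by the original minority samples.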
