Online Defect Prediction for Imbalanced Data

Tan, Ming; Tan, Lin; Dara, Sashank; Mayeux, Caleb

doi:10.1109/icse.2015.139

Cited by 228 publications

(209 citation statements)

References 48 publications

Supporting

Mentioning

208

Contrasting


“…These studies showed that oversampling approach helps in achieving better prediction performance when dataset is imbalanced. In addition, most of the reported works (e.g., Li & Wang, 2014; Pelayo & Dick, 2007; Shatnawi, 2012; Tan, Tan, Dara, & Mayeux, 2015) on software fault prediction have used oversampling approach for generating synthetic values. Due to these reasons, we have used an oversampling approach in this study.…”

Section: Resampling the Training Subsets (mentioning)

confidence: 98%

Towards an ensemble based system for predicting the number of software faults

Rathore

Kumar

2017

Expert Systems with Applications

“…Using the functionality provided by Spark, creating the data splits as required for the approach by Tan et al [71] as well as the training of the defect prediction model was straightforward. After fetching the data from the MongoDB, a Map job was used to prepare the data.…”

Section: Defect Prediction (mentioning)

confidence: 99%

“…We selected a change-based defect prediction model based on a recent publication by Tan et al [71]. The approach by Tan et al suggests to use the first part of a project as training data, then leave a gap and predict the remainder of the project using a prediction model trained on the first part of the data.…”

Section: Defect Prediction (mentioning)

confidence: 99%

Adressing problems with external validity of repository mining studies through a smart data platform

Trautsch

Herbold

Makedonski

et al. 2016

Proceedings of the 13th International Conference on Mining Software Repositories


Research in software repository mining has grown considerably the last decade. Due to the data-driven nature of this venue of investigation, we identified several problems within the current state-of-the-art that pose a threat to the external validity of results. The heavy re-use of data sets in many studies may invalidate the results in case problems with the data itself are identified. Moreover, for many studies data and/or the implementations are not available, which hinders a replication of the results and, thereby, decreases the comparability between studies. Even if all information about the studies is available, the diversity of the used tooling can make their replication even then very hard. Within this paper, we discuss a potential solution to these problems through a cloud-based platform that integrates data collection and analytics. We created the prototype SmartSHARK that implements our approach. Using SmartSHARK, we collected data from several projects and created different analytic examples. Within this article, we present SmartSHARK and discuss our experiences regarding the use of SmartSHARK and the mentioned problems.

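The citation statement above describes the evaluation setup attributed to Tan et al.: train on the first part of a project's history, leave a gap, and predict the remainder. A minimal sketch of such a time-ordered split follows. The function name `split_with_gap` and the fractions used are hypothetical and chosen only for illustration; the quoted statement gives no sizes, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_with_gap(
    changes: Sequence[T], train_fraction: float, gap_fraction: float
) -> Tuple[List[T], List[T]]:
    """Split chronologically ordered changes into a training part and a test part.

    The first `train_fraction` of the changes is used for training, the next
    `gap_fraction` is skipped, and everything after the gap is used for testing.
    Both fractions are illustrative assumptions, not values from the paper.
    """
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    if not 0 <= gap_fraction < 1:
        raise ValueError("gap_fraction must be in [0, 1)")
    if train_fraction + gap_fraction >= 1:
        raise ValueError("no changes would be left for testing")

    n = len(changes)
    train_end = int(n * train_fraction)            # end of the training part
    test_start = train_end + int(n * gap_fraction)  # first change after the gap
    return list(changes[:train_end]), list(changes[test_start:])


# Ten changes, already sorted oldest to newest (placeholder IDs).
changes = list(range(10))
train, test = split_with_gap(changes, train_fraction=0.5, gap_fraction=0.2)
```

The input must already be sorted by commit time; shuffling before the split would let later changes leak into the training part.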

“…For addressing the class imbalance problem in fault prediction, numerous methods have been developed at data and algorithm levels. Data‐level methods include a variety of resampling techniques, such as random undersampling, random oversampling, and SMOTE (Synthetic Minority Over‐sampling TEchnique).…”

Section: Related Work (mentioning)

confidence: 99%

Heterogeneous fault prediction with cost‐sensitive domain adaptation

Jing

Zhu

2018

Software Testing Verif & Rel


Summary: In the early phases of software testing, projects may have only limited historical defect data. Learning prediction model with such insufficient training data will limit the efficacy of learned predictor. In practice, there are usually many publicly available fault prediction datasets. Recently, heterogeneous fault prediction (HFP) has been proposed. However, existing HFP models do not investigate how to use mixed project data to predict target. Furthermore, defect data are often imbalanced. The imbalanced data distribution of source usually leads to serious misclassification of fault‐prone instances, which will degrade the predictor's performance. Existing HFP methods do not consider the class imbalance problem in the training stages. In this paper, we propose a novel Cost‐sensitive Label and Structure‐consistent Unilateral Projection (CLSUP) approach for HFP. CLSUP can not only make better use of the within‐project and cross‐project data but also alleviate the class imbalance problem by setting different misclassification costs for fault‐prone and non–fault‐prone instances. Extensive experiments on 30 projects demonstrate the effectiveness of CLSUP.

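Of the data-level resampling techniques named in the citation statement above, random oversampling is the simplest to illustrate. The sketch below duplicates randomly chosen minority-class rows until every class matches the size of the largest one. It is plain random oversampling, not SMOTE (which synthesizes new points instead of copying existing ones), and the function name `random_oversample` and the toy data are assumptions made for this example.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    rows: Sequence[Sequence[float]], labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List[Sequence[float]], List[Hashable]]:
    """Balance classes by duplicating randomly chosen rows of the smaller classes."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    counts = Counter(labels)
    target = max(counts.values())      # size of the largest class

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # All original rows that carry this label.
        pool = [row for row, y in zip(rows, labels) if y == label]
        # Draw with replacement until the class reaches the target size.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data: 8 clean changes (label 0) and 2 buggy changes (label 1).
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Resampling of this kind is normally applied to the training data only, so that the test data keeps its original class distribution.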
