When analyzing defects in crowdsourced testing, the test reports need to be preprocessed, which includes removing duplicates and false positives. At present, most crowdsourced testing research focuses on detecting duplicate reports, where high accuracy has been achieved. However, reducing false positives among defects has rarely been studied. Starting from the duplication of defects in the reports, this paper discusses the relationship between the duplication and the accuracy of defects and proposes an estimation approach based on the defect distribution in historical crowdsourced testing projects. Experiments show that our approach can provide a priori knowledge of defects and exhibits good stability. We applied this approach to defect population estimation in crowdsourced testing, where the improved model proved more accurate than the original model.
INDEX TERMS Precision of defect duplication; crowdsourced testing; a priori distribution of defect precision.
Estimating the defect population is a key reference for evaluating completion in crowdsourced testing. Current studies commonly apply the capture–recapture (CRC) model from biology to estimate the defect population in crowdsourced tests, with good results. However, existing studies do not consider false negatives among defects, which makes the completion estimate imprecise and in turn affects decision support during crowdsourced testing. This study analyzes the basic assumptions of the CRC model in biology and maps them to crowdsourced testing, improves the Lincoln–Peterson estimator and the sample coverage estimator using false negative assumptions, and compares them with the original estimators on 29 real crowdsourced testing projects. The experimental results show that the improved estimators effectively reduce the relative error in defect population estimates caused by false negatives.
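For orientation, the classical two-sample Lincoln–Peterson estimator (more commonly spelled Lincoln–Petersen) can be sketched as follows. This shows only the standard, unimproved estimator and Chapman's well-known bias-corrected variant; it does not reproduce the false-negative correction proposed in the abstract. Treating two testers' reports as the two "capture occasions", along with the function names and defect IDs, is an assumption made for illustration.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Classical estimate N = n1 * n2 / m.

    n1: defects found in the first sample (e.g. by tester A)
    n2: defects found in the second sample (e.g. by tester B)
    m:  defects found in both samples
    """
    if m == 0:
        raise ValueError("no overlap between samples; estimate is undefined")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def estimate_from_reports(found_a: set, found_b: set) -> float:
    """Hypothetical helper: derive n1, n2, m from two sets of defect IDs."""
    return lincoln_petersen(len(found_a), len(found_b), len(found_a & found_b))


# Illustrative data only: tester A reports 4 defects, tester B reports 3,
# and 2 of them are shared, so the estimated population is 4 * 3 / 2.
tester_a = {"D1", "D2", "D3", "D4"}
tester_b = {"D3", "D4", "D5"}
estimated_total = estimate_from_reports(tester_a, tester_b)
```

The estimate rises as the overlap `m` shrinks, so a defect that one tester misses (a false negative) lowers `m` and inflates the estimated population.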
Software defect prediction (SDP) is an important technology that is widely applied to improve software quality and reduce development costs. It is difficult to train an SDP model when the software to be tested has only limited historical data. Cross-project defect prediction (CPDP) has been proposed to solve this problem by using source project data to train the defect prediction model. Most CPDP methods build defect prediction models based on the similarity of the feature space or the data distance between projects. However, when the target project has a small amount of labeled data, these methods usually do not make use of it. As a result, when the distributions of the source project and the target project differ substantially, these methods struggle to achieve good prediction performance. To solve this problem, this paper proposes a CPDP method based on semisupervised clustering, named Tsbagging. Tsbagging has two stages. In the first stage, we cluster the source project data based on the limited labeled data in the target project and assign different weights to the source project data according to the clustering results. In the second stage, we use the bagging method to train the prediction model based on the weights assigned in the first stage. The experimental results show that Tsbagging performs better than other existing SDP methods.
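The two-stage idea can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the weighting rule, or the base learner, so everything here is an assumption: labeled target instances seed one centroid per class, a source instance gets a high weight when its own label matches its nearest centroid and a low weight otherwise, and the weights then drive bootstrap sampling for bagging. The function names and the weight values (`agree`, `disagree`) are hypothetical, and training a classifier on each bag is left out.

```python
import random


def centroid(rows):
    """Mean vector of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stage1_weights(source_X, source_y, target_X, target_y,
                   agree=1.0, disagree=0.2):
    """Stage 1 (assumed rule): weight each source instance by whether its
    label matches the class of the nearest target-seeded centroid."""
    centroids = {
        label: centroid([x for x, y in zip(target_X, target_y) if y == label])
        for label in set(target_y)
    }
    weights = []
    for x, y in zip(source_X, source_y):
        nearest = min(centroids, key=lambda c: sq_dist(x, centroids[c]))
        weights.append(agree if nearest == y else disagree)
    return weights


def stage2_bags(source_X, source_y, weights, n_bags=5, seed=0):
    """Stage 2 (assumed rule): draw weighted bootstrap samples; a base
    learner would be trained on each bag and the votes combined."""
    rng = random.Random(seed)
    indices = range(len(source_X))
    bags = []
    for _ in range(n_bags):
        chosen = rng.choices(indices, weights=weights, k=len(source_X))
        bags.append([(source_X[i], source_y[i]) for i in chosen])
    return bags


# Illustrative data only: label 1 = defective, 0 = clean.
source_X = [[0, 0], [1, 1], [9, 9], [0.5, 0.5]]
source_y = [0, 0, 1, 1]
target_X = [[0, 1], [10, 10]]
target_y = [0, 1]

weights = stage1_weights(source_X, source_y, target_X, target_y)
bags = stage2_bags(source_X, source_y, weights)
```

In this toy data the last source instance is labeled defective but lies next to the clean target centroid, so it is down-weighted and appears less often in the bootstrap samples.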