Recent advances in the domain of software defect prediction (SDP) include the integration of multiple classification techniques to create an ensemble or hybrid approach. This technique was introduced to improve prediction performance by overcoming the limitations of any single classification technique. This research provides a systematic literature review on the use of the ensemble learning approach for software defect prediction. The review is conducted after critically analyzing research papers published since 2012 in four well-known online libraries: ACM, IEEE, Springer Link, and Science Direct. In this study, five research questions covering different aspects of research progress on the use of ensemble learning for software defect prediction are addressed. To answer these questions, the 46 most relevant papers are shortlisted after a thorough systematic research process. This study provides compact information regarding the latest trends and advances in ensemble learning for software defect prediction and offers a baseline for future innovations and further reviews. Through our study, we discovered that the ensemble methods most frequently employed by researchers are random forest, boosting, and bagging. Less frequently employed methods include stacking, voting, and Extra Trees. Researchers proposed many promising frameworks using ensemble learning methods, such as EMKCA, SMOTE-Ensemble, MKEL, SDAEsTSE, TLEL, and LRCR. AUC, accuracy, F-measure, recall, precision, and MCC were the measures most often used to assess the prediction performance of models. WEKA was widely adopted as a platform for machine learning. Many researchers showed through empirical analysis that feature selection and data sampling were important pre-processing steps that improve the performance of ensemble classifiers.
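Several of the performance measures named above (accuracy, precision, recall, F-measure, and MCC) can be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `confusion_metrics` and the example counts are illustrative assumptions and do not come from the reviewed papers. AUC is omitted because it requires ranked scores rather than counts.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute common binary-classification measures from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives, with "defective" treated as the positive class.
    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient: ranges from -1 to +1, 0 means chance level.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for an imbalanced defect dataset.
    for name, value in confusion_metrics(tp=8, fp=2, tn=85, fn=5).items():
        print(f"{name}: {value:.3f}")
```

On imbalanced data, accuracy alone can look high even when few defective modules are found, which is one reason measures such as F-measure and MCC are reported alongside it.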
Persistent and quality graduation rates of students are increasingly important indicators of progressive and effective educational institutions. Timely analysis of students' data to guide instructors in providing academic interventions to students who are at risk of performing poorly in their courses or dropping out is vital for academic achievement. In addition, there is a need to mine relationships among performance attributes in order to generate comprehensible patterns. However, there is a dearth of knowledge on predicting students' performance from such patterns. This paper therefore adopts hierarchical cluster analysis (HCA) to analyze a students' performance dataset, in order to discover the optimal number of failed-course clusters and partition the courses into groups, and association rule mining to extract interesting course-status associations. Agglomerative HCA with Ward's linkage method produced the best clustering structure (five clusters), with a coefficient of 92% and a silhouette width of 0.57. The Apriori algorithm with support (0.5%), confidence (80%), and lift (1) thresholds was used to extract rules with the student's status as the consequent. Out of the twenty-one courses taken by students in the first year, seven courses frequently occur together as failed courses, and their impact on the respective students' performance status was assessed in the rules. It is conjectured that early intervention by instructors and management of educational activities on these seven courses will improve students' learning outcomes, leading to an increased graduation rate within the minimum course duration, which is the overarching objective of higher educational institutions.
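The support, confidence, and lift thresholds mentioned above are defined per rule. As a rough illustration of how the three quantities are computed for a single rule whose consequent is a student's status, the sketch below uses a tiny made-up set of records. The course codes, status labels, and the helper name `rule_metrics` are hypothetical and are not taken from the paper; a full Apriori implementation would also enumerate frequent itemsets, which is not shown here.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    transactions: list of sets of items (one set per student record).
    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)
    lift       = confidence / P(consequent)
    """
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    n = len(transactions)
    if n == 0:
        return 0.0, 0.0, 0.0
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_c = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


if __name__ == "__main__":
    # Hypothetical records: failed courses plus a performance status item.
    records = [
        {"MTH101_fail", "PHY101_fail", "status=probation"},
        {"MTH101_fail", "PHY101_fail", "status=probation"},
        {"MTH101_fail", "status=good"},
        {"status=good"},
        {"PHY101_fail", "status=good"},
    ]
    s, c, l = rule_metrics(
        records, {"MTH101_fail", "PHY101_fail"}, {"status=probation"}
    )
    # Keep the rule only if it clears the thresholds quoted in the abstract.
    keep = s >= 0.005 and c >= 0.80 and l >= 1
    print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f} keep={keep}")
```

A lift above 1 indicates that failing the antecedent courses together occurs with the status more often than would be expected if the two were independent.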
Predicting defects at an early stage of the software development life cycle can improve the quality of the end product at lower cost. Machine learning techniques have proven to be an effective means of software defect prediction; however, class imbalance in software defect datasets is the main cause of lower and biased classifier performance. This issue can be addressed by applying resampling methods to the software defect dataset before the classification process. This research analyzes the performance of three widely used resampling techniques on the class imbalance issue in software defect prediction: "Random Under Sampling", "Random Over Sampling", and "Synthetic Minority Oversampling Technique (SMOTE)". For the experiments, 12 publicly available cleaned NASA MDP datasets are used with 10 widely used supervised machine learning classifiers. Performance is evaluated through various measures, including F-measure, accuracy, MCC, and ROC. According to the results, most of the classifiers performed better with the "Random Over Sampling" technique on many datasets.
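To make the two random resampling techniques concrete, the sketch below balances a toy labeled dataset by duplicating minority-class rows (random over sampling) or discarding majority-class rows (random under sampling). It is only an illustration under simple assumptions: the function names, the seed parameter, and the toy data are hypothetical, and it does not reproduce the tooling or settings used in the study. SMOTE is not shown because it synthesizes new minority samples by interpolation rather than by copying.

```python
import random
from collections import Counter


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels


def random_under_sample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):  # sampling without replacement
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


if __name__ == "__main__":
    # Toy data: 2 defective modules (label 1) and 8 clean modules (label 0).
    rows = list(range(10))
    labels = [1, 1] + [0] * 8
    _, over_labels = random_over_sample(rows, labels)
    _, under_labels = random_under_sample(rows, labels)
    print("over:", dict(Counter(over_labels)))
    print("under:", dict(Counter(under_labels)))
```

Resampling of this kind is normally applied to the training split only, so that the test data keeps its original class distribution.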