Breast cancer, the second leading cause of fatality among women, is a widespread global health concern. Predicting the occurrence of breast cancer from its risk factors paves the way to earlier diagnosis and faster, more efficient treatment. Although many predictive models for breast cancer have been developed in the past, most of them were generated from highly imbalanced data. Models trained on imbalanced data are usually biased towards the majority class, but in cancer diagnosis it is crucial to correctly identify the patients with cancer, who are oftentimes the minority class. This study applies three different class balancing techniques, namely oversampling (Synthetic Minority Oversampling Technique (SMOTE)), undersampling (SpreadSubsample), and a hybrid method (SMOTE and SpreadSubsample), to the Breast Cancer Surveillance Consortium (BCSC) dataset before constructing the supervised learning models. The algorithms employed in this study are Naïve Bayes, Bayesian Network, Random Forest, and Decision Tree (C4.5). The balancing method that yielded the best performance across all four classifiers was tested on the validation data to determine the final predictive model. The performance of the classifiers was evaluated using the Receiver Operating Characteristic (ROC) curve, sensitivity, and specificity.
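The balancing and evaluation steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the study's pipeline: `smote_like` only mimics the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), `undersample` is a plain random undersampler with a cap on the class ratio rather than the SpreadSubsample filter itself, and all function names, parameters, and toy data are assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def undersample(majority, minority_size, max_ratio=1.0, seed=0):
    """Randomly keep at most max_ratio * minority_size majority samples."""
    rng = random.Random(seed)
    keep = min(len(majority), int(max_ratio * minority_size))
    return rng.sample(majority, keep)


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 4 minority samples versus 20 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 5)) for i in range(3, 23)]

# Hybrid balancing: oversample the minority, then undersample the majority.
synthetic = smote_like(minority, n_new=4)
balanced_minority = minority + synthetic
balanced_majority = undersample(majority, len(balanced_minority))

# Evaluation on made-up predictions: 2 of 3 positives and 3 of 4 negatives
# are classified correctly.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0],
                                     [1, 1, 0, 0, 0, 0, 1])
```

Sensitivity is the quantity that matters most for the minority (cancer) class here, which is why accuracy alone would be a misleading metric on imbalanced data.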
The survivability of patients suffering from breast cancer varies with the stage of the disease. Early detection of breast cancer increases the longevity of patients. However, the number of risk factors involved in detection grows exponentially with the medical examinations performed. The need for automated data mining techniques that enable cost-effective and early prediction of cancer is rapidly becoming a trend in the healthcare industry. The optimal techniques for prediction and diagnosis differ significantly depending on the risk factors. This review article provides a holistic view of the types of data mining techniques used in the prediction of breast cancer. On the whole, the computer-aided automatic data mining techniques commonly employed in the diagnosis and prognosis of chronic diseases include Decision Tree, Naïve Bayes, Association Rule, Multilayer Perceptron (MLP), Random Forest, and Support Vector Machines (SVM), among others. The accuracy and overall performance of the classifiers differ for every dataset, and this article therefore attempts to provide a means of understanding the approaches involved in early prediction.
Graph data mining has entered a new era with advanced data mining techniques. Mining frequent subgraphs (FSG) is a crucial area because it eases the extraction of patterns from graphs. Typical graph data, such as social networks, biological networks (for metabolic pathways), and computer networks, need analysis of the virtual networks of a category. Such graphs need to be modeled as layered graphs to distinguish the categories of relationships. Traditional market basket analysis has proven its elegance in mining frequent itemsets. Combining the techniques of Apriori with collaborative mining gives rise to a new concept for mining FSG.
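To make the Apriori connection concrete, here is a minimal sketch of level-wise frequent itemset mining applied to graphs represented as sets of labeled edges. This is a deliberate simplification and not the method the abstract proposes: it ignores connectivity, layering, and subgraph isomorphism, so it finds frequent edge sets rather than true frequent subgraphs. The function name, the edge encoding, and the toy graphs are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset that appears in at
    least min_support transactions, using level-wise Apriori search."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {
            a | b for a, b in combinations(list(current), 2)
            if len(a | b) == k
        }
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: scan the transactions for each surviving candidate.
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
        frequent.update(current)
        k += 1
    return frequent


# Toy graphs (hypothetical), each encoded as a set of undirected edges with
# endpoints written in sorted order.
g1 = {("A", "B"), ("B", "C"), ("C", "D")}
g2 = {("A", "B"), ("B", "C")}
g3 = {("A", "B"), ("C", "D")}

frequent_edge_sets = apriori([g1, g2, g3], min_support=2)
```

With these three graphs and a minimum support of 2, the three-edge candidate is pruned before counting because one of its two-edge subsets is infrequent, which is the downward-closure property that Apriori relies on.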