Mortality from heart disease remains very high, so prevention efforts need to be strengthened, for example by improving the performance of prediction models. Applications of machine learning methods to the public datasets commonly used by researchers for heart disease prediction (Cleveland, Hungary, Switzerland, VA Long Beach, and Statlog), including the development of supporting tools, still do not handle missing values, noisy data, unbalanced classes, or even data validation efficiently. Therefore, mean/mode imputation is proposed for missing value replacement, Min-Max Normalization for smoothing noisy data, K-Fold Cross Validation for data validation, and an ensemble approach using the Weighted Vote (WV) method, which combines the performance of the individual machine learning methods to make the classification decision while also reducing the unbalanced class problem. The results show that the proposed method achieves an accuracy of 85.21%, improving on the accuracy of the individual machine learning methods by 7.14% over Artificial Neural Network, 2.77% over Support Vector Machine, 0.34% over C4.5, 2.94% over Naïve Bayes, and 3.95% over k-Nearest Neighbor.
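The pre-processing and voting steps named in this abstract can be illustrated with a minimal pure-Python sketch. The abstract does not say how the vote weights are derived, so the function names, the dictionary layout, and the weights in the demo are all hypothetical; this is not the authors' implementation.

```python
from collections import Counter, defaultdict


def impute_mean(column):
    """Replace None entries in a numeric column with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def impute_mode(column):
    """Replace None entries in a categorical column with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def min_max_normalize(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight.

    predictions: {classifier_name: predicted_label}
    weights:     {classifier_name: non-negative weight}  (ASSUMPTION: supplied by the caller)
    """
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    return max(totals, key=totals.get)


# Illustrative values only; none of these numbers come from the paper.
ages = impute_mean([50, None, 70])        # [50, 60.0, 70]
scaled = min_max_normalize(ages)          # [0.0, 0.5, 1.0]
label = weighted_vote(
    {"ann": "sick", "svm": "healthy", "c45": "healthy"},
    {"ann": 0.5, "svm": 0.3, "c45": 0.4},
)                                         # "healthy" (0.7 total weight vs 0.5)
```

One common choice, not confirmed by the abstract, is to use each classifier's cross-validated accuracy as its weight.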
Breast cancer is the most common cancer in women, and its death rate still ranks second among cancers. Related studies often report high accuracy for their proposed machine learning approaches. However, without efficient pre-processing, the proposed breast cancer prediction models remain in question. Therefore, this research aims to improve the accuracy of machine learning methods through more efficient pre-processing for breast cancer prediction: Missing Value Replacement, Data Transformation, Smoothing Noisy Data, Feature Selection / Attribute Weighting, Data Validation, and Unbalanced Class Reduction. The study proposes several approaches: C4.5 - Z-Score - Genetic Algorithm for the Breast Cancer Dataset with 77.27% accuracy, 7-Nearest Neighbor - Min-Max Normalization - Particle Swarm Optimization for the Wisconsin Breast Cancer Dataset - Original with 97.85% accuracy, Artificial Neural Network - Z-Score - Forward Selection for the Wisconsin Breast Cancer Dataset - Diagnostics with 98.24% accuracy, and 11-Nearest Neighbor - Min-Max Normalization - Particle Swarm Optimization for the Wisconsin Breast Cancer Dataset - Prognostic with 83.33% accuracy. These approaches perform better than standard machine learning methods and than the best methods proposed in previous related studies.
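Two of the building blocks named in this abstract, Z-Score transformation and k-Nearest Neighbor classification, can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, Euclidean distance and the population standard deviation are assumed, and the Genetic Algorithm, Particle Swarm Optimization, and Forward Selection steps are not shown.

```python
import math
from collections import Counter


def z_score(column):
    """Standardize a numeric column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - mean) / std for v in column]


def knn_predict(train_rows, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training rows.

    ASSUMPTION: Euclidean distance; the abstract does not name the metric.
    """
    order = sorted(range(len(train_rows)), key=lambda i: math.dist(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data for illustration only; not taken from any of the datasets named above.
rows = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
labels = ["benign", "benign", "malignant", "malignant", "malignant"]
prediction = knn_predict(rows, labels, [5, 5], k=3)
```

An odd k, such as the 7 and 11 reported above, avoids tied votes in two-class problems.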
The Naïve Bayes (NB) algorithm is still among the top ten Data Mining algorithms because of its simplicity, efficiency, and performance. To handle classification on numerical data, the Gaussian distribution and kernel approaches can be applied to NB (GNB and KNB). However, NB treats attributes as independent during classification, an assumption that does not hold in many cases. The Absolute Correlation Coefficient can determine correlations between attributes and works on numerical attributes, so it can be applied for attribute weighting in GNB (ACW-NB). Furthermore, because the performance of NB does not increase on large datasets, ACW-NB can serve as the classifier in a local learning model, where another classification method well known in local learning, such as K-Nearest Neighbor (K-NN), is used to obtain the sub-dataset for ACW-NB training. To reduce noise/bias, missing value replacement and data normalization can also be applied. The proposed method is termed "LL-KNN ACW-NB (Local Learning K-Nearest Neighbor in Absolute Correlation Weighted Naïve Bayes)," with the objective of improving the performance of NB (GNB and KNB) on numerical data classification. The results indicate that LL-KNN ACW-NB improves the performance of NB, with an average accuracy of 91.48%, 1.92% better than GNB and 2.86% better than KNB.
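The general idea of local learning with a correlation-weighted Gaussian Naïve Bayes can be sketched as below. The abstract does not give the weighting formula, so this sketch makes two loud assumptions: each attribute's weight is the absolute Pearson correlation between that attribute and a 0/1 class label, and the weight multiplies the attribute's log-likelihood. All function names are hypothetical, and this should not be read as the authors' LL-KNN ACW-NB.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def attribute_weights(rows, labels):
    """ASSUMPTION: weight of attribute j = |Pearson r| between column j and the 0/1 label."""
    return [abs(pearson([r[j] for r in rows], labels)) for j in range(len(rows[0]))]


def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def weighted_gnb_predict(rows, labels, query, weights, eps=1e-9):
    """Gaussian NB in which each attribute's log-likelihood is scaled by its weight."""
    best_label, best_score = None, -math.inf
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        score = math.log(len(members) / len(rows))  # log prior of class c
        for j, w in enumerate(weights):
            col = [r[j] for r in members]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + eps  # eps avoids zero variance
            score += w * gaussian_log_pdf(query[j], mean, var)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


def ll_knn_acw_nb_predict(rows, labels, query, k):
    """Fit the weighted GNB only on the k training rows nearest to the query (Euclidean)."""
    order = sorted(range(len(rows)), key=lambda i: math.dist(rows[i], query))[:k]
    local_rows = [rows[i] for i in order]
    local_labels = [labels[i] for i in order]
    weights = attribute_weights(local_rows, local_labels)
    return weighted_gnb_predict(local_rows, local_labels, query, weights)


# Toy data for illustration only: attribute 0 separates the classes, attribute 1 barely does.
rows = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.5], [3.2, 3.5], [2.8, 5.5]]
labels = [0, 0, 0, 1, 1, 1]
prediction = ll_knn_acw_nb_predict(rows, labels, [1.1, 4.0], k=6)
```

If the k nearest neighbors all share one class, the sketch simply returns that class, since no other class has training rows in the local sub-dataset.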