Multiclass imbalanced data problems in data mining are currently an interesting topic of study. These problems affect the classification process in machine learning. In some cases, the minority class in a dataset carries more important information than the majority class. When the minority class is misclassified, the accuracy and the performance of the classifier suffer. In this research, the cost-sensitive decision tree C5.0 was used to address multiclass imbalanced data problems. In the first stage, the decision tree model is built with the C5.0 algorithm; cost-sensitive learning then applies the MetaCost method to obtain the minimum-cost model. In testing, the C5.0 algorithm performed better than the C4.5 and ID3 algorithms: the performance percentages of C5.0, C4.5, and ID3 were 40.91%, 40.24%, and 19.23%, respectively.
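As a rough illustration of the decision rule that cost-sensitive methods such as MetaCost rely on, the sketch below picks the class with the lowest expected misclassification cost instead of the most probable class. It is a minimal sketch under stated assumptions, not the authors' implementation: the cost matrix `COST`, the probability vector `probs`, and the helper `min_expected_cost_class` are all hypothetical, and the bagging step MetaCost uses to estimate class probabilities is omitted.

```python
from typing import Sequence


def min_expected_cost_class(
    probs: Sequence[float], cost: Sequence[Sequence[float]]
) -> int:
    """Return the class j that minimizes sum_i probs[i] * cost[i][j].

    cost[i][j] is the (assumed) cost of predicting class j when the
    true class is i; probs[i] is the estimated probability of class i.
    """
    n_classes = len(probs)
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Hypothetical 3-class cost matrix: class 2 plays the role of a rare
# minority class whose misclassification is ten times more expensive.
COST = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [10.0, 10.0, 0.0],
]

# Hypothetical class-probability estimates for one example.
probs = [0.6, 0.25, 0.15]

# Plain classification: choose the most probable class.
plain = max(range(len(probs)), key=probs.__getitem__)

# Cost-sensitive classification: choose the cheapest class in expectation.
cost_sensitive = min_expected_cost_class(probs, COST)

print(plain, cost_sensitive)
```

With these made-up numbers the most probable class is 0, but the expected costs are 1.75, 2.10, and 0.85 for classes 0, 1, and 2, so the cost-sensitive rule selects the minority class 2. MetaCost applies this rule to relabel the training examples and then retrains the base classifier on the relabeled data.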
Abstract: Knowledge discovery is the process of extracting information from data to support informed decisions. Because classifiers must learn many patterns from the data, an imbalanced dataset becomes a major classification issue. A cost-sensitive approach on the decision tree C4.5 and naive Bayes is used to reduce misclassification. The glass, lymphography, vehicle, thyroid, and wine datasets were obtained from the UCI Repository and used in this analysis. As preprocessing, attribute selection with particle swarm optimization was applied to the datasets. The datasets were then tested with the cost-sensitive decision tree C4.5 and the cost-sensitive naive Bayes methods. With the cost-sensitive decision tree C4.5, the accuracy on the glass, lymphography, vehicle, thyroid, and wine datasets was 72.34%, 68.22%, 75.68%, 93.82%, and 93.95%, respectively. With the cost-sensitive naive Bayes method, the accuracy on the same datasets was 32.24%, 82.61%, 25.53%, 97.67%, and 94.94%, respectively.
Data mining is the process of analyzing data to support fast, precise, and accurate decisions. In healthcare and manufacturing, data mining is especially important because a misclassification can have serious consequences. The main problem arises when the data are imbalanced and multiclass: the classifier struggles to classify the data, which leads to misclassification. The proposed solution for minimizing misclassification is to apply a cost-sensitive method to the decision tree C5.0 and naïve Bayes classifiers. This study uses the glass, lymphography, vehicle, thyroid, and wine datasets obtained from the UCI Repository. Attribute selection with particle swarm optimization is applied to all five datasets. The datasets are then tested with the cost-sensitive decision tree C5.0 and cost-sensitive naïve Bayes methods. With the cost-sensitive decision tree C5.0, the accuracy on the glass, lymphography, vehicle, thyroid, and wine datasets was 76.17%, 83.33%, 75.27%, 95.81%, and 95.83%, respectively. The cost-sensitive naïve Bayes method achieved accuracies of 32.24%, 82.61%, 25.53%, 97.67%, and 94.94% on the same datasets, respectively.