The role of information technology assets in both the daily lives of individuals and the operations of institutions and organizations has grown rapidly over the last quarter century, and threats against information assets have grown in parallel. Malware is one of the main threats to information assets. Because traditional detection approaches fall short against malware that continuously renews itself, detection approaches that use machine learning models have been developed. In this study, we examined the performance of different machine learning algorithms used for malware detection on various big data technologies and platforms. The models were trained on the Kaggle Malware Detection dataset. The best accuracy (98.8%), precision (98.5%), f1 score (98.2%), and false positive rate (2%) were obtained with a random forest model run with the scikit-learn library in the Google Colaboratory environment.
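For readers who want to see how the four reported metrics relate to one another, the following is a minimal sketch that derives them from a binary confusion matrix. It assumes label 1 means malware and label 0 means benign; the names `ConfusionCounts`, `count_outcomes`, and `metrics` are illustrative and do not come from the study, and the sample labels are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # malware correctly flagged
    fp: int  # benign files wrongly flagged
    tn: int  # benign files correctly passed
    fn: int  # malware missed


def count_outcomes(y_true, y_pred):
    """Tally outcomes, assuming label 1 = malware and label 0 = benign."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def metrics(c):
    """Return accuracy, precision, f1 score, and false positive rate."""
    total = c.tp + c.fp + c.tn + c.fn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Share of benign files that were wrongly flagged as malware.
    fpr = c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "f1": f1, "false_positive_rate": fpr}


if __name__ == "__main__":
    # Made-up labels for illustration only, not data from the study.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(metrics(count_outcomes(y_true, y_pred)))
```

Note that a lower false positive rate is better, whereas the other three metrics are better when higher.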
Malicious software is one of the most common threats to the digital world. It is of great importance to detect and prevent existing and new malware before it damages information assets. Machine learning approaches are used effectively for this purpose. In this study, we present a model in which supervised and unsupervised learning algorithms are used together. Clustering is used to enhance the prediction performance of the supervised classifiers. The aim of the proposed model is to make predictions in the shortest possible time with high accuracy and f1 score. In the first stage of the model, the data are clustered with the k-means algorithm. In the second stage, each prediction is made with the classifier combination that performs best for the relevant cluster. To choose the best classifiers for each cluster, triple combinations of ten machine learning algorithms (kernel support vector machine, k-nearest neighbor, naïve Bayes, decision tree, random forest, extreme gradient boosting, categorical boosting, adaptive boosting, extra trees, and gradient boosting) are evaluated. The selected triple classifier combination is positioned in two tiers. The prediction time of the model is improved by positioning the classifier with the highest prediction time in the second tier. Experiments on the Blue Hexagon Open Dataset for Malware Analysis (BODMAS), Elastic Malware Benchmark for Empowering Researchers (EMBER) 2018, and Kaggle malware detection datasets show that clustering before classification improves prediction performance.
The model achieves 99.74% accuracy and 99.77% f1 score on the BODMAS dataset, 99.04% accuracy and 98.63% f1 score on the Kaggle malware detection dataset, and 96.77% accuracy and 96.77% f1 score on the EMBER 2018 dataset. In addition, the tiered positioning of classifiers shortened the average prediction time by 76.13% for the BODMAS dataset and 95.95% for the EMBER 2018 dataset. The proposed method outperforms the other studies in the literature that use the BODMAS and EMBER 2018 datasets.
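The cluster-then-classify idea with tiered classifiers can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the three classifiers' outputs are combined, so the sketch assumes a majority vote in which the slowest classifier is consulted only when the two faster ones disagree. It also assumes the k-means centroids were fitted beforehand. The names `nearest_cluster`, `tiered_predict`, and `predict`, and the toy rule-based classifiers, are all hypothetical.

```python
import math
from typing import Callable, Sequence

# A classifier maps a feature vector to 1 (malware) or 0 (benign).
Classifier = Callable[[Sequence[float]], int]


def nearest_cluster(x, centroids):
    """Index of the closest centroid, as a fitted k-means assigns a sample."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(x, centroids[i]))


def tiered_predict(x, fast_a: Classifier, fast_b: Classifier,
                   slow: Classifier) -> int:
    """Assumed two-tier majority vote over three classifiers.

    Tier 1 runs the two faster classifiers. If they agree, the majority
    of three is already decided, so the slow classifier is skipped.
    Tier 2 runs the slow classifier only to break a disagreement.
    """
    a, b = fast_a(x), fast_b(x)
    if a == b:
        return a
    return slow(x)


def predict(x, centroids, combos) -> int:
    """Route x to its cluster, then apply that cluster's classifier triple."""
    cluster = nearest_cluster(x, centroids)
    fast_a, fast_b, slow = combos[cluster]
    return tiered_predict(x, fast_a, fast_b, slow)


if __name__ == "__main__":
    # Toy centroids and rule-based stand-ins for trained classifiers.
    centroids = [(0.0, 0.0), (10.0, 10.0)]
    combos = [
        (lambda x: 0, lambda x: 0, lambda x: 1),           # cluster 0
        (lambda x: 0, lambda x: 1, lambda x: int(x[0] > 8)),  # cluster 1
    ]
    print(predict((1.0, 1.0), centroids, combos))
    print(predict((9.0, 9.0), centroids, combos))
```

Under this assumed voting scheme, the result always equals the plain three-way majority vote, while the slowest classifier runs only on the samples where the two faster ones disagree. This is one plausible way in which tiered positioning could reduce average prediction time.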