Conventional approaches to tackling malware attacks have proven to be futile at detecting never-before-seen (zero-day) malware. Research however has shown that zero-day malicious files are mostly semantic-preserving variants of already existing malware, which are generated via obfuscation methods. In this paper we propose and evaluate a machine learning based malware detection model using ensemble approach. We employ a strategy of ensemble where multiple feature sets generated from different n-gram sizes of opcode sequences are trained using a single classifier. Model predictions on the trained multi feature sets are weighted and combined on average to make a final verdict on whether a binary file is malicious or benign. To obtain optimal weight combination for the ensemble feature sets, we applied a grid search on a set of pre-defined weights in the range 0 to 1. With a balanced dataset of 2000 samples, an ensemble of n-gram opcode sequences of n sizes 1 and 2 with respective weight pair 0.3 and 0.7 yielded the best detection accuracy of 98.1% using random forest (RF) classifier. Ensemble n-gram sizes 2 and 3 obtained 99.7% as best precision using weight 0.5 for both models.
With the record of the highest market share of mobile operating systems, the Android operating system has become a prime target for cyber perpetrators as malicious applications are leveraged as attack vectors to exploit Android systems. Machine learning detection solutions that have become a resort mostly rely on handcrafted features, a process deemed to be laborious and time-consuming. In this article, we employ a deep learning-based model consisting of 1-dimensional convolutional neural network (1D CNN) to automate the detection of Android malware. Our choice of 1D CNN was motivated by the computational advantage of 1D convolution operations over 2D CNN. The proposed model automatically extracts features from semantically embedded n-grams of raw static operation code (opcodes) sequences to determine the maliciousness of a binary file. Predictions of the 1D CNN model trained on multiple feature sets of n-gram opcode sequences are combined using a weighted average ensemble. Optimal prediction weights were obtained using a grid search on values in the range 0 to 1. With an Android dataset comprising 4951 malware and 2477 benign samples, our model yielded a positive predictive value of 98% and sensitivity of 97% using a weight parity of 0.5 for ensemble unigram and bigram opcode sequences.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.