Summary
Minimizing overhead is key to fast and efficient malware detection on very large volumes of data. We propose a machine learning (ML) malware detection model whose preprocessing limits the feature overhead. Portable executable (PE) header information, which retains meaningful and distinctive information, is used to classify files as benign or malware. The dataset is preprocessed with transformation, outlier detection and filling, and smoothing techniques. A maximum relevance minimum redundancy-based feature selection method assigns a rank and a score to each feature, favoring features that are maximally relevant and minimally redundant. Based on the obtained ranks, many feature subsets are created and evaluated with support vector machine (SVM) and k-nearest neighbors (k-NN) classifiers under parameter tuning. The proposed ML model, which integrates data preprocessing, feature selection, and an SVM classifier with a polynomial kernel, performs best. It eliminates 63.8% of the feature overhead while keeping accuracy above 99.1% on the benchmark datasets. To examine the robustness of the proposed model, new balanced and imbalanced datasets are created from new malware. The test results are encouraging, with accuracy and specificity above 96.68%, 97.65%, and 91.57%, respectively. Notably, the proposed model is not trained on the newly created datasets.
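The ranking step described above can be illustrated with a minimal sketch of greedy maximum relevance minimum redundancy (mRMR) selection. The summary does not state which relevance and redundancy measures the authors use, so this sketch assumes absolute Pearson correlation for both and the "relevance minus mean redundancy" scoring variant. The feature names and values are hypothetical toy data, not PE header fields from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mrmr_rank(columns, labels):
    """Greedy mRMR ranking.

    columns: dict mapping feature name -> list of numeric values.
    labels:  list of class labels (0 = benign, 1 = malware).
    Returns [(feature_name, score), ...] in selection order.
    Assumption: |Pearson r| stands in for both relevance and redundancy.
    """
    relevance = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    ranked = []
    remaining = list(columns)
    while remaining:
        scores = {}
        for name in remaining:
            if ranked:
                # Mean absolute correlation with the features already selected.
                redundancy = sum(
                    abs(pearson(columns[name], columns[chosen])) for chosen, _ in ranked
                ) / len(ranked)
            else:
                redundancy = 0.0
            scores[name] = relevance[name] - redundancy
        best = max(remaining, key=scores.get)
        ranked.append((best, scores[best]))
        remaining.remove(best)
    return ranked


# Hypothetical toy data: one strong feature, a scaled duplicate of it,
# a weaker but complementary feature, and an uninformative one.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_columns = {
    "feat_strong": [0, 0, 0, 0, 1, 1, 1, 0],
    "feat_strong_scaled": [0, 0, 0, 0, 2, 2, 2, 0],
    "feat_complementary": [0, 0, 1, 0, 1, 1, 0, 1],
    "feat_noise": [1, 0, 1, 0, 1, 0, 1, 0],
}

ranking = mrmr_rank(toy_columns, labels)
for rank, (name, score) in enumerate(ranking, start=1):
    print(rank, name, round(score, 3))
```

Taking the first k names from such a ranking gives one candidate feature subset. In this toy data the scaled duplicate is ranked below the complementary feature even though it is more relevant on its own, which is the redundancy penalty at work.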
Portable executable header (PEH) information is commonly used as a feature set for malware detection systems to train and validate machine learning (ML) or deep learning (DL) classifiers. We propose extracting deep features from the PEH information through the hidden layers of a feed-forward deep neural network (FFDNN). The deep features extracted from the hidden layers represent the dataset with better generalization for malware detection. The Gaussian error linear unit (GeLU) activation function is applied as the deep features of one hidden layer are fed to the succeeding layer. The FFDNN is trained with the GeLU activation function using the deep features of individual layers as well as the concatenated deep features of all hidden layers. Similarly, the ML classifiers are trained and validated with individual-layer deep features and with concatenated features. Three highly effective ML classifiers are investigated: random forest (RF), support vector machine (SVM), and k-nearest neighbour (k-NN). The performance of the proposed model is demonstrated on a statistically significant large dataset. The results are encouraging: classification accuracy reaches 99.15% with the internal discriminative deep features for the proposed FFDNN-ML classifier with the GeLU activation function.
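As a minimal sketch of the feature-extraction idea above, the following code runs one input vector through a small feed-forward network with GeLU activations, collecting the activations of each hidden layer and their concatenation. The layer sizes, the random untrained weights, and the input values are all hypothetical. The abstract does not give the authors' architecture or training procedure, so this shows only the forward pass, not their model.

```python
import math
import random


def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dense(inputs, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random, untrained weights (an assumption made purely for illustration)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-1.0, 1.0) * scale for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def deep_features(x, layers):
    """Return (per-layer hidden activations, concatenation of all of them)."""
    per_layer = []
    hidden = x
    for weights, biases in layers:
        # GeLU is applied before the features are passed to the next layer.
        hidden = [gelu(z) for z in dense(hidden, weights, biases)]
        per_layer.append(hidden)
    concatenated = [value for feats in per_layer for value in feats]
    return per_layer, concatenated


rng = random.Random(0)
input_dim = 5              # hypothetical number of PEH-derived inputs
hidden_sizes = [8, 4, 2]   # hypothetical hidden-layer widths

layers = []
previous = input_dim
for size in hidden_sizes:
    layers.append(init_layer(previous, size, rng))
    previous = size

sample = [0.2, -1.3, 0.7, 0.0, 2.1]  # hypothetical scaled header values
per_layer, concatenated = deep_features(sample, layers)
print([len(feats) for feats in per_layer], len(concatenated))
```

Either an entry of `per_layer` or the `concatenated` vector could then be passed to a separate classifier such as RF, SVM, or k-NN, which is the pairing the abstract describes.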