Summary
Minimizing overhead is key to fast, efficient malware detection on very large volumes of data. We propose a machine learning (ML) malware detection model whose preprocessing limits feature overhead. Portable executable (PE) header information, which retains meaningful and distinctive attributes, is used to classify files as benign or malware. The dataset is preprocessed with transformation, outlier detection and filling, and smoothing techniques. A maximum relevance minimum redundancy (mRMR) feature selection method assigns a rank and score to each feature, retaining maximally relevant and minimally redundant information. Based on the resulting ranks, multiple feature subsets are created and evaluated with support vector machine (SVM) and k-nearest neighbors (k-NN) classifiers under parameter tuning. The proposed ML model, which integrates data preprocessing, feature selection, and an SVM classifier with a polynomial kernel, achieves superior performance: it eliminates 63.8% of the feature overhead while maintaining accuracy above 99.1% on the benchmark datasets. To examine the robustness of the proposed model, new balanced and imbalanced datasets are created from new malware. The test results are encouraging, with accuracy and specificity above 96.68%, 97.65%, and 91.57%, respectively. Notably, the proposed model was not trained on the newly created datasets.
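The ranking step described above can be illustrated with a minimal sketch of maximum relevance minimum redundancy selection. The summary does not say which variant or relevance measure is used, so this sketch assumes the common mutual-information "difference" form (relevance minus mean redundancy) on already-discretized features. The feature names and toy values are hypothetical and are not taken from the benchmark datasets.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(columns, labels):
    """Greedy ranking: at each step pick the feature with the highest
    relevance I(f; y) minus mean redundancy with the features picked so far."""
    relevance = {name: mutual_information(col, labels) for name, col in columns.items()}
    remaining = sorted(columns)  # sorted so ties break deterministically by name
    ranking = []
    while remaining:
        chosen = [name for name, _ in ranking]

        def score(name):
            if not chosen:
                return relevance[name]
            redundancy = sum(
                mutual_information(columns[name], columns[s]) for s in chosen
            ) / len(chosen)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        ranking.append((best, score(best)))
        remaining.remove(best)
    return ranking


def top_k_subsets(ranking):
    """Nested feature subsets (top-1, top-2, ...) built from the ranking."""
    names = [name for name, _ in ranking]
    return [names[:k] for k in range(1, len(names) + 1)]


# Hypothetical, pre-binned PE header features (0/1) for eight files.
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = benign, 1 = malware
columns = {
    "size_of_code_bin": [0, 0, 0, 1, 1, 1, 1, 1],  # informative
    "size_of_code_dup": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate (redundant)
    "num_sections_bin": [0, 0, 1, 0, 0, 1, 1, 1],  # weaker but complementary
    "checksum_parity":  [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the label
}

ranking = mrmr_rank(columns, labels)
subsets = top_k_subsets(ranking)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

On this toy data the duplicated feature is ranked below the weaker but complementary one, which is the redundancy penalty at work. In a pipeline like the one summarized, each nested subset would then be passed to the SVM and k-NN classifiers for comparison; that step is omitted here.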