Toward Robust Classifiers for PDF Malware Detection

Albahar, Marwan Ali; Thanoon, Mohammed I.; Alzilai, Monaj; Alrehily, Alaa; Alfaar, Munirah; Algamdi, Maimoona; Alassaf, Norah

doi:10.32604/cmc.2021.018260

Cited by 3 publications

(2 citation statements)

References 21 publications

“…The CNN model achieved superior performance compared to traditional machine learning classifiers including SVM, Decision Tree, Naive Bayes, and Random Forest. Albahar et al. [20] presented two learning-based models for detection of malicious PDFs and experimented on 30,797 infected and benign documents collected from the Contagio dataset and VirusTotal. Their first model was a CNN model that used tree-based PDF file structure as features and yielded 99.33% accuracy; the second model was an ensemble SVM model with different kernels which used n-gram with object content encoding as features and yielded an accuracy of 97.3%.…”

Section: Related Work

mentioning

confidence: 99%

Explainable Ensemble Learning Based Detection of Evasive Malicious PDF Documents

Yerima

Bashar

2023

Electronics

PDF has become a major attack vector for delivering malware and compromising systems and networks, due to its popularity and widespread usage across platforms. PDF provides a flexible file structure that facilitates the embedding of different types of content such as JavaScript, encoded streams, images, executable files, etc. This enables attackers to embed malicious code as well as to hide their functionalities within seemingly benign non-executable documents. As a result, a large proportion of current automated detection systems are unable to effectively detect PDF files with concealed malicious content. To mitigate this problem, a novel approach is proposed in this paper based on ensemble learning with enhanced static features, which is used to build an explainable and robust malicious PDF document detection system. The proposed system is resilient against reverse mimicry injection attacks compared to the existing state-of-the-art learning-based malicious PDF detection systems. The recently released EvasivePDFMal2022 dataset was used to investigate the efficacy of the proposed system. Based on this dataset, an overall classification accuracy greater than 98% was observed with five ensemble learning classifiers. Furthermore, the proposed system, which employs new anomaly-based features, was evaluated on a reverse mimicry attack dataset containing three different types of content injection attacks, i.e., embedded JavaScript, embedded malicious PDF, and embedded malicious EXE. The experiments conducted on the reverse mimicry dataset showed that the Random Committee ensemble learning model achieved 100% detection rates for embedded EXE and embedded JavaScript, and 98% detection rate for embedded PDF, based on our enhanced feature set.

“…have presented a comprehensive survey of Malware detection approaches between signature-based, behavior-based and Machine learning based, listing the pros and cons of each approach offering the different Malware obfuscation mechanisms, concluding that signature-based is the most efficient in detecting well-known malicious codes, while behavior-based is more effective in detecting more complex Malwares and zero-day Malwares. Authors of [8] focused on the most recent PDF-Malware detection techniques, including the PDF feature extraction and analysis and surveying the variety of detection approaches including statistical analysis, which may focus on byte-level comparison between Malicious and benign PDFs to detect Malware for example, another methods of detection could be signature matching and ML classification, Marwan Albahar In [9] applied support vector machine (SVM) and convolutional neural network (CNN) machine learning classifiers on two PDF Malware datasets retrieved from Virus Total [10] consist of 10,603 malicious files (collected Dec./2017) and Contagio [11] dataset consist of approximately 20,000 malicious and benign files (collected Nov./2017), the two models achieved around 100% accuracy for both classifiers. Other past work focused on feature extraction methods, for instance authors in [12] stressed on enhancing PDF Malware classification by identifying the most valuable features in PDF file using two PDF analysis tools PeePDF and PDFiD then evaluated those features via a wrapper function decreasing the feature set by 60% resulting in top 14 features extracted from PDF documents, and evaluating the classification on a dataset retrieved from VirusTotal using multiple ML classifiers Random-Forest, SVM, DNN and C50 Decision Tree, achieving a max accuracy of 96.8%. Couple of researches focused on the same dataset CIC-Evasive-PDFMal2022 [13], authors of [14] presented a new detection system utilizing "Adaboost" decision tree algorithm, splitting the data into training and testing dataset with 80:20 achieving accuracy of around 98.8% and a low PDF Malware detection overhead in respect to timing of couple of micro seconds, meanwhile in [15] researches used two datasets of Contagio and CIC-Evasive-PDFMal2022 on two different machine learning classifiers of Long Short Term Memory S-LSTM and IWO-S-LSTM achieving an accuracy of 97.06% and 98.20% on our concerned dataset of CIC-Evasive-PDFMal2022.…”

mentioning

confidence: 99%

PDF Malware Detection using Machine learning

AlMahadeen

Alkasassbeh

2023

Preprint

Portable Document Format (PDF) is one of the most widely used file types worldwide for data exchange. This has encouraged hackers to use such files to spread malicious content, employing a variety of methods and techniques. Security researchers, in turn, keep trying to improve detection methods to cope with the rapidly increasing number of malware samples appearing daily. One commonly used detection technique today is artificial intelligence and machine learning classification to help detect PDF malware. In this paper, we apply the Random Forest machine learning classifier to the newly released PDF malware dataset CIC-Evasive-PDFMal2022 to detect malicious PDF documents, with results showing a detection accuracy of around 99.5%.
