2021
DOI: 10.32604/cmc.2021.018260

Toward Robust Classifiers for PDF Malware Detection

Abstract: Malicious Portable Document Format (PDF) files represent one of the largest threats in the computer security space. Significant research has been done using handwritten signatures and machine learning based on detection via manual feature extraction. These approaches are time consuming, require substantial prior knowledge, and the list of features must be updated with each newly discovered vulnerability individually. In this study, we propose two models for PDF malware detection. The first model is a convoluti…

citations
Cited by 3 publications
(2 citation statements)
references
References 21 publications
“…The CNN model achieved superior performance compared to traditional machine learning classifiers including SVM, Decision Tree, Naive Bayes, and Random Forest. Albahar et al. [20] presented two learning-based models for detection of malicious PDFs and experimented on 30,797 infected and benign documents collected from the Contagio dataset and VirusTotal. Their first model was a CNN model that used tree-based PDF file structure as features and yielded 99.33% accuracy; the second model was an ensemble SVM model with different kernels which used n-gram with object content encoding as features and yielded an accuracy of 97.3%.…”
Section: Related Work
mentioning
confidence: 99%
“…have presented a comprehensive survey of malware detection approaches (signature-based, behavior-based, and machine-learning-based), listing the pros and cons of each approach and describing the different malware obfuscation mechanisms. They concluded that signature-based detection is the most efficient at detecting well-known malicious code, while behavior-based detection is more effective at detecting more complex malware and zero-day malware. The authors of [8] focused on the most recent PDF malware detection techniques, including PDF feature extraction and analysis, and surveyed a variety of detection approaches, including statistical analysis (which may, for example, focus on byte-level comparison between malicious and benign PDFs), signature matching, and ML classification. Marwan Albahar in [9] applied support vector machine (SVM) and convolutional neural network (CNN) machine learning classifiers to two PDF malware datasets: one retrieved from VirusTotal [10], consisting of 10,603 malicious files (collected Dec. 2017), and the Contagio [11] dataset, consisting of approximately 20,000 malicious and benign files (collected Nov. 2017). The two models achieved around 100% accuracy for both classifiers. Other past work focused on feature extraction methods. For instance, the authors of [12] sought to enhance PDF malware classification by identifying the most valuable features in a PDF file using two PDF analysis tools, PeePDF and PDFiD. They then evaluated those features via a wrapper function, decreasing the feature set by 60% to the top 14 features extracted from PDF documents, and evaluated the classification on a dataset retrieved from VirusTotal using multiple ML classifiers (Random Forest, SVM, DNN, and C50 Decision Tree), achieving a maximum accuracy of 96.8%. A couple of studies focused on the same dataset, CIC-Evasive-PDFMal2022 [13]. The authors of [14] presented a new detection system using the AdaBoost decision tree algorithm, splitting the data into training and testing sets at an 80:20 ratio and achieving an accuracy of around 98.8% with a low PDF malware detection time overhead of a couple of microseconds. Meanwhile, in [15] the researchers used two datasets, Contagio and CIC-Evasive-PDFMal2022, with two long short-term memory classifiers, S-LSTM and IWO-S-LSTM, achieving accuracies of 97.06% and 98.20% on our dataset of concern, CIC-Evasive-PDFMal2022.…”
mentioning
confidence: 99%