2022
DOI: 10.20944/preprints202209.0103.v1
Preprint

PDF Malware Detection Based on Optimizable Decision Trees

Abstract: Portable Document Format (PDF) files are among the most widely used file types. This has attracted hackers to develop methods that turn these normally innocent PDF files into security threats by using them as infection vectors. This is usually done by hiding malicious code embedded in victims' PDF documents to infect their machines. The result is PDF malware, which calls for techniques to distinguish benign files from malicious ones. Research studies indicated that machine-learn…

citations
Cited by 10 publications
(2 citation statements)
references
References 42 publications
“…In this section, we review some of the researchers' work in this field and the techniques that were used, with the results they obtained. The research in [28] used CNN-LSTM. To address serious shortcomings in malware detection, the suggested CNN-LSTM method is employed for the early identification of malware; its accuracy is 99%. The accuracy of the other classifiers is 98% for DT and 95% for SVM. The LSTM model's accuracy is 99%, its recall is 99%, and its F1 score is 1.…”
Section: Machine Learning In Malware Detection
mentioning
confidence: 99%
“…have presented a comprehensive survey of malware detection approaches (signature-based, behavior-based, and machine-learning-based), listing the pros and cons of each approach and the different malware obfuscation mechanisms, and concluding that signature-based detection is the most efficient at detecting well-known malicious code, while behavior-based detection is more effective at detecting more complex malware and zero-day malware. The authors of [8] focused on the most recent PDF malware detection techniques, including PDF feature extraction and analysis, and surveyed a variety of detection approaches, including statistical analysis, which may focus, for example, on byte-level comparison between malicious and benign PDFs, as well as signature matching and ML classification. Marwan Albahar in [9] applied support vector machine (SVM) and convolutional neural network (CNN) classifiers to two PDF malware datasets: one retrieved from VirusTotal [10], consisting of 10,603 malicious files (collected Dec. 2017), and the Contagio [11] dataset, consisting of approximately 20,000 malicious and benign files (collected Nov. 2017); both classifiers achieved around 100% accuracy. Other past work focused on feature extraction methods. For instance, the authors of [12] focused on enhancing PDF malware classification by identifying the most valuable features in PDF files using two PDF analysis tools, PeePDF and PDFiD. They then evaluated those features via a wrapper function, decreasing the feature set by 60% to the top 14 features extracted from PDF documents, and evaluated the classification on a dataset retrieved from VirusTotal using multiple ML classifiers (Random Forest, SVM, DNN, and C50 Decision Tree), achieving a maximum accuracy of 96.8%. A couple of studies focused on the same dataset, CIC-Evasive-PDFMal2022 [13]. The authors of [14] presented a new detection system utilizing the "Adaboost" decision tree algorithm, splitting the data into training and testing sets at 80:20 and achieving an accuracy of around 98.8% with a low PDF malware detection overhead of a couple of microseconds. Meanwhile, in [15] researchers used two datasets, Contagio and CIC-Evasive-PDFMal2022, with two Long Short-Term Memory classifiers, S-LSTM and IWO-S-LSTM, achieving accuracies of 97.06% and 98.20% on our dataset of concern, CIC-Evasive-PDFMal2022.…”
mentioning
confidence: 99%