Malware Detection in PDF Files using Machine Learning

Cuan, Bonan; Damien, Aliénor; Delaplace, Claire; Valois, Mathieu

doi:10.5220/0006884704120419

Cited by 10 publications

(3 citation statements)

References 5 publications


“…The first step involves acquiring the dataset containing malware detection data. For this purpose, publicly available data was collected, specifically utilizing the dataset for the classification of malware with PE Headers sourced from github.com [11]. This dataset serves as the foundation for classifying PE files into two categories: malware and benign. The dataset used in the study is extensive, comprising over 138,000 instances.…”

Section: Research Methods, 2.1 Research Design (mentioning)

confidence: 99%

Optimizing Malware Detection Using Back Propagation Neural Network and Hyperparameter Tuning

Siregar, Soim, Fadhli

2023

IJAIDM


The escalating growth of the internet has led to an increase in cyber threats, particularly malware, posing significant risks to computer systems and networks. This research addresses the challenge of developing sophisticated malware detection systems by optimizing the Back Propagation Neural Network (BPNN) with hyperparameter tuning. The specific focus is on fine-tuning essential hyperparameters, including dropout rate, number of neurons in hidden layers, and number of hidden layers, to enhance the accuracy of malware detection. A Back Propagation Neural Network (BPNN) with dropout regularization is trained on an extensive dataset as part of the research design. Hyperparameter optimization is conducted using GridSearchCV, with experiments varying learning rates and epochs. The best configuration achieves outstanding results, with 98% accuracy, precision, recall, and F1-score. The proposed approach presents an efficient and reliable solution to bolster cybersecurity systems against malware threats.
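The exhaustive search over hyperparameter combinations described in this abstract can be sketched in a few lines. The block below is a hand-rolled grid search in plain Python, not scikit-learn's GridSearchCV and not the authors' code. The single-neuron logistic model, the synthetic two-feature data, and the names `train_logistic`, `accuracy`, and `grid_search` are all assumptions made for illustration; only the idea of varying learning rate and epochs comes from the abstract.

```python
import itertools
import math
import random


def train_logistic(samples, labels, learning_rate, epochs):
    """Fit a single-neuron logistic model with batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        n = len(samples)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def accuracy(model, samples, labels):
    """Fraction of samples whose predicted class matches the label."""
    weights, bias = model
    correct = 0
    for x, y in zip(samples, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == (y == 1))
    return correct / len(samples)


def grid_search(train, valid, grid):
    """Train once per combination in `grid`; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = train_logistic(train[0], train[1], **params)
        score = accuracy(model, valid[0], valid[1])
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic stand-in data (NOT a malware dataset): label is 1 when x0 + x1 > 1.
rng = random.Random(0)
points = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if x0 + x1 > 1.0 else 0 for x0, x1 in points]
train = (points[:150], labels[:150])
valid = (points[150:], labels[150:])

grid = {"learning_rate": [0.01, 0.1, 1.0], "epochs": [10, 100]}
best_params, best_score = grid_search(train, valid, grid)
print(best_params, best_score)
```

A single held-out split keeps the sketch short. Scoring each combination with k-fold cross-validation, as GridSearchCV does, would give a less noisy comparison at k times the training cost.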


“…Using a gradient-descent (GD) approach, the naive SVM used by the authors in [37] was easily deceived by us. The authors also devised defenses against this assault by setting a threshold over each considered feature.…”

Section: Literature Review (mentioning)

confidence: 99%

PDF Malware Detection Based on Optimizable Decision Trees

2022


Portable document format (PDF) files are one of the most universally used file types. This has incentivized hackers to develop methods to use these normally innocent PDF files to create security threats via infection vector PDF files. This is usually realized by hiding embedded malicious code in the victims’ PDF documents to infect their machines. This, of course, results in PDF malware and requires techniques to identify benign files from malicious files. Research studies indicated that machine learning methods provide efficient detection techniques against such malware. In this paper, we present a new detection system that can analyze PDF documents in order to identify benign PDF files from malware PDF files. The proposed system makes use of the AdaBoost decision tree with optimal hyperparameters, which is trained and evaluated on a modern inclusive dataset, viz. Evasive-PDFMal2022. The investigational assessment demonstrates a lightweight and accurate PDF detection system, achieving a 98.84% prediction accuracy with a short prediction interval of 2.174 μSec. To this end, the proposed model outperforms other state-of-the-art models in the same study area. Hence, the proposed system can be effectively utilized to uncover PDF malware at a high detection performance and low detection overhead.
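For readers unfamiliar with the boosting scheme named in this abstract, the following is a minimal sketch of discrete AdaBoost over one-level decision trees (stumps) in plain Python. It is an illustration under stated assumptions, not the paper's system: the function names `stump_predict`, `fit_stump`, `adaboost`, and `predict` are hypothetical, the six-point dataset is a toy, and no hyperparameter optimization is performed.

```python
import math


def stump_predict(stump, row):
    """A stump votes `polarity` when the feature exceeds the threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity


def fit_stump(X, y, w):
    """Return (weighted_error, stump) for the best stump under weights w."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best


def adaboost(X, y, rounds):
    """Train `rounds` stumps; labels in y must be -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, stump = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified points, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can separate (+ + - - + +).
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

The toy labels are chosen so that any single threshold misclassifies at least two points, while a weighted vote of three stumps fits all six. On real feature vectors the number of rounds and the depth of the base trees are the hyperparameters one would tune.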


“…Since Portable Document Format files can include a variety of harmful material, including embedded scripts, exploits, and malicious URLs, it can be difficult to detect malware in them. A reading flaw might be used by malware software to try to infect a machine [2]. Adobe Acrobat Reader discovered a huge number of vulnerabilities in 2017.…”

Section: Introduction (mentioning)

confidence: 99%

Comparative Analysis of Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria

Khan, Arshad, Shah Khan

2023

Journal of Cyber Security


The proliferation of maliciously coded documents as file transfers increase has led to a rise in sophisticated attacks. Portable Document Format (PDF) files have emerged as a major attack vector for malware due to their adaptability and wide usage. Detecting malware in PDF files is challenging due to its ability to include various harmful elements such as embedded scripts, exploits, and malicious URLs. This paper presents a comparative analysis of machine learning (ML) techniques, including Naive Bayes (NB), K-Nearest Neighbor (KNN), Average One Dependency Estimator (A1DE), Random Forest (RF), and Support Vector Machine (SVM) for PDF malware detection. The study utilizes a dataset obtained from the Canadian Institute for Cyber-security and employs different testing criteria, namely percentage splitting and 10-fold cross-validation. The performance of the techniques is evaluated using F1-score, precision, recall, and accuracy measures. The results indicate that KNN outperforms other models, achieving an accuracy of 99.8599% using 10-fold cross-validation. The findings highlight the effectiveness of ML models in accurately detecting PDF malware and provide insights for developing robust systems to protect against malicious activities.

KEYWORDS: Cyber-security; PDF malware; model training; testing

… to the research team members, Bilal Khan (BK), Muhammad Arshad (MA), and Sarwar Shah Khan (SSK), for their collaboration and valuable insights throughout the research process. We extend our appreciation to the institutions that supported this research work. We are deeply grateful to the Canadian Institute for Cyber-security for providing the dataset that formed the foundation of our analysis. Their contributions have been instrumental in enabling us to conduct this study on PDF malware detection and assess the performance of various ML models.
Our heartfelt appreciation goes to all the individuals and institutions that reviewed and provided constructive feedback on this research paper. Your valuable input helped improve the quality and rigor of our work. Without the collective efforts and support of all these individuals and organizations, this research paper would not have been possible. Thank you all for your invaluable contributions.
