BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware

Yang, Limin; Ciptadi, Arridhana; Laziuk, Ihar; Ahmadzadeh, Ali; Wang, Gang

doi:10.1109/spw53761.2021.00020

Cited by 83 publications

(76 citation statements)

References 13 publications

“…During our first experiments, we noticed that malware from different open datasets were not equal from an evasion perspective: some were quite simple to evade, whereas others from the BODMAS Malware Dataset [11] (created and maintained by Blue Hexagon and UIUC) led to some evasion difficulties. We suppose that these differences may be related to the age of the malware: the BODMAS dataset is quite recent as it contains 57,293 malware samples collected from August 2019 to September 2020.…”

Section: Methods (mentioning)

confidence: 99%

“…The advances in the fields of artificial intelligence, ML, and deep learning make it possible to improve malware detection, and classification [7,8]. In particular, some notable datasets have been made publicly available, such as Ember [9], SOREL-20M [10] or recently BODMAS [11]. These open datasets motivate new works, help in resolving existing challenges, and are very useful to benchmark new research proposals.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…We want to thank the following colleagues Daniel Juteau, Philippe Calvet, Sok-yen Loui and Adam Ouorou. We are very grateful to BODMAS team for their valuable dataset [11].…”

Section: Acknowledgments (mentioning)

confidence: 99%

MERLIN -- Malware Evasion with Reinforcement LearnINg

Quertier, Marais, Morucci et al. 2022

Preprint

In addition to signature-based and heuristics-based detection techniques, machine learning (ML) is widely used to generalize to new, never-before-seen malicious software (malware). However, it has been demonstrated that ML models can be fooled by tricking the classifier into returning the incorrect label. These studies, for instance, usually rely on a prediction score that is fragile to gradient-based attacks. In the context of a more realistic situation where an attacker has very little information about the outputs of a malware detection engine, modest evasion rates are achieved [1]. In this paper, we propose a method using reinforcement learning with DQN and REINFORCE algorithms to challenge two state-of-the-art ML-based detection engines (MalConv & EMBER) and a commercial antivirus (AV) classified by Gartner as a leader AV [2]. Our method combines several actions, modifying a Windows portable executable (PE) file without breaking its functionality. Our method also identifies which actions perform better and compiles a detailed vulnerability report to help mitigate the evasion. We demonstrate that REINFORCE achieves very good evasion rates even on a commercial AV with limited available information.

“…Malware [63] This is a collection of malicious files from several malware-based datasets such as the Genome Project, VirusTotal, Virus Share, Comodo, Contagio, Microsoft and DREBIN. These datasets are commonly used for data-driven malware analysis and evaluation of existing malware detection systems utilizing machine learning techniques.…”

Section: Cic-ddos2019 [61] (mentioning)

confidence: 99%

Cybersecurity Threats and Their Mitigation Approaches Using Machine Learning—A Review

Ahsan, Nygard, Gomes et al. 2022

JCP

Machine learning is of rising importance in cybersecurity. The primary objective of applying machine learning in cybersecurity is to make the process of malware detection more actionable, scalable and effective than traditional approaches, which require human intervention. The cybersecurity domain involves machine learning challenges that require efficient methodical and theoretical handling. Several machine learning and statistical methods, such as deep learning, support vector machines and Bayesian classification, among others, have proven effective in mitigating cyber-attacks. The detection of hidden trends and insights from network data and building of a corresponding data-driven machine learning model to prevent these attacks is vital to design intelligent security systems. In this survey, the focus is on the machine learning techniques that have been implemented on cybersecurity data to make these systems secure. Existing cybersecurity threats and how machine learning techniques have been used to mitigate these threats have been discussed. The shortcomings of these state-of-the-art models and how attack patterns have evolved over the past decade have also been presented. Our goal is to assess how effective these machine learning techniques are against the ever-increasing threat of malware that plagues our online community.

“…To reduce the inference latency, we further integrate the two models into a novel file-size-aware two-stage framework. We assessed our proposed designs on three datasets, BIG 2015 [22], and two datasets derived from the BODMAS PE malware dataset [23], BODMAS-11 and BODMAS-49. Based on these experiments, our paper makes the following contributions:…”

Section: Introduction (mentioning)

confidence: 99%

Self-Attentive Models for Real-Time Malware Classification

Zhang, Kinawi et al. 2022

IEEE Access

Malware classification is a critical task in cybersecurity, as it offers insights into the threats that malware poses to the victim device and helps in the design of countermeasures. For real-time malware classification, due to the large amount of potential malware present in the network, there is a challenge of achieving high classification accuracy while maintaining low inference latency. We first introduce two self-attention transformer-based classifiers, SeqConvAttn and ImgConvAttn, to replace the currently predominant CNN-based classifiers. We then devise a file-size-aware two-stage framework to combine the two proposed models, thereby controlling the tradeoff between accuracy and latency for real-time classification. To assess our proposed designs, we conduct experiments on three malware datasets, the Microsoft Malware Classification Challenge (BIG 2015) and two selected subsets from the BODMAS PE malware dataset, BODMAS-11 and BODMAS-49. We show that our transformer-based designs can achieve better classification accuracy than traditional CNN-based designs. Furthermore, we show that the proposed two-stage framework can reduce the average model inference latency while maintaining superior accuracy, thereby fulfilling the requirements of real-time classification.
