Large-scale malware classification using random projections and neural networks

Dahl, George E.; Stokes, Jack W.; Deng, Li; Yu, Dong

doi:10.1109/icassp.2013.6638293

Cited by 341 publications

(175 citation statements)

References 12 publications

Supporting

Mentioning

169

Contrasting


“…In our experiments the logistic regression classifier outperformed Naive Bayes, SVM and Decision Trees implementations from [16], verifying a high performance previously observed in [6]. The schema for our approach is shown in Figure 2.…”

Section: Building An Ensemble (supporting)

confidence: 82%

“…We consider [6], where random projections were used to reduce the feature space (sparse binary features, API trigrams and API calls) to classify Windows malware on a dataset of several million samples, to be the highest-impact contribution to the dimensionality reduction problem in malware classification. Although their work is not directly dealing with Android malware, we consider this publication to be very relevant due to its tackling a similar large-scale classification problem.…”

Section: Related Literature Review (mentioning)

confidence: 99%


Android Malware Detection: Building Useful Representations

Sayfullina

Eirola

Komashinsky

et al. 2016

2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA)


Abstract: The problem of proactively detecting Android malware has proven to be a challenging one. The challenges stem from a variety of issues, but recent literature has shown that this task is hard to solve with high accuracy when only a restricted set of features, like permissions or similar fixed sets of features, are used. The opposite approach of including all available features is also problematic, as it causes the feature space to grow beyond reasonable size. In this paper we focus on finding an efficient way to select a representative feature space, preserving its discriminative power on unseen data. We go beyond traditional approaches like Principal Component Analysis, which is too heavy for large-scale problems with millions of features. In particular we show that many feature groups that can be extracted from Android application packages, like features extracted from the manifest file or strings extracted from the Dalvik Executable (DEX), should be filtered and used in classification separately. Our proposed dimensionality reduction scheme is applied to each group separately and consists of raw string preprocessing, feature selection via log-odds and finally applying random projections. With the size of the feature space growing exponentially as a function of the training set's size, our approach drastically decreases the size of the feature space by several orders of magnitude; this in turn allows accurate classification to become possible in a real-world scenario. After reducing the dimensionality we use the feature groups in a light-weight ensemble of logistic classifiers. We evaluated the proposed classification scheme on real malware data provided by the antivirus vendor and achieved state-of-the-art 88.24% true positive and reasonably low 0.04% false positive rates with a significantly compressed feature space on a balanced test set of 10,000 samples.
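The reduction steps named in this abstract (feature selection via log-odds, then random projection) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the smoothed log-odds-ratio score, and the Achlioptas-style sparse projection matrix are all assumptions made for the example, and each sample is represented as a set of active binary feature indices.

```python
import math
import random


def log_odds_scores(samples, labels, n_features, smoothing=1.0):
    """Score each binary feature by |log-odds ratio| between malware (1) and benign (0)."""
    pos = [0] * n_features
    neg = [0] * n_features
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for active, label in zip(samples, labels):
        counts = pos if label == 1 else neg
        for j in active:
            counts[j] += 1
    scores = []
    for j in range(n_features):
        # Additive smoothing keeps both probabilities strictly inside (0, 1).
        p = (pos[j] + smoothing) / (n_pos + 2 * smoothing)
        q = (neg[j] + smoothing) / (n_neg + 2 * smoothing)
        scores.append(abs(math.log(p / (1 - p)) - math.log(q / (1 - q))))
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def make_projection(n_in, n_out, seed=0):
    """Sparse random matrix: entries are +s, 0, -s with probabilities 1/6, 2/3, 1/6."""
    rng = random.Random(seed)
    s = math.sqrt(3.0 / n_out)
    values = (s, 0.0, 0.0, 0.0, 0.0, -s)
    return [[rng.choice(values) for _ in range(n_out)] for _ in range(n_in)]


def project(active, kept, matrix):
    """Project one sample, using only the features that survived selection."""
    row_of = {feature: i for i, feature in enumerate(kept)}
    out = [0.0] * len(matrix[0])
    for feature in active:
        if feature in row_of:
            for c, value in enumerate(matrix[row_of[feature]]):
                out[c] += value
    return out


# Toy usage: feature 0 is malware-only, feature 1 benign-only, feature 2 uninformative.
samples = [{0, 2}, {0, 2}, {1, 2}, {1, 2}]
labels = [1, 1, 0, 0]
kept = select_top_k(log_odds_scores(samples, labels, n_features=3), k=2)
matrix = make_projection(n_in=len(kept), n_out=4)
reduced = [project(s, kept, matrix) for s in samples]
```

In the scheme described above, this reduction would be applied to each feature group separately, and the reduced vectors would then feed one logistic classifier per group.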


“…Thus, there is a need to analyze every new malware sample to see if it comes from an already known malware family or represents a new breed of malware. This malware classification problem belongs to the data mining domain, and hence considerable research efforts have been made to apply machine learning techniques such as classification [3], [4], clustering [5], [6], Artificial Neural Networks (ANNs) [7], Hidden Markov Models (HMMs) [8], [9], etc. to solve this problem.…”

Section: Introduction (mentioning)

confidence: 99%

Similarity-Based Malware Classification Using Hidden Markov Model

Imran

Afzal

Qadir

2015

2015 Fourth International Conference on Cyber Security, Cyber Warfare, and Digital Forensic (CyberSec)


The problem of malware classification has gained the attention of cyber security community due to the following facts: (1) thousands of new malware are generated every day (2) the global losses caused by malware are in billions of dollars every year. In this paper a novel malware classification scheme is proposed that is based on Hidden Markov Models (HMMs) and discriminative classifiers. Sequences of system calls generated by malware during execution are represented as observation sequences to train the HMMs. Individual malware samples are then evaluated against these models to generate similarity vectors, which are used to predict the class label for an unknown sample by training a discriminative classifier. Our novel combination of HMMs, dynamic program features and discriminative classifier has shown promising results in experiments performed using system call logs of real malware.
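The similarity-vector idea in this abstract can be illustrated with a short sketch: score one system-call sequence against one HMM per malware family and collect the scores into a vector for a downstream classifier. The function names, the per-symbol length normalization, and the assumption that HMM parameters are already trained (and that every step has non-zero probability) are choices made for this example, not details taken from the paper.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[i]    : initial probability of state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of state i emitting symbol o
    Assumes the sequence has non-zero probability under the model.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    total = sum(alpha)
    log_p = math.log(total)
    alpha = [a / total for a in alpha]
    for symbol in obs[1:]:
        alpha = [
            emit[j][symbol] * sum(alpha[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        total = sum(alpha)
        log_p += math.log(total)
        alpha = [a / total for a in alpha]  # rescale to avoid underflow
    return log_p


def similarity_vector(obs, family_models):
    """One length-normalized log-likelihood per family model (hypothetical feature vector)."""
    return [log_likelihood(obs, *model) / len(obs) for model in family_models]


# Toy usage with two single-state models over a two-symbol alphabet.
family_a = ([1.0], [[1.0]], [[0.9, 0.1]])
family_b = ([1.0], [[1.0]], [[0.1, 0.9]])
vector = similarity_vector([0, 0, 0], [family_a, family_b])
```

The resulting vector would then be the input to a discriminative classifier that predicts the family label.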


“…To the best of our knowledge, the only attempt at training DNNs on randomly projected data, and therefore the approach that is most relevant to our fixed-weight RP layers, was presented in [46]. Therein, Dahl et al. used randomly projected data as input to networks trained for the malware classification task.…”

Section: Related Work (mentioning)

confidence: 99%

Training neural networks on high-dimensional data using random projection

Wójcik

Kurdziel

2018

Pattern Anal Applic


Training deep neural networks (DNNs) on high-dimensional data with no spatial structure poses a major computational problem. It implies a network architecture with a huge input layer, which greatly increases the number of weights, often making the training infeasible. One solution to this problem is to reduce the dimensionality of the input space to a manageable size, and then train a deep network on a representation with fewer dimensions. Here, we focus on performing the dimensionality reduction step by randomly projecting the input data into a lower-dimensional space. Conceptually, this is equivalent to adding a random projection (RP) layer in front of the network. We study two variants of RP layers: one where the weights are fixed, and one where they are fine-tuned during network training. We evaluate the performance of DNNs with input layers constructed using several recently proposed RP schemes. These include: Gaussian, Achlioptas', Li's, subsampled randomized Hadamard transform (SRHT) and Count Sketch-based constructions. Our results demonstrate that DNNs with RP layer achieve competitive performance on high-dimensional real-world datasets. In particular, we show that SRHT and Count Sketch-based projections provide the best balance between the projection time and the network performance.
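As a rough illustration of one of the projection schemes listed in this abstract, the sketch below implements a Count Sketch-style fixed-weight projection: every input feature is assigned one output bucket and a random sign, so projecting a sparse vector costs time proportional to its number of non-zeros. The function names and the dictionary-based sparse representation are assumptions for the example, and the bucket and sign tables are drawn from a seeded generator rather than from hash functions.

```python
import random


def make_count_sketch(n_in, n_out, seed=0):
    """Draw a fixed bucket index and a fixed +/-1 sign for every input feature."""
    rng = random.Random(seed)
    buckets = [rng.randrange(n_out) for _ in range(n_in)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n_in)]
    return buckets, signs


def count_sketch(sparse_x, buckets, signs, n_out):
    """Project a sparse vector given as {feature_index: value}."""
    out = [0.0] * n_out
    for j, value in sparse_x.items():
        out[buckets[j]] += signs[j] * value
    return out


# Toy usage: reduce a 1,000-dimensional sparse input to 16 dimensions.
buckets, signs = make_count_sketch(n_in=1000, n_out=16)
hidden_input = count_sketch({3: 1.0, 250: 1.0, 999: 1.0}, buckets, signs, n_out=16)
```

Because the bucket and sign tables never change after construction, this corresponds to the fixed-weight variant of an RP layer; the fine-tuned variant would instead treat the projection weights as trainable parameters.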

