SAFE: Self-Attentive Function Embeddings for Binary Similarity

Massarelli, Luca; Di Luna, Giuseppe Antonio; Petroni, Filippo; Baldoni, Roberto; Querzoni, Leonardo

doi:10.1007/978-3-030-22038-9_15

Cited by 132 publications

(175 citation statements)

References 23 publications

Supporting

Mentioning

174

Contrasting


“…We performed a further comparison with the SAFE architecture proposed by us in a previous paper [29]. SAFE does not use the CFG but a self-attentive recurrent neural network that parses all instructions according to their addresses.…”

Section: Discussion (mentioning)

confidence: 99%

“…• we discuss our findings in Section VI. We note that despite taking into account the syntactic structure of code using the CFG our techniques underperform or have comparable performances, on both task, when compared with a solution [29] that examine sequentially all the disassembled instructions, without information on the control flow given by the CFG. We discuss our hypothesis on this phenomena, giving a possible explanation on the shortcomings of blindly embedding the CFG.…”

Section: Introduction (mentioning)

confidence: 92%


Investigating Graph Embedding Neural Networks with Unsupervised Features Extraction for Binary Analysis

Massarelli, Luna, Petroni et al. 2019

Proceedings 2019 Workshop on Binary Analysis Research

Self Cite


In this paper we investigate the use of graph embedding networks, with unsupervised features learning, as neural architecture to learn over binary functions. We propose several ways of automatically extract features from the control flow graph (CFG) and we use the structure2vec graph embedding techniques to translate a CFG to a vectors of real numbers. We train and test our proposed architectures on two different binary analysis tasks: binary similarity, and, compiler provenance. We show that the unsupervised extraction of features improves the accuracy on the above tasks, when compared with embedding vectors obtained from a CFG annotated with manually engineered features (i.e., ACFG proposed in [39]). We additionally compare the results of graph embedding networks based techniques with a recent architecture that do not make use of the structural information given by the CFG, and we observe similar performances. We formulate a possible explanation of this phenomenon and we conclude identifying important open challenges.


“…Also, it does not consider any program-wide CFG structural information during analysis. SAFE [41] leverages a self-attentive neural network to generate function embeddings.…”

Section: A Code Similarity Detection (mentioning)

confidence: 99%

DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing

Duan, Li, Wang et al. 2020

Proceedings 2020 Network and Distributed System Security Symposium

127


Binary diffing analysis quantitatively measures the differences between two given binaries and produces fine-grained basic block level matching. It has been widely used to enable different kinds of critical security analysis. However, all existing program analysis and machine learning based techniques suffer from low accuracy, poor scalability, coarse granularity, or require extensive labeled training data to function. In this paper, we propose an unsupervised program-wide code representation learning technique to solve the problem. We rely on both the code semantic information and the program-wide control flow information to generate basic block embeddings. Furthermore, we propose a k-hop greedy matching algorithm to find the optimal diffing results using the generated block embeddings. We implement a prototype called DEEPBINDIFF and evaluate its effectiveness and efficiency with a large number of binaries. The results show that our tool outperforms the state-of-the-art binary diffing tools by a large margin for both cross-version and cross-optimization-level diffing. A case study for OpenSSL using real-world vulnerabilities further demonstrates the usefulness of our system.


“…Besides, manually feature engineering needs a lot of domain knowledge of assembly code, which is not friendly for most researchers. To address the above issues, static word representation based methods are applied to program language processing in recent works [4], [6]-[8]. In these works, tokens in the basic block, like operators (opcodes) and operands, are represented as fixed-dimension vectors.…”

Section: A Basic Block Embedding (mentioning)

confidence: 99%

Similarity Metric Method for Binary Basic Blocks of Cross-Instruction Set Architecture

Zhang, Sun, Pang et al. 2020

Proceedings 2020 Workshop on Binary Analysis Research


Basic block similarity analysis is a fundamental technique in many machine learning-based binary program analysis methods. The key to basic block similarity analysis is mapping the semantic information of the basic block to a fixed-dimension vector, which is the so-called basic block embedding. However, existing solutions to basic block embedding suffer from two major limitations. 1) The basic block embedding contains limited semantic information; 2) they are only applicable to a single instruction set architecture (ISA). To overcome these limitations, we propose a cross-ISA oriented solution for basic block embedding which utilizes an NMT (Neural Machine Translation) model to establish the connection between two ISAs. The proposed embedding model can powerfully map rich semantics of basic blocks from arbitrary ISAs into fixed-dimension vectors. Several measures have been taken to further improve the embedding model. To guide the embedding model to a better state, we creatively use the pretrained model to generate hard negative samples. To promote the effectiveness of the proposed embedding model, we propose a reasonable assembly instruction normalization method in the data preprocessing phase, which is shown to outperform the previous methods. A similarity metric method is then derived and a million-scale dataset is presented to train and evaluate this method. To the best of our knowledge, this is the first million-scale dataset in this field. We implement a prototype system MIRROR. The experimental results show that MIRROR significantly outperforms the representative baseline in the respect that the basic block embeddings, i.e., the vectors, are more distinguishable to discriminate between similar basic blocks and dissimilar ones, and as a result, MIRROR can obtain obviously more accurate evaluation results. The significance of pre-training, the effectiveness of the proposed negative sampling method, and the instruction normalization method have also been justified in experiments.
