Proceedings of the 2019 Workshop on Binary Analysis Research (BAR 2019)
DOI: 10.14722/bar.2019.23057
A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

Abstract: Given a closed-source program, such as most proprietary software and viruses, binary code analysis is indispensable for many tasks, such as code plagiarism detection and malware analysis. Today, source code is very often compiled for various architectures, making cross-architecture binary code analysis increasingly important. A binary, after being disassembled, is expressed in an assembly language. Thus, recent work has started exploring Natural Language Processing (NLP) inspired binary code analysis. In NLP, w…

Cited by 39 publications (27 citation statements) · References 42 publications
“…1:1 mapping phase that is similar to the function matching research and classification phase using the semantic-aware neural network. In the literature, function matching is addressed by using traditional feature-based approaches [1], [7], [3], [8], [2], [5], [4], [9], [10] and also by using deep learning approaches [11], [12], [13] with the objective of finding the similarity between two functions. In contrast, our neural network-based approach aims to find the similarities and differences in two binary functions.…”
Section: Related Work (mentioning)
confidence: 99%
“…Zuo et al. [12] proposed INNEREYE, which uses an LSTM to treat instructions as words and basic blocks as sentences, and trains the neural network to compare two basic block embeddings across architectures to predict their similarity score. Redmond et al. [13] extend Zuo et al.'s [12] work, using a joint learning approach to generate instruction embeddings. Lie et al. [29] use a combination of distance features to find similarities between two functions.…”
Section: Embedding Structural Features (mentioning)
confidence: 99%
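The basic-block comparison described in this statement (instructions as words, blocks as sentences, an LSTM producing block embeddings that are then compared) can be illustrated with a minimal PyTorch sketch. The class name, dimensions, and cosine-similarity comparison below are assumptions for illustration, not the INNEREYE implementation.

import torch.nn as nn
import torch.nn.functional as F

class BlockEncoder(nn.Module):
    """Sketch: embed instruction ids ("words") and summarize a basic block ("sentence") with an LSTM."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # instruction id -> vector
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, instr_ids):
        # instr_ids: (batch, block_len) integer instruction ids
        x = self.embed(instr_ids)
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]                                      # (batch, hidden_dim) block embedding

def block_similarity(encoder, block_a, block_b):
    """Cosine similarity between two basic-block embeddings (higher means more similar)."""
    return F.cosine_similarity(encoder(block_a), encoder(block_b), dim=-1)

In a siamese arrangement, two such encoders (one per architecture, or weight-sharing within a single architecture) would be trained so that semantically similar blocks receive nearby embeddings.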
“…Zuo et al.'s work [10] addressed the task of binary similarity by converting a basic block into an embedding and measuring the distance between two embeddings; the instructions in a basic block are combined through the use of a Long Short-Term Memory (LSTM) network. Redmond et al. [11] proposed a joint learning approach to generating instruction embeddings that capture not only the semantics of instructions within an architecture but also their semantic relationships across architectures. SAFE [12], proposed by Massarelli et al., is a general architecture for calculating binary function embeddings starting from disassembled binaries, using a self-attentive recurrent neural network that parses all instructions according to their addresses.…”
Section: NLP-based Binary Code Similarity Detection (mentioning)
confidence: 99%
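The cross-architecture instruction embedding idea attributed to Redmond et al. [11] (instructions from different architectures mapped into one shared space so that semantically related instructions end up close together) can be sketched with a simple contrastive objective. This is not the joint-learning procedure of the cited work; the vocabulary sizes, the aligned-pair supervision, and the margin loss below are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical vocabularies of normalized x86 and ARM instructions, embedded into one shared space.
x86_embed = nn.Embedding(5000, 100)
arm_embed = nn.Embedding(4000, 100)
opt = torch.optim.Adam(list(x86_embed.parameters()) + list(arm_embed.parameters()), lr=1e-3)

def contrastive_step(x86_ids, arm_ids, labels, margin=1.0):
    """One training step: pull embeddings of equivalent x86/ARM instruction pairs together
    and push non-equivalent pairs at least `margin` apart. labels: 1.0 if equivalent, else 0.0."""
    a, b = x86_embed(x86_ids), arm_embed(arm_ids)
    d = F.pairwise_distance(a, b)
    loss = (labels * d.pow(2) + (1 - labels) * F.relu(margin - d).pow(2)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()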
“…Inspired by NLP (natural language processing), Baldoni et al. [21] embed instructions with the word2vec model and optimize the hyperparameters using a siamese structure. Redmond et al. [22] explore binary instruction embedding across architectures. They convert the binary code to an intermediate language and record the input/output as a signature for comparison.…”
Section: Introduction (mentioning)
confidence: 99%
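The word2vec-style instruction embedding mentioned here (each normalized instruction treated as a word, each basic block as a sentence) can be reproduced in a few lines with gensim. The toy corpus and the operand normalization (reg/imm/addr) below are assumptions for illustration.

from gensim.models import Word2Vec

# Assumed toy corpus: each basic block is a "sentence"; each normalized
# instruction (operands abstracted to reg/imm/addr) is a "word".
blocks = [
    ["mov reg, imm", "add reg, reg", "jmp addr"],
    ["mov reg, imm", "cmp reg, reg", "jne addr"],
]

# Skip-gram word2vec over instruction "words" (gensim 4.x API).
model = Word2Vec(sentences=blocks, vector_size=100, window=2,
                 min_count=1, sg=1, epochs=50)

vec = model.wv["mov reg, imm"]                      # embedding of one instruction
neighbours = model.wv.most_similar("mov reg, imm")  # semantically close instructions

A siamese network, as in the statement above, would then compare or fine-tune such embeddings so that similar code fragments map to nearby points.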