2022
DOI: 10.1016/j.cose.2022.102607
Binary code traceability of multigranularity information fusion from the perspective of software genes

Cited by 6 publications (4 citation statements) · References 20 publications
“…These methods often rely on extensive labeled data, a requirement our approach reduces by applying the K-Nearest Neighbors (KNN) algorithm, allowing for effective classification with fewer labeled samples. Adding to the graph-based analysis landscape, Huang et al [11] proposed a multi-granularity fusion feature based on biological gene concepts for binary code traceability. Similarly, Zhao et al [12] introduced a malware homology identification method using subgraphs of the Function Dependence Graph (FDG) as genes.…”
Section: Related Work
Confidence: 99%
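As a rough illustration of the few-label classification idea in the statement above, the sketch below assigns a query feature vector to a class by majority vote among its k nearest labeled samples. The feature values, dimensionality, and family names are invented for the example and do not come from the cited works.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest labeled samples."""
    dists = np.linalg.norm(train_X - query, axis=1)   # distance to every labeled sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority label wins

# Small labeled set (few samples per class), standing in for binary-code feature vectors.
train_X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
train_y = np.array(["family_A", "family_A", "family_B", "family_B"])

print(knn_predict(train_X, train_y, np.array([0.15, 0.85]), k=3))  # -> "family_A"
```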
“…The graph convolutional network (GCN) [25][26][27] is a model that performs convolution operations on graphs. Marcheggiani et al [28] and Huang et al [29] demonstrated that sequence models and GCNs have complementary modeling capabilities; therefore, based on the instruction sequence vector obtained earlier, the GCN is used to fuse the edge information between basic blocks into block-level information. Based on this basic block intermediate representation vector, the main discussion is how to extract the jump relationship information between the CFG basic block nodes and generate basic block embeddings.…”
Section: GCN-based Basic Block Embedding
Confidence: 99%
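To make the fusion step in the statement above concrete, here is a minimal sketch of one GCN propagation step over a CFG's basic blocks: per-block instruction-sequence vectors are aggregated along jump edges using the symmetrically normalised adjacency matrix. The adjacency matrix, feature sizes, and weights are random placeholders, not values from the cited papers.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: add self-loops, symmetric normalisation, then ReLU(A_norm @ X @ W)."""
    A_hat = A + np.eye(A.shape[0])                 # self-loops keep each block's own features
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # D^-1/2 (A + I) D^-1/2
    return np.maximum(A_norm @ X @ W, 0.0)         # aggregate neighbours, project, ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],                        # jump edges between 4 basic blocks
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))                        # stand-in instruction-sequence vectors
W = rng.normal(size=(8, 4))                        # learnable weights (random here)
block_embeddings = gcn_layer(A, X, W)              # fused block-level representations
print(block_embeddings.shape)                      # (4, 4)
```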
“…Graph neural networks mainly include graph convolution networks (GCN) [20,21], graph attention networks (GAT) [22], graph autoencoders (GAE) [23], etc. In the field of semantic representation of binary code, Qiao et al [24] and Massarelli et al [25] use GCN for semantic embedding of functions, but since the CFG of a function is a directed graph, this causes a certain loss of structural information.…”
Section: GAT
confidence: 99%
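The sketch below shows a minimal single-head graph-attention layer over a directed CFG adjacency matrix, illustrating how per-edge attention coefficients let the aggregation respect edge direction, which is the structural information a symmetrically normalised GCN tends to smooth away. All matrices and weights are random placeholders; nothing here reproduces the cited implementations.

```python
import numpy as np

def gat_layer(A, X, W, a_src, a_dst):
    """Single-head graph attention: score each directed edge, softmax over neighbours, aggregate."""
    H = X @ W                                            # project node features
    A_hat = A + np.eye(A.shape[0])                       # self-loops so every node attends to itself
    e = H @ a_src[:, None] + (H @ a_dst[:, None]).T      # raw score e[i, j] for edge i -> j
    e = np.where(e > 0, e, 0.2 * e)                      # LeakyReLU
    e = np.where(A_hat > 0, e, -np.inf)                  # mask pairs with no directed edge
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # softmax over each node's neighbours
    return alpha @ H                                     # attention-weighted aggregation

rng = np.random.default_rng(1)
A = np.array([[0, 1, 0],                                 # directed jumps: 0->1, 1->2 (not symmetric)
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
X = rng.normal(size=(3, 8))                              # stand-in basic-block feature vectors
W = rng.normal(size=(8, 4))
a_src, a_dst = rng.normal(size=4), rng.normal(size=4)    # attention parameters (random here)
print(gat_layer(A, X, W, a_src, a_dst).shape)            # (3, 4)
```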