jTrans: jump-aware transformer for binary code similarity detection

Wang, Hao; Qu, Wenjie; Katz, Gilad; Zhu, Wenyu; Gao, Zeyu; Qiu, Han; Zhuge, Jing; Zhang, Chao

doi:10.1145/3533767.3534367

Cited by 68 publications

(22 citation statements)

References 39 publications


“…Many existing binary-to-binary SCA techniques [32,56,70] integrate advanced embedding-based approaches to detect code similarity between binaries and further identify the reused libraries based on the SCA database. Specifically, they leverage deep neural network models to embed binary functions into the representation of vectors and perform binary code clone detection by measuring the similarity between function embeddings [11,40,58,68]. Apart from basic syntactic features, these techniques typically capture semantic features such as the control flow graph (CFG) for each binary function to strengthen their accuracy of code clone detection and the downstream SCA task.…”

Section: Background and Motivation 2.1 Software Composition Analysis (mentioning)

confidence: 99%

“…In this way, their similarity can be calculated using their corresponding embeddings. Typical code representation learning allows only one single code format of the matched objects, i.e., either source-to-source [16,37,38,49,61] or binary-to-binary [28,35,39,58,67] code matching. However, for binary source code matching, C/C++ language features (e.g., function inlining [23]) and compiler optimization (e.g., code motion [30]) can lead to substantial differences between binary code and source code, and such disparity can be rather challenging when designing BinaryAI.…”

Section: Embedding-based Function Retrieval (mentioning)

confidence: 99%

“…To obtain a large number of matched binary-to-source function pairs as positive samples for training the model, we construct the automatic compilation pipeline based on official ArchLinux packages [5] and Arch User Repository (AUR) [6] following the insight from BinaryCorp in jTrans [58]. Specifically, we apply the command makepkg to compile all the ArchLinux packages and AUR automatically.…”

Section: Training Dataset (mentioning)

confidence: 99%


BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching

Jiang, An, Huang et al. 2024

Proceedings of the IEEE/ACM 46th International Conference on Software Engineering


While third-party libraries (TPLs) are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis (SCA), proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering since binary and source code exhibit substantial disparities after compilation. The existing binary-to-source SCA techniques leverage basic syntactic features that suffer from redundancy and lack robustness in the large-scale TPL dataset, leading to inevitable false positives and compromised recall. To mitigate these limitations, we introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features. First, BinaryAI trains a transformer-based model to produce function-level embeddings and obtain similar source functions for each binary function accordingly. Then by applying the link-time locality to facilitate function matching, BinaryAI detects the reused TPLs based on the ratio of matched source functions. Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA


“…CVSkSA first prunes the set of functions to be tested using the KNN model and then optimizes them in the function pre-filtering phase using the SVM model to improve the firmware vulnerability. Wang et al. [8] implemented a binary similarity detection tool called jTrans by embedding control information of binary code into a Transformer [9] model. The literature [10] uses a neural network translation model to learn the relationship between two architectures and maps the semantic information of the basic blocks of binary functions to a fixed dimensional vector, which in turn measures the similarity by the distance between the…”

Section: Related Work (mentioning)

confidence: 99%

Binary function similarity detection based on text semantics

Lü, Zhang et al. 2023

Third International Conference on Artificial Intelligence and Computer Engineering (ICAICE 2022)


Binary code similarity is the property that binaries compiled from the same source code under different compiler configurations are similar. Binary code similarity detection is often used to evaluate whether functions in two binaries are similar. This technique has critical applications in intellectual property protection and IoT security, such as code plagiarism detection, malware detection, and vulnerability detection. In this paper, we propose SBFS, a text semantics-based binary function similarity detection model, which first transforms binary functions into function texts by preprocessing assembly instructions; then learns semantic embedding vectors from the function texts using a natural language processing model; and finally measures the similarity between two functions by calculating the cosine distance between their embedding vectors. Experimental results show that the SBFS model can achieve cross-architecture detection and higher accuracy, reaching 98.2% in the binary function similarity detection task.
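The abstract above compares two functions by the cosine distance between their embedding vectors. As a minimal sketch of that final step only (not the SBFS implementation), the snippet below computes cosine similarity and distance in plain Python. The function names and the 4-dimensional embedding values are hypothetical and chosen purely for illustration; real embeddings would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical embeddings (assumed values, not model output):
# the same function compiled for two architectures, and an unrelated function.
emb_f_x86 = [0.9, 0.1, 0.3, 0.0]
emb_f_arm = [0.8, 0.2, 0.4, 0.1]
emb_g = [0.0, 0.9, 0.0, 0.7]

# A smaller distance means the two functions are judged more similar.
same_pair = cosine_distance(emb_f_x86, emb_f_arm)
different_pair = cosine_distance(emb_f_x86, emb_g)
print(same_pair < different_pair)
```

In a retrieval setting, a query function's embedding would be compared against every candidate embedding and the candidates ranked by ascending distance.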


“…Then, it uses embedding technology to transform the graph into vector representations and uses these vectors to train a classifier to detect vulnerabilities. Transformer models, like JTrans [35], have also been utilized in this area. JTrans incorporates control flow information into the Transformer model for binary code similarity detection.…”

Section: Related Work (mentioning)

confidence: 99%

SCGformer: Smart contract vulnerability detection based on control flow graph and transformer

Gong, Song, Wang et al. 2023

IET Blockchain


The security of smart contract has always been one of the significant problems in blockchain. As shown in previous studies, vulnerabilities in smart contracts can lead to unpredictable losses. With the rapid growth of the number of smart contracts, more and more data driven detection technologies based on machine learning have been proposed. However, some state‐of‐the‐art approaches mainly rely on the source code of smart contract. These methods are limited by the openness of the source code and the version of the programming language. To address this problem, we propose a novel vulnerability detection method based on transformer by constructing the control flow graph (CFG) of smart contracts operation codes (opcodes), which shields the difference of various versions of program language. Extensive experiments are conducted to evaluate the effectiveness of the proposed method on the authors' own collected dataset. The experimental results show that the proposed method achieves 94.36% accuracy in vulnerability detection, which performs better than other state‐of‐the‐art methods.
