Automatic cross-platform identification of function clones aims to determine whether two functions are identical without access to the source code, which is a fundamental challenge in vulnerability search, code plagiarism detection, and malware classification. With the rapid development of deep neural network in pro-

KEYWORDS: attention mechanism, binary similarity, graph neural network, program analysis
INTRODUCTION

Binary code similarity detection is a key foundation of software security, with applications such as searching for critical vulnerabilities [1] and detecting malware [2]. With billions of Internet of Things (IoT) devices now deployed, similarity analysis of binary code has become more important than ever, because (1) code reuse is common in IoT, which spreads a single source-level vulnerability across thousands or more embedded devices with diverse hardware architectures and software platforms [3], and (2) a known vulnerability that has been exploited in the past can still cause unprecedented damage to daily life and significant revenue losses [4]. Accurate and efficient analysis of stripped binary code is difficult because debug information, such as symbol names, types, and locations, is unavailable in commercial off-the-shelf (COTS) binaries. Moreover, compared with single-platform binary code similarity detection, cross-platform function identification must also cope with the binary code differences introduced by different architectures, optimization levels, and compilers, which makes it even more challenging.

Traditional binary code similarity detection approaches [5] adopt symbolic execution, constraint solving, or theorem proving to check binary code equivalence directly. However, these methods are inefficient, especially when the number of function pairs is large, which makes them difficult to use in large-scale applications. Graph matching-based methods use program flow diagrams to represent functions and detect similar functions [6-11]. These methods are usually inaccurate, or their detection performance is limited by the efficiency of graph matching algorithms [1,12]. For instance, Genius [10] takes more than 1 week to construct its codebook for only three software packages, and its time complexity is quadratic in the number of training samples and linear in the cost of the bipartite graph matching algorithm.

Recently, with the rapid development of deep learning and graph neural networks (GNNs) [13,14], graph matching and classification problems such as social network graph analytics [15] and chemical formula matching [13] can be solved well by transforming a representative graph to an embedding.
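To make the idea of transforming a graph to an embedding concrete, the following is a minimal sketch and not the method of any cited work. It assumes each node (for example, a basic block) carries a small numeric feature vector, treats edges as undirected, and uses plain neighbour averaging with no learned weights, whereas a real GNN would train its aggregation parameters. The names `embed_graph` and `cosine_similarity` and the feature values are hypothetical.

```python
import math


def embed_graph(features, edges, rounds=2):
    """Map a graph to one vector by neighbour averaging, then summing nodes.

    features: list of per-node feature vectors (hypothetical block features).
    edges: list of (u, v) node-index pairs, treated as undirected here.
    """
    n = len(features)
    neighbours = [[] for _ in range(n)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    state = [list(f) for f in features]
    for _ in range(rounds):
        new_state = []
        for node in range(n):
            # Average the node's own vector with those of its neighbours.
            group = [state[node]] + [state[m] for m in neighbours[node]]
            new_state.append([sum(col) / len(group) for col in zip(*group)])
        state = new_state

    # Summing over nodes makes the result independent of node ordering.
    return [sum(col) for col in zip(*state)]


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Two three-node chains that differ only in node numbering, plus an unrelated graph.
g1 = embed_graph([[3, 1], [5, 0], [2, 2]], [(0, 1), (1, 2)])
g2 = embed_graph([[2, 2], [5, 0], [3, 1]], [(2, 1), (1, 0)])
g3 = embed_graph([[1, 9], [0, 7]], [(0, 1)])
print(cosine_similarity(g1, g2), cosine_similarity(g1, g3))
```

Once every function is reduced to a fixed-length vector, comparing a pair costs one vector operation instead of a run of a graph matching algorithm, which is the efficiency argument behind embedding-based approaches.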