Statistical similarity of binaries

David, Yaniv; Partush, Nimrod; Yahav, Eran

doi:10.1145/2908080.2908126

Cited by 103 publications

(105 citation statements)

References 16 publications


“…Single Platform solutions -Regarding the literature of binary-similarity for a single platform, a family of works is based on matching algorithms for function CFGs. In Bindiff [13] matching among vertices is based on the syntax of code, and it is known to perform poorly across different compiler (see [9]). Pewny et al [24] proposed a solution where each vertex of a CFG is represented with an expression tree; similarity among vertices is computed by using the edit distance between the corresponding expression trees.…”

Section: Work Not Based On Embeddings (mentioning)

confidence: 99%

“…David and Yahav [11] proposed to represent a function as several independent execution traces, called tracelets; similar tracelets are then matched by using a custom edit-distance. A related concept is used by David et al in [9] where functions are divided in pieces of independent code, called strands. The matching between functions is based on how many statistically significant strands are similar.…”

Section: Work Not Based On Embeddings (mentioning)

confidence: 99%
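
As a rough illustration of the strand-based matching described in the statement above, the sketch below scores a query function against a target by summing weights over the strands they share, with rarer strands counting more. The function names, the string representation of strands, and the inverse-frequency weighting are all assumptions made for illustration; the weighting merely stands in for a statistical significance measure and is not the procedure from the cited paper.

```python
import math
from collections import Counter

def strand_weight(strand, corpus_counts, corpus_size):
    # Hypothetical weighting: strands seen in few corpus functions score higher.
    # The +1 smoothing keeps the ratio finite for unseen strands.
    return math.log((corpus_size + 1) / (corpus_counts.get(strand, 0) + 1))

def function_similarity(query, target, corpus):
    # query, target: sets of strands; corpus: list of strand sets, one per function.
    counts = Counter(s for func in corpus for s in func)
    shared = query & target
    return sum(strand_weight(s, counts, len(corpus)) for s in shared)

# Toy corpus: each function is a set of (hypothetical) canonicalized strands.
corpus = [
    {"mov;ret", "add;cmp", "xor;shl"},
    {"mov;ret", "add;cmp"},
    {"mov;ret", "mul;sub"},
]
query = {"mov;ret", "xor;shl"}
score_a = function_similarity(query, corpus[0], corpus)  # shares a rare strand
score_b = function_similarity(query, corpus[1], corpus)  # shares only a common strand
```

In this toy setup the strand that appears in every function contributes nothing, so the first target outranks the second even though both share a strand with the query.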


SAFE: Self-Attentive Function Embeddings for Binary Similarity

Massarelli, Luna, Petroni, et al. 2019

Lecture Notes in Computer Science



The binary similarity problem consists of determining whether two functions are similar by considering only their compiled form. Advanced techniques for binary similarity have recently gained momentum because they can be applied in several fields, such as copyright disputes, malware analysis, and vulnerability detection, and thus have an immediate practical impact. Current solutions compare functions by first transforming their binary code into multi-dimensional vector representations (embeddings), and then comparing the vectors through simple and efficient geometric operations. However, embeddings are usually derived from binary code using manual feature extraction, which may fail to consider important function characteristics, or may consider features that are not important for the binary similarity problem. In this paper we propose SAFE, a novel architecture for the embedding of functions based on a self-attentive neural network. SAFE works directly on disassembled binary functions, does not require manual feature extraction, is computationally more efficient than existing solutions (i.e., it does not incur the computational overhead of building or manipulating control flow graphs), and is more general, as it works on stripped binaries and on multiple architectures. We report the results of a quantitative and qualitative analysis that show how SAFE provides a noticeable performance improvement over previous solutions. Furthermore, we show how clusters of our embedding vectors are closely related to the semantics of the implemented algorithms, paving the way for further interesting applications (e.g., semantic-based binary function search). arXiv:1811.05296v3 [cs.CR]
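
The abstract above mentions comparing embedding vectors "through simple and efficient geometric operations". One common operation of this kind is cosine similarity; the sketch below assumes it purely for illustration, and the three embedding vectors are invented values, not outputs of SAFE.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: the same function built by two compilers,
# plus an unrelated function.
emb_gcc = [0.9, 0.1, 0.3]
emb_clang = [0.8, 0.2, 0.35]
emb_other = [-0.2, 0.9, -0.4]

same_pair = cosine_similarity(emb_gcc, emb_clang)
diff_pair = cosine_similarity(emb_gcc, emb_other)
```

Under these made-up values, the pair representing the same function scores higher than the unrelated pair, which is the property an embedding-based approach relies on.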


“…First, static plagiarism detection or clone detection includes string-based [2], [5], [15], AST-based [32], [57], [63], [36], token-based [33], [55], [54], and PDG-based [22], [40], [11], [39]. Source code-based approaches are Recent works have applied traditional approaches to addressing the cross-architecture scenario [53], [19], [8], [20], [13], [14], [12]. Multi-MH and Multi-k-MH [53] are the first two methods for comparing functions of different ISAs.…”

Section: Related Work (mentioning)

confidence: 99%

“…discovRE [19] boosts CFG-based matching process, but is still expensive. Both Esh [12] and its successor [13] use dataflow slices of basic blocks as the basic comparable unit. Esh uses SMT solver to verify function similarity, which makes it unscalable.…”

Section: Related Work (mentioning)

confidence: 99%

A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

Redmond, Luo, Zeng 2019

Proceedings 2019 Workshop on Binary Analysis Research


Given a closed-source program, such as most proprietary software and viruses, binary code analysis is indispensable for many tasks, such as code plagiarism detection and malware analysis. Today, source code is very often compiled for various architectures, making cross-architecture binary code analysis increasingly important. A binary, after being disassembled, is expressed in an assembly language. Thus, recent work has started exploring Natural Language Processing (NLP) inspired binary code analysis. In NLP, words are usually represented as high-dimensional vectors (i.e., embeddings) to facilitate further processing, which is one of the most common and critical steps in many NLP tasks. We regard instructions as words in NLP-inspired binary code analysis, and aim to represent instructions as embeddings as well. To facilitate cross-architecture binary code analysis, our goal is that similar instructions, regardless of their architectures, have embeddings close to each other. To this end, we propose a joint learning approach to generating instruction embeddings that capture not only the semantics of instructions within an architecture, but also their semantic relationships across architectures. To the best of our knowledge, this is the first work on building a cross-architecture instruction embedding model. As a showcase, we apply the model to resolving one of the most fundamental problems for binary code similarity comparison, namely semantics-based basic block comparison, and the solution outperforms the code-statistics-based approach. This demonstrates that it is promising to apply the model to other cross-architecture binary code analysis tasks.


“…Algorithm 1 presents the pseudo-code of instrumentation. BINMATCH traverses each instruction (I) of F. If I accesses global variables, performs comparison operations, or calls a standard library function, BINMATCH injects code before I to capture corresponding features and generate the signature of F (Line 4-9).…”

Algorithm 1 (fragment captured in the excerpt):

        Ir ← record_oprd_val (Ir)
     8  if I calls a standard library function then
     9      Ir ← record_libc_name (Ir)
    10  // record runtime information
    11  if I reads an argument of the function then
    12      Ir ← record_arg_val (Ir)
    13  else if I calls a function indirectly then
    14      Ir ← record_func_addr (Ir)
    15  else if a function returns then
    16      Ir ← record_ret_val (Ir)
    17  return Ir

Section: B Instrumentation and Execution (mentioning)

confidence: 99%
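
The instrumentation logic quoted above can be approximated as a per-instruction dispatch. The sketch below is a loose Python reading of that excerpt, not BinMatch's implementation: the hook names come from the excerpt, but the dictionary-based instruction format, the boolean flags, and the exact condition that triggers `record_oprd_val` (that line's guard is cut off in the excerpt) are assumptions.

```python
def instrument(instructions):
    # For each instruction, list the recording hooks to inject before it.
    plan = []
    for ins in instructions:
        hooks = []
        # Semantic features (assumed guard: global access or comparison).
        if ins.get("accesses_global") or ins.get("is_comparison"):
            hooks.append("record_oprd_val")
        if ins.get("calls_libc"):
            hooks.append("record_libc_name")
        # Runtime information: at most one of these applies per instruction.
        if ins.get("reads_argument"):
            hooks.append("record_arg_val")
        elif ins.get("indirect_call"):
            hooks.append("record_func_addr")
        elif ins.get("is_return"):
            hooks.append("record_ret_val")
        plan.append((ins["text"], hooks))
    return plan

# Hypothetical, pre-classified instructions of a function F.
example = [
    {"text": "cmp eax, 0", "is_comparison": True},
    {"text": "call strlen", "calls_libc": True},
    {"text": "ret", "is_return": True},
]
plan = instrument(example)
```

Returning a plan rather than rewriting instructions keeps the sketch independent of any particular instrumentation framework.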

BinMatch: A Semantics-Based Hybrid Approach on Binary Code Clone Analysis

Zhang et al. 2018

2018 IEEE International Conference on Software Maintenance and Evolution (ICSME)


Binary code clone analysis is an important technique with a wide range of applications in software engineering (e.g., plagiarism detection, bug detection). The main challenge of the topic lies in semantics-equivalent code transformations (e.g., optimization, obfuscation), which can alter the representation of binary code tremendously. Another challenge is the trade-off between detection accuracy and coverage. Unfortunately, existing techniques still rely on semantics-less code features, which are susceptible to such code transformations. Besides, they adopt either a static or a dynamic approach to detect binary code clones, and so cannot achieve high accuracy and coverage simultaneously. In this paper, we propose a semantics-based hybrid approach to detect binary clone functions. We execute a template binary function with its test cases, and emulate the execution of every target function for clone comparison with the runtime information migrated from that template function. The semantic signatures are extracted during the execution of the template function and the emulation of the target function. Lastly, a similarity score is calculated from their signatures to measure their likeness. We implement the approach in a prototype system designated as BINMATCH, which analyzes IA-32 binary code on the Linux platform. We evaluate BINMATCH with eight real-world projects compiled with different compilation configurations and commonly used obfuscation methods, performing over 100 million pairwise function comparisons in total. The experimental results show that BINMATCH is robust to semantics-equivalent code transformations. Besides, it not only covers all target functions for clone analysis, but also improves the detection accuracy compared to state-of-the-art solutions.
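
The abstract above says that a similarity score is computed from the two signatures but does not specify the measure. As one plausible sketch, assuming each signature is an ordered sequence of recorded values, the code below scores two signatures by the length of their longest common subsequence, normalized in a Jaccard-like way. Both the sequence representation and the measure are assumptions for illustration.

```python
def lcs_length(a, b):
    # Length of the longest common subsequence, using a rolling DP row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def signature_similarity(sig_a, sig_b):
    # Jaccard-style ratio: common elements over the combined length.
    if not sig_a and not sig_b:
        return 1.0  # convention chosen here for two empty signatures
    common = lcs_length(sig_a, sig_b)
    return common / (len(sig_a) + len(sig_b) - common)

# Hypothetical signatures recorded from a template and a target function.
template_sig = ["strlen", 0, 16, "memcpy"]
target_sig = ["strlen", 16, "memcpy", 1]
score = signature_similarity(template_sig, target_sig)
```

A subsequence-based measure tolerates extra or missing entries in either signature while still rewarding values that appear in the same order.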

