Comparing techniques for authorship attribution of source code

Burrows, Steven; Uitdenbogerd, Alexandra L.; Turpin, Andrew

doi:10.1002/spe.2146

Cited by 55 publications

(43 citation statements)

References 29 publications


“…They believed executable code, even if optimized, still contains many features such as data structures and algorithms, compiler and system information and so on that may help to identify the author. In a recent survey study [6], previous attempts at attributing authorship of normal source code are categorized by two attributes: the source code metrics used for the classification, either strings of n tokens/bytes (n-grams) or object-oriented metrics such as number of classes, interfaces, etc.; and the classification technique that exploits those features, either information retrieval ranking or machine learning.…”

Section: Other Methods Of Protecting Binaries (mentioning)

confidence: 99%

An attempt toward Authorship Analysis of Obfuscated .NET Binaries

Morovati

2017

IJCSDF


This research is an attempt toward facilitating the authorship attribution of an unknown .NET executable by identifying obfuscation resistant features of .NET binaries. The primary goal of this study is to examine the effectiveness of obfuscation techniques for hiding the author's programming style. In this research, I have tested features such as op-code frequencies, op-code n-grams, API function calls and some features obtained from program Control Flow Graph.



“…Another proposed method concerns the extraction of token ngrams from code, where each token may refer to an operator, a keyword, a function etc. [3].…”

Section: Relevant Work (mentioning)

confidence: 99%
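The token n-gram idea quoted above can be illustrated with a minimal sketch. The regex tokenizer `TOKEN_RE` and the helper `token_ngrams` are hypothetical names chosen for this example, not taken from any of the cited papers; a real system would use a language-aware lexer.

```python
import re
from collections import Counter

# Assumption: a crude tokenizer that splits code into identifiers/keywords,
# integer literals, and single non-whitespace characters (operators, brackets).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def token_ngrams(code: str, n: int = 3) -> Counter:
    """Count every run of n consecutive tokens in a code sample."""
    tokens = TOKEN_RE.findall(code)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Illustrative input only.
sample = "for (i = 0; i < n; i++) sum += a[i];"
profile = token_ngrams(sample, n=2)
for gram, count in profile.most_common(3):
    print(gram, count)
```

The resulting counts can serve as the feature vector that a ranking or classification method then consumes.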

“…Source code author identification can be seen as a text classification task given that samples of known authorship by a set of candidate authors are available [2]. Burrows [3] presents an excellent review of software forensics applications associated with this task. These include assisting the revealing of academic dishonesty cases, resolving disputes and litigation about source code samples, tracing the authors of malicious software, and assisting the maintenance of large software projects by assigning source code samples to contributors.…”

Section: Introduction (mentioning)

confidence: 99%

“…Some very promising results have been reported when dealing with short samples of code, multiple candidate authors and several programming languages, including C, C++, Java, and Lisp [2,3,7]. To evaluate the proposed models, the published studies use custom-built source code collections that include either balanced training sets (where the training samples are equally distributed over the candidate authors) or imbalanced (skewed) training sets (where some candidate authors are overrepresented or under-represented in the training samples).…”

Section: Introduction (mentioning)

confidence: 99%

“…Therefore, a more accurate definition of a skewed training set should also account for the lines of code (or KBs) of training samples per author. Existing studies avoid focusing on the class imbalance problem and its effect on the performance of source code author identification methods [3,4,7].…”

Section: Introduction (mentioning)

confidence: 99%


Author Identification in Imbalanced Sets of Source Code Samples

Chatzicharalampous, Frantzeskou, Stamatatos

2012

2012 IEEE 24th International Conference on Tools With Artificial Intelligence


Abstract: Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches avoids focusing on the class imbalance problem and its effect on the performance. In this paper, we present a systematic experimental study of author identification in skewed training sets where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined, one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets while the instance-based method is a better choice in balanced or slightly-skewed training sets.
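The difference between the two paradigms named in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the method evaluated in the paper: character 3-grams, cosine similarity, and nearest-neighbour assignment are stand-ins chosen for brevity, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in one code sample."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def profile_based(train: dict, unknown: str, n: int = 3) -> str:
    """One merged representation per author; pick the most similar profile."""
    query = char_ngrams(unknown, n)
    profiles = {author: sum((char_ngrams(s, n) for s in samples), Counter())
                for author, samples in train.items()}
    return max(profiles, key=lambda author: cosine(profiles[author], query))

def instance_based(train: dict, unknown: str, n: int = 3) -> str:
    """One representation per sample; pick the author of the nearest sample."""
    query = char_ngrams(unknown, n)
    scored = [(cosine(char_ngrams(s, n), query), author)
              for author, samples in train.items() for s in samples]
    return max(scored)[1]

# Toy, deliberately imbalanced training set (illustrative data only).
train = {
    "alice": ["for(int i=0;i<n;i++){x+=i;}", "for(int j=0;j<m;j++){y+=j;}"],
    "bob": ["for (int i = 0; i < n; i++) {\n    x += i;\n}"],
}
unknown = "for(int k=0;k<n;k++){z+=k;}"
print(profile_based(train, unknown), instance_based(train, unknown))
```

In the profile-based function an author with few samples still gets a single merged profile, whereas the instance-based function compares the unknown sample against each training sample individually, so the number of samples per author directly affects the candidates it sees.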


De‐anonymizing Ethereum blockchain smart contracts through code attribution

2020


Summary: Blockchain users are identified by addresses (public keys), which cannot be easily linked back to them without out‐of‐network information. This provides pseudo‐anonymity, which is amplified when the user generates a new address for each transaction. Since all transaction history is visible to all users in public blockchains, finding affiliation between related addresses undermines pseudo‐anonymity. Such affiliation information can be used to discriminate against addresses linked with undesired activities or can lead to de‐anonymization if out‐of‐network information becomes available. In this work, we propose an approach to undermine pseudo‐anonymity of blockchain transactions by linking together addresses that were used to deploy smart contracts, which were produced by the same authors. In our approach, we leverage stylometry techniques, widely used in the social science field for attribution of literary texts to their corresponding authors. The assumption underlying authorship attribution is the existence of a distinctive writing style, unique to an author and easily distinguishable from others. Drawing an analogy between literary text and smart contracts' source code, we explore the extent to which unique features of source code and byte code of Ethereum smart contracts can represent the coding style of smart contract developers. We show that even a small number of representative features leads to a sufficiently high accuracy in attributing smart contracts' code to its deployer's address. We further validate our approach on real‐world scammers' data and Ponzi scheme‐related contracts. Additionally, we provide an algorithm to extract distinctly contributing features per an entire dataset or per specific authors. We use this algorithm to extract and explore such features in our dataset and in the Ponzi scheme‐related dataset.

