2012
DOI: 10.1002/spe.2146

Comparing techniques for authorship attribution of source code

Abstract: Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non‐natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n‐grams) or software metrics; and the classification technique that exploits those features, either information re…

Citations: cited by 55 publications (43 citation statements)
References: 29 publications
“…They believed executable code, even if optimized, still contains many features such as data structures and algorithms, compiler and system information and so on that may help to identify the author. In a recent survey study [6], previous attempts at attributing authorship of normal source code are categorized by two attributes: the source code metrics used for the classification, either strings of n tokens/bytes (n-grams) or object-oriented metrics such as number of classes, interfaces, etc.; and the classification technique that exploits those features, either information retrieval ranking or machine learning.…”
Section: Other Methods Of Protecting Binaries (mentioning)
confidence: 99%
“…Another proposed method concerns the extraction of token ngrams from code, where each token may refer to an operator, a keyword, a function etc. [3].…”
Section: Relevant Work (mentioning)
confidence: 99%
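
As a rough illustration of the token n-gram features this statement refers to, the following minimal sketch counts overlapping runs of n tokens in a code fragment. The tokenizer regex and the name `token_ngrams` are assumptions made for this example; they are not taken from the cited papers, which may tokenize source code quite differently.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple tokenizer (an assumption for this sketch):
# identifiers/keywords, integer literals, and any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def token_ngrams(source, n=3):
    """Count every overlapping run of n consecutive tokens in `source`."""
    tokens = TOKEN_RE.findall(source)
    # range() is empty when there are fewer than n tokens, giving an empty Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Usage: a bigram profile of a short C-like fragment.
profile = token_ngrams("for (i = 0; i < n; i++) total += i;", n=2)
```

Each key of `profile` is a tuple of n tokens and each value is how often that run occurs, which is the kind of feature vector an attribution method could then compare across authors.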
“…Source code author identification can be seen as a text classification task given that samples of known authorship by a set of candidate authors are available [2]. Burrows [3] presents an excellent review of software forensics applications associated with this task. These include assisting the revealing of academic dishonesty cases, resolving disputes and litigation about source code samples, tracing the authors of malicious software, and assisting the maintenance of large software projects by assigning source code samples to contributors.…”
Section: Introduction (mentioning)
confidence: 99%
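
To make the "classification given samples of known authorship" framing concrete, here is a minimal sketch that ranks candidate authors by how similar their known samples are to a disputed sample. It uses cosine similarity over character n-gram counts as a simple stand-in; this is an assumption chosen for brevity, not the ranking or learning method of any paper quoted here, and the names `char_ngrams`, `cosine`, and `rank_authors` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_authors(known, unknown, n=3):
    """Return (author, score) pairs, most similar first.

    `known` maps each candidate author to a list of their source samples;
    `unknown` is the sample whose author is in question.
    """
    query = char_ngrams(unknown, n)
    scores = {
        author: cosine(char_ngrams("\n".join(samples), n), query)
        for author, samples in known.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Usage with two made-up candidates and toy samples.
known = {
    "alice": ["int main(){return 0;}"],
    "bob": ["def f():\n    pass"],
}
ranking = rank_authors(known, "int main(){return 1;}")
```

The top entry of `ranking` is the candidate whose profile is closest to the disputed sample. A realistic setup would need far more code per author and a held-out evaluation.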