2009 IEEE 17th International Conference on Program Comprehension 2009
DOI: 10.1109/icpc.2009.5090050

Syntax tree fingerprinting for source code similarity detection

Abstract: Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abs…
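
As a rough illustration of the idea the abstract names (hashing AST subtrees and indexing the hashes so that matching subtrees can be looked up), here is a minimal sketch. It is not the paper's framework: the use of Python's `ast` module, SHA-1 as the hash, an in-memory dictionary standing in for the database, and the helper names `fingerprint`, `index_sources`, and `shared_fingerprints` are all assumptions made for this example. Only node types and tree shape are hashed, so identifier names and literal values are abstracted away.

```python
import ast
import hashlib
from collections import defaultdict


def fingerprint(node, index, origin):
    """Hash the subtree rooted at `node` and record it in `index`.

    Assumption of this sketch: only node type names and child structure
    are hashed, so renamed identifiers and changed literals still match.
    """
    child_hashes = [fingerprint(child, index, origin)
                    for child in ast.iter_child_nodes(node)]
    payload = type(node).__name__ + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    # Not every node carries a line number, hence the None default.
    index[digest].append((origin, getattr(node, "lineno", None)))
    return digest


def index_sources(sources):
    """Build a fingerprint -> [(origin, line)] index (hypothetical stand-in
    for a database) and return it with each origin's root fingerprint."""
    index = defaultdict(list)
    roots = {}
    for origin, code in sources.items():
        roots[origin] = fingerprint(ast.parse(code), index, origin)
    return index, roots


def shared_fingerprints(index):
    """Return fingerprints that occur in more than one origin."""
    return {digest: hits for digest, hits in index.items()
            if len({origin for origin, _ in hits}) > 1}


sources = {
    "a.py": "def f(x):\n    return x + 1\n",
    "b.py": "def g(y):\n    return y + 2\n",   # same shape, renamed
    "c.py": "def h(z):\n    return z * 2\n",   # different operator
}
index, roots = index_sources(sources)
shared = shared_fingerprints(index)
```

In this toy input, `a.py` and `b.py` differ only in names and a literal, so their whole-module fingerprints coincide, while `c.py` shares only smaller subtrees with them. A real system would also need a minimum subtree size and a collision-handling strategy, which this sketch leaves out.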

Cited by 70 publications (36 citation statements)
References: 22 publications
“…Abstract syntax tree representations could allow more sophisticate patterns of pre-processing of the representation for better abstraction and normalization of the code, a topic that has been neglected in this article. We are investigating some new techniques in this way [21,32,23] that could also consider the function call graphs of the projects. for the computed similarity metrics between the original project and the obfuscated versions.…”
Section: Results (citation type: mentioning)
confidence: 99%
“…It explains the choice of a suffix array as an indexing structure rather than a suffix tree. As introduced by [18], some tools, like CCFinderX [19] or Phoenix [20], have successfully used suffix indexation structures to find duplication in source code using a tokenized form or sibling abstracted syntax sub-trees [21].…”
Section: Studying the Factorized Graph Nodes And Its Inferred Metrics (citation type: mentioning)
confidence: 99%
“…Each source code is converted into a parse tree and its contents are translated into token sequence by applying inorder traversal. Chilowicz et al [40] also incorporates parse-tree approach. Yet, their work generates token sequence based on fingerprinting mechanism instead of inorder traversal.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Several works incorporate additional preprocessing to generate more declarative lexical token sequence [38,39,40]. Chilowics et al [38] incorporates function factorization when generating lexical token sequence.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Traditional machine learning approaches largely depend on human feature engineering, e.g., [17] for bug detection, [18] for clone detection. Such feature engineering is labelconsuming and ad hoc to a specific task.…”
Section: Motivation a From Machine Learning To Deep Learning (citation type: mentioning)
confidence: 99%