CLCDSA: Cross Language Code Clone Detection using Syntactical Features and API Documentation

Nafi, Kawser Wazed; Kar, Tonny Shekha; Roy, Banani; Roy, Chanchal K.; Schneider, Kevin A.

doi:10.1109/ase.2019.00099

Cited by 78 publications

(34 citation statements)

References 41 publications


“…However, this way is limited for two reasons: (1) the library functions are OS dependent; (2) it fails to recognize the library calls that have different names yet with similar functionality (e.g., memcpy and memmove) [39]. To address the above problems, inspired by CLCDSA [81], the similarity of cross-OS library calls can be learned with the help of the documentation and Mikolov's Word2Vec [82] model.…”

Section: A Impacts Caused By Internal Reasons (mentioning)

confidence: 99%

Interpretation-Enabled Software Reuse Detection Based on a Multi-level Birthmark Model

Zheng

Yan

et al. 2021

2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)


Software reuse, especially partial reuse, poses legal and security threats to software development. Since its source codes are usually unavailable, software reuse is hard to be detected with interpretation. On the other hand, current approaches suffer from poor detection accuracy and efficiency, far from satisfying practical demands. To tackle these problems, in this paper, we propose ISRD, an interpretation-enabled software reuse detection approach based on a multi-level birthmark model that contains function level, basic block level, and instruction level. To overcome obfuscation caused by cross-compilation, we represent function semantics with Minimum Branch Path (MBP) and perform normalization to extract core semantics of instructions. For efficiently detecting reused functions, a process for "intent search based on anchor recognition" is designed to speed up reuse detection. It uses strict instruction match and identical library call invocation check to find anchor functions (in short anchors) and then traverses neighbors of the anchors to explore potentially matched function pairs. Extensive experiments based on two real-world binary datasets reveal that ISRD is interpretable, effective, and efficient, which achieves 97.2% precision and 94.8% recall. Moreover, it is resilient to cross-compilation, outperforming state-of-the-art approaches.


“…The approach is a semi-supervised machine learning model which is capable of detecting cross-language clones by employing a token level vector generation algorithm and tree-based skip-gram algorithm. This approach does not support more granular clone type classification (type 1, 2, 3, and 4) [25], which use action filters to filter out non-probable clones and make the model more scalable. This method has the limitation with respect to more granular clone classifications.…”

Section: Related Work (mentioning)

confidence: 99%

Detection and Classification of Cross-language Code Clone Types by Filtering the Nodes of ANTLR-generated Parse Tree

Ankali

Parthiban

2021

IJISA


A complete and accurate cross-language clone detection tool can support software forking process that reuses the more reliable algorithms of legacy systems from one language code base to other. Cross-language clone detection also helps in building code recommendation system. This paper proposes a new technique to detect and classify cross-language clones of C and C++ programs by filtering the nodes of ANTLR-generated parse tree using a common grammar file, CPP14.g4. Parsing the input files using CPP14.g4 provides all the lexical and semantic information of input source code. Selective filtering of nodes performs serialization of two parse trees. Vector representation using term frequency inverse document frequency (TF-IDF) of the resultant tree is given as an input to cosine similarity to classify the clone types. Filtered parse tree of C and C++ increases the precision from 51% to 61%, and matching based on renaming the input/output expressions provides average precision of 91.97% and 95.37% for small scale and large scale repositories respectively. The proposed cross-language clone detection exhibits the highest precision of 95.37% in finding all types of clones (1, 2, 3 and 4) for 16,032 semantically similar clone pairs of C and CPP codes.
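The TF-IDF and cosine-similarity step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token lists stand in for serialized, filtered parse-tree nodes and are invented for the example, the function names `tfidf_vectors` and `cosine` are hypothetical, and the smoothed IDF formula is an assumption borrowed from common practice.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1.
            tok: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical serialized node sequences for a C file, a C++ file, and an unrelated file.
c_tokens = ["function", "decl", "loop", "assign", "call", "return"]
cpp_tokens = ["function", "decl", "loop", "assign", "call", "return"]
other_tokens = ["class", "decl", "branch", "throw"]

vecs = tfidf_vectors([c_tokens, cpp_tokens, other_tokens])
sim_same = cosine(vecs[0], vecs[1])
sim_diff = cosine(vecs[0], vecs[2])
print(sim_same, sim_diff)
```

Identical node sequences score close to 1, while the unrelated sequence scores lower because it shares only one token. How scores map to clone types 1 to 4 is not specified in the abstract and is left out here.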


“…Then, the tree structures are converted into token sequences or vectors to improve the efficiency of similarity measure. In addition, Nafi et al [30] combine the approaches of AST and attribute counting to detect the similarity of cross-language source code. However, the intermediate representation based on trees cannot represent the logical structure of the source code completely, such as the loop structure.…”

Section: Cross-language Source Code Similarity Detection Through Tree-based Intermediate Representation (mentioning)

confidence: 99%

“…Nafi et al [29] propose CLCDSA, which selects nine measurement attributes and obtain feature measurement values by traversing the AST (abstract syntax tree). Flores et al [30] propose DeSoCoRe to extract code features by tri-gram model and weights word frequency based on normalized term frequency. The similarity between codes is calculated by cosine similarity.…”

Section: Code Similarity Detection Effectiveness Comparison (mentioning)

confidence: 99%
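The idea of obtaining feature values by traversing an AST and counting attributes can be illustrated with a small sketch. It is not CLCDSA's implementation: it uses Python's built-in `ast` module on a single language, and the five counted attributes in `COUNTED` are hypothetical stand-ins rather than the nine attributes the statement refers to.

```python
import ast

# Hypothetical attribute set (an assumption for illustration only).
COUNTED = {
    "loops": (ast.For, ast.While),
    "branches": (ast.If,),
    "calls": (ast.Call,),
    "assignments": (ast.Assign, ast.AugAssign),
    "returns": (ast.Return,),
}


def feature_vector(source):
    """Walk the AST of `source` and count nodes for each attribute in COUNTED."""
    tree = ast.parse(source)
    counts = dict.fromkeys(COUNTED, 0)
    for node in ast.walk(tree):
        for name, node_types in COUNTED.items():
            if isinstance(node, node_types):
                counts[name] += 1
    # Keep the order of COUNTED so vectors are comparable across fragments.
    return [counts[name] for name in COUNTED]


snippet_a = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
snippet_b = (
    "def total(values):\n"
    "    acc = 0\n"
    "    i = 0\n"
    "    while i < len(values):\n"
    "        acc += values[i]\n"
    "        i += 1\n"
    "    return acc\n"
)

print(feature_vector(snippet_a))  # [1, 0, 0, 2, 1]
print(feature_vector(snippet_b))  # [1, 0, 1, 4, 1]
```

Two fragments that compute the same sum in different styles produce nearby count vectors, which can then be compared with a similarity measure such as cosine similarity. A cross-language setting would need one parser per language feeding the same attribute set.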


Flowchart-Based Cross-Language Source Code Similarity Detection

Feng

Liu

et al. 2020

Scientific Programming


Source code similarity detection has various applications in code plagiarism detection and software intellectual property protection. In computer programming teaching, students may convert the source code written in one programming language into another language for their code assignment submission. Existing similarity measures of source code written in the same language are not applicable for the cross-language code similarity detection because of syntactic differences among different programming languages. Meanwhile, existing cross-language source similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing equivalent control structure and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFC), and their similarity is obtained by measuring their corresponding SCFC. More specifically, we first introduce the standardized code flowchart (SCFC) model to be the uniform flowcharts representation of source code written in different languages. SCFC is language-independent, and therefore, it can be used as the intermediate structure for source code similarity detection. Meanwhile, transformation techniques are given to transform source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm based on the shortest path graph kernel to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between SCFCs. Experimental results show that compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection. 
Furthermore, CLCSD can not only handle common source code obfuscation techniques used by students in computer programming teaching but also obtain nearly 90% accuracy in dealing with some complex obfuscation techniques.
