Deep learning code fragments for code clone detection

White, Martin; Tufano, Michele; Vendome, Christopher; Poshyvanyk, Denys

doi:10.1145/2970276.2970326

Cited by 502 publications

(334 citation statements)

References 60 publications

Supporting

Mentioning

331

Contrasting

Unclassified

“…The key challenge is to accurately represent the structure of code changes, which are not contiguous text like the commit message, but rather amount to scattered fragments of removed and added code across multiple files, within multiple hunks. Thus, different from existing deep learning techniques working on source code [24], [36], [66], [68], PatchNet constructs separate embedding vectors representing the removed code and the added code in each hunk of each affected file in the given patch. The information about a file's hunks are then concatenated to build an embedding vector for the affected file.…”

Section: Introduction (mentioning)

confidence: 99%

PatchNet: Hierarchical Deep Learning-Based Stable Patch Identification for the Linux Kernel

Hoang

Lawall

Tian

et al. 2021

IEEE Trans. Software Eng.

Linux kernel stable versions serve the needs of users who value stability of the kernel over new features. The quality of such stable versions depends on the initiative of kernel developers and maintainers to propagate bug fixing patches to the stable versions. Thus, it is desirable to consider to what extent this process can be automated. A previous approach relies on words from commit messages and a small set of manually constructed code features. This approach, however, shows only moderate accuracy. In this paper, we investigate whether deep learning can provide a more accurate solution. We propose PatchNet, a hierarchical deep learning-based approach capable of automatically extracting features from commit messages and commit code and using them to identify stable patches. PatchNet contains a deep hierarchical structure that mirrors the hierarchical and sequential structure of commit code, making it distinctive from the existing deep learning models on source code. Experiments on 82,403 recent Linux patches confirm the superiority of PatchNet against various state-of-the-art baselines, including the one recently-adopted by Linux kernel maintainers.
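
The hierarchy described in the citation statement above (separate vectors for the removed and added code of each hunk, concatenated into one vector per file) can be sketched as follows. This is only a minimal illustration under stated assumptions, not PatchNet's actual model: the hashed bag-of-tokens `embed_lines` stands in for a learned encoder, and the names `DIM`, `Hunk`, `embed_hunk`, and `embed_file` are hypothetical.

```python
import zlib
from typing import List, Tuple

DIM = 8  # hypothetical embedding width, chosen only for illustration

# A hunk is modelled as (removed lines, added lines); this type is an assumption.
Hunk = Tuple[List[str], List[str]]


def embed_lines(lines: List[str], dim: int = DIM) -> List[float]:
    """Toy stand-in for a learned encoder: hashed bag-of-tokens counts."""
    vec = [0.0] * dim
    for line in lines:
        for tok in line.split():
            # crc32 is used instead of hash() so results are stable across runs.
            vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def embed_hunk(hunk: Hunk) -> List[float]:
    """Embed removed and added code separately, then join the two vectors."""
    removed, added = hunk
    return embed_lines(removed) + embed_lines(added)


def embed_file(hunks: List[Hunk]) -> List[float]:
    """Concatenate the hunk vectors to form one vector for the affected file."""
    vec: List[float] = []
    for hunk in hunks:
        vec.extend(embed_hunk(hunk))
    return vec


hunks = [
    (["return x;"], ["return x + 1;"]),
    (["free(p);"], ["kfree(p);", "p = NULL;"]),
]
file_vec = embed_file(hunks)
```

In this sketch the file vector grows with the number of hunks; a trained model would typically need padding or pooling to obtain a fixed size, a detail the quoted statement does not specify.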

“…As our approach based on code For clone detection, many techniques in the literature generally begin by generating some intermediate representations for code before measuring similarity. According to source code representation, these techniques can be classified as text-based (e.g., [38]-[40]), token-based (e.g., [41]-[43]), tree-based (e.g., [24], [44], [45]), graph-based (e.g., [46]-[49]), semantic-based (e.g., [50]-[53]), deep-learning-based (e.g., [35], [54]), or a mixture. Our approach complements those studies by applying word embedding to smart contract code and its syntax structures to search for smart contracts of various levels of granularity.…”

Section: Clone Detection Bug Detection and Code Validation (mentioning)

confidence: 99%

Checking Smart Contracts With Structural Code Embedding

Gao

Jiang

Xia

et al. 2021

IEEE Trans. Software Eng.

Smart contracts have been increasingly used together with blockchains to automate financial and business transactions. However, many bugs and vulnerabilities have been identified in many contracts which raises serious concerns about smart contract security, not to mention that the blockchain systems on which the smart contracts are built can be buggy. Thus, there is a significant need to better maintain smart contract code and ensure its high reliability. In this paper, we propose an automated approach to learn characteristics of smart contracts in Solidity, which is useful for clone detection, bug detection and contract validation on smart contracts. Our new approach is based on word embeddings and vector space comparison. We parse smart contract code into word streams with code structural information, convert code elements (e.g., statements, functions) into numerical vectors that are supposed to encode the code syntax and semantics, and compare the similarities among the vectors encoding code and known bugs, to identify potential issues. We have implemented the approach in a prototype, named SMARTEMBED, and evaluated it with more than 22,000 smart contracts collected from the Ethereum blockchain. Results show that our tool can effectively identify many repetitive instances of Solidity code, where the clone ratio is around 90%. Code clones such as type-III or even type-IV semantic clones can also be detected accurately. Our tool can identify more than 1000 clone related bugs based on our bug databases efficiently and accurately. Our tool can also help to efficiently validate any given smart contract against a known set of bugs, which can help to improve the users' confidence in the reliability of the contract.

“…In this paper, we build FA-AST for Java programs and evaluate FA-AST and graph neural networks on two code clone datasets: Google Code Jam dataset collected by [6] and the widely used clone detection benchmark BigCloneBench [9]. The results show that our approach outperforms most existing clone detection approaches, especially several AST-based deep learning approaches including RtvNN [2], CDLH [3] and ASTNN [4].…”

Section: Introduction (mentioning)

confidence: 96%

“…Most of these approaches include two steps: use neural networks to calculate a vector representation for each code fragment, then calculate the similarity between two code vector representations to detect clones. To leverage the explicit structural information in programs, these approaches often use abstract syntax tree (AST) as the input of their models [2]-[4]. A typical example of these approaches is CDLH [3], which encode code fragments by directly applying Tree-LSTM [5] on binarized ASTs.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree

Wang

et al. 2020

2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)

Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still can not fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we have concerned, we are the first to apply graph neural networks on the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
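
The two-step scheme mentioned in the citation statements above (compute a vector for each code fragment, then compare the two vectors) might look like the sketch below. It is a hedged illustration only: the bag-of-tokens `embed` function is a stand-in for a neural encoder, and the `is_clone` helper and its `threshold` value are assumptions that do not come from any of the cited papers.

```python
import math
import re
from collections import Counter


def embed(fragment: str) -> Counter:
    """Step 1 (toy stand-in for a neural encoder): token-count vector."""
    # Identifiers/keywords, integer literals, and single punctuation characters.
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|\S", fragment))


def cosine(a: Counter, b: Counter) -> float:
    """Step 2: cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty fragment is treated as similar to nothing
    return dot / (norm_a * norm_b)


def is_clone(x: str, y: str, threshold: float = 0.8) -> bool:
    """Flag a pair as a clone candidate when similarity reaches the threshold."""
    return cosine(embed(x), embed(y)) >= threshold


same = cosine(embed("int a = b + 1;"), embed("int a = b + 1;"))
disjoint = cosine(embed("foo"), embed("bar"))
```

A token-count vector only captures lexical overlap, so this sketch would miss the semantic clones that the approaches listed here target; replacing `embed` with a learned model is the part those papers address.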
