Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence 2017
DOI: 10.24963/ijcai.2017/423
Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code

Abstract: Software clone detection, which aims to identify code fragments with similar functionality, plays an important role in software maintenance and evolution. Many clone detection approaches have been proposed. However, most of them represent source code with hand-crafted features based on lexical or syntactical information, or with unsupervised deep features, which makes it difficult to detect functional clone pairs, i.e., pieces of code with similar functionality that differ at both the syntactical and lexical levels…

Cited by 246 publications (220 citation statements) · References 9 publications
“…Since the majority of code clone pairs are Weak Type-3/Type-4 clones, BigCloneBench is quite appropriate for evaluating semantic clone detection. In our experiment, we follow the settings of the CDLH paper [3], which discards code fragments without any tagged true or false clone pairs, leaving 9,134 code fragments. Table II shows the basic information about the two datasets in our experiment.…”
Section: A. Experiment Data (mentioning)
confidence: 99%
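The filtering step quoted above is straightforward; the sketch below is a minimal illustration of it, assuming a hypothetical data layout in which `pairs` is a list of `(frag_id_1, frag_id_2, label)` tuples with a true/false clone tag and `fragments` maps fragment ids to source code (neither name comes from the paper).

```python
def filter_untagged_fragments(fragments, pairs):
    """Keep only fragments that occur in at least one tagged (true or false)
    clone pair; all other fragments carry no supervision signal."""
    tagged_ids = set()
    for id1, id2, _label in pairs:
        tagged_ids.add(id1)
        tagged_ids.add(id2)
    return {fid: code for fid, code in fragments.items() if fid in tagged_ids}

# With the CDLH setting on BigCloneBench, this kind of filtering is what
# leaves the 9,134 code fragments mentioned above.
# kept = filter_untagged_fragments(fragments, pairs)
```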
“…CDLH [3] uses a binary Tree-LSTM [5] to encode ASTs, and a hash function that optimizes the Hamming distance between the vector representations of AST pairs. ASTNN [4] uses recursive neural networks to encode AST subtrees for statements, then feeds the encodings of all statement trees into an RNN to compute the vector representation of a program.…”
Section: B. Experiment Settings (mentioning)
confidence: 99%
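To make the encoder concrete, here is a minimal PyTorch sketch of the binary (N-ary, N=2) Tree-LSTM update from Tai et al. [5] that CDLH applies bottom-up over a binarized AST. The class and argument names (`BinaryTreeLSTMCell`, `in_dim`, `mem_dim`) are illustrative, not from either paper, and the training loss and hashing objective are omitted.

```python
import torch
import torch.nn as nn

class BinaryTreeLSTMCell(nn.Module):
    """One bottom-up step of a binary Tree-LSTM: combine a node's embedded
    input with the (h, c) states of its left and right children."""

    def __init__(self, in_dim, mem_dim):
        super().__init__()
        # Input projection for the input, output, and update gates plus
        # the two per-child forget gates, packed into one matrix.
        self.W = nn.Linear(in_dim, 5 * mem_dim)
        # Separate hidden-state projections for the left and right child.
        self.U_l = nn.Linear(mem_dim, 5 * mem_dim, bias=False)
        self.U_r = nn.Linear(mem_dim, 5 * mem_dim, bias=False)

    def forward(self, x, left, right):
        (h_l, c_l), (h_r, c_r) = left, right
        gates = self.W(x) + self.U_l(h_l) + self.U_r(h_r)
        i, o, u, f_l, f_r = gates.chunk(5, dim=-1)
        # Memory cell mixes the new candidate with both children's cells.
        c = (torch.sigmoid(i) * torch.tanh(u)
             + torch.sigmoid(f_l) * c_l
             + torch.sigmoid(f_r) * c_r)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Usage on a tiny tree: combine two leaf children into a parent node.
cell = BinaryTreeLSTMCell(in_dim=32, mem_dim=64)
zero = (torch.zeros(1, 64), torch.zeros(1, 64))   # state for leaf/absent children
h_l, c_l = cell(torch.randn(1, 32), zero, zero)   # embedded AST leaf tokens
h_r, c_r = cell(torch.randn(1, 32), zero, zero)
h_root, c_root = cell(torch.randn(1, 32), (h_l, c_l), (h_r, c_r))
# CDLH-style hashing would then binarize the root state (e.g. torch.sign(h_root))
# and compare two fragments by Hamming distance over the resulting codes.
```

ASTNN [4] differs in that it encodes each statement's subtree separately and runs a sequential RNN over the resulting statement vectors, rather than encoding the whole binarized AST with one recursive pass.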