2020
DOI: 10.1145/3409331
Modular Tree Network for Source Code Representation Learning

Abstract: Learning representation for source code is a foundation of many program analysis tasks. In recent years, neural networks have already shown success in this area, but most existing models did not make full use of the unique structural information of programs. Although abstract syntax tree (AST)-based neural models can handle the tree structure in the source code, they cannot capture the richness of different types of substructure in programs. In this article, we propose a modular tree network that dynamically c…
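To make the "modular" idea in the abstract concrete, here is a minimal sketch, not the paper's actual model: embeddings are composed bottom-up over a Python AST, and each node type is routed to its own composition function (a stand-in for a learned neural module). All function names, the routing table, and the toy deterministic vector operations are illustrative assumptions.

```python
import ast

DIM = 4  # toy embedding dimensionality

def node_vec(node):
    # Toy embedding: derive a small fixed vector from the node-type name.
    # A real model would look up a learned embedding instead.
    h = sum(ord(c) for c in type(node).__name__) % 7
    return [(h + i) % 7 / 7.0 for i in range(DIM)]

def compose_default(vec, child_vecs):
    # Default module: element-wise average of the node and child vectors.
    vecs = [vec] + child_vecs
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def compose_binop(vec, child_vecs):
    # A distinct module for binary operations: element-wise max.
    vecs = [vec] + child_vecs
    return [max(v[i] for v in vecs) for i in range(DIM)]

# The "modular" routing table: AST node type -> composition module.
MODULES = {ast.BinOp: compose_binop}

def encode(node):
    # Recursively encode children, then combine them with the module
    # selected by this node's type.
    child_vecs = [encode(c) for c in ast.iter_child_nodes(node)]
    vec = node_vec(node)
    if not child_vecs:
        return vec  # leaf node: embedding only
    module = MODULES.get(type(node), compose_default)
    return module(vec, child_vecs)

tree = ast.parse("x = a + b")
vec = encode(tree)
```

The point of the routing table is that different substructures (e.g. a `BinOp` versus a generic statement) are composed by different functions, which is the richness of substructure the abstract says plain tree models miss.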

Cited by 39 publications (21 citation statements)
References 35 publications
“…More recently, Wang et al [21] introduced heterogeneous program graphs by including additional type information for nodes and edges in an AST and used GNNs to learn program properties. In another work, Wang et al [30] use a modular tree-based neural network to detect the semantic difference in code using AST. Some works use Data Flow Graphs to represent source code [31], [32].…”
Section: Evaluation and Results
confidence: 99%
“…The majority of the studies rely on RNN-based DL models. Among them, some of the studies [21,53,133,333,339] employed LSTM-based models, while others [54,135,152,348,360] used GRU-based models. Among the other kinds of ML models, studies employed GNN-based [85,341], DNN [230], conditional random fields [22], SVM [184,253], and CNN-based models [69,225,312].…”
Section: Model Training
confidence: 99%
“…Code representation learning aims at learning the semantics of programs for facilitating various downstream tasks related to program comprehension, such as code clone detection, code summarization, bug detection [30,7,10,28,14,27,15,11], etc. The development of deep learning techniques boosts the research on code representation learning.…”
Section: Code Representation Learning
confidence: 99%
“…In the work [7,10], code snippets are split into tokens and fed into neural networks such as RNNs and multi-head attentions for the representation learning. Considering the structural nature of code, [28,14,30] combine the abstract syntax trees (ASTs) into neural networks for capturing the code semantics. LeClair et al [14] use GNN-based encoder to model the AST of each program subroutine.…”
Section: Code Representation Learning
confidence: 99%