2020
DOI: 10.48550/arxiv.2006.12641
Preprint

Exploring Software Naturalness through Neural Language Models

Luca Buratti,
Saurabh Pujar,
Mihaela Bornea
et al.

Abstract: The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language m…
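As a concrete illustration of the raw-source-code setting the abstract describes, the sketch below feeds a C snippet to a BERT-style transformer with no AST features at all. This is a minimal sketch, assuming the Hugging Face transformers API; "bert-base-uncased" is a stand-in checkpoint, since no public name for the paper's C-BERT weights is assumed here.

```python
# Minimal sketch: treat raw source code as text and encode it with a
# BERT-style transformer, with no AST-derived features.
# "bert-base-uncased" is a stand-in checkpoint (assumption), not C-BERT.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

c_snippet = """
int sum(int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}
"""

# Tokenize the raw characters of the program, exactly as one would a sentence.
inputs = tokenizer(c_snippet, return_tensors="pt", truncation=True)
outputs = model(**inputs)

# One contextual embedding per token; the [CLS] vector can serve as a
# snippet-level representation for downstream code-analysis classifiers.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```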

Citations: Cited by 16 publications (29 citation statements)
References: 28 publications
“…Syntax-based Generic Approaches: These approaches encode program snippets, either by dividing the program into strings, lexicalizing them into tokens or parsing the program into a parse tree or abstract syntax tree (AST). Syntax-only generic embedding approaches include Code2Vec [3], Code2Seq [2], CodeBERT [15], C-BERT [7], InferCode [6], CC2Vec [24], AST-based NN [65] and ProgHeteroGraph [59] (see Table 2). Notably, these approaches use neural models for representing code (snippets), e.g., via code vector (e.g., Code2Vec [3]), machine translation (e.g., Code2Seq [2]) or transformers (e.g., CodeBERT [15]).…”
Section: Background 2.1 Generic Code Embedding (mentioning)
confidence: 99%
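The three encoding granularities the statement above lists (raw strings, lexical tokens, and parse/abstract syntax trees) can be made concrete with Python's standard library. This is an illustrative sketch over a Python snippet, not the pipeline of any cited system; those target various languages and use their own lexers and parsers.

```python
# Illustrative only: one snippet encoded at the three granularities the
# citation statement lists, using Python's stdlib lexer and parser.
import ast
import io
import tokenize

snippet = "def add(a, b):\n    return a + b\n"

# 1. The program as a raw string (character-level input).
chars = list(snippet)

# 2. The program lexicalized into tokens.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(snippet).readline)
          if tok.string.strip()]

# 3. The program parsed into an abstract syntax tree (AST).
tree = ast.parse(snippet)
node_types = [type(node).__name__ for node in ast.walk(tree)]

print(tokens)      # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
print(node_types)  # ['Module', 'FunctionDef', 'arguments', ...]
```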
“…It leverages the syntactic structure of programming languages to encode source code by representing code snippets as the set of paths in the program's AST, then uses attention to select the relevant paths while decoding. Besides, CodeBERT [15], C-BERT [7] and CuBERT [29] are BERT-inspired approaches; these methods adopt similar methodologies to learn code representations as BERT [11]. CodeBERT [15] is a bimodal pre-trained model for programming language (PL) and natural language (NL) tasks, which uses a transformer-based neural architecture to encode code snippets.…”
Section: Background 2.1 Generic Code Embedding (mentioning)
confidence: 99%
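The path-based encoding attributed to Code2Vec/Code2Seq above can be sketched as the set of leaf-to-leaf paths through a program's AST. The toy extractor below uses Python's ast module and is an assumption-laden simplification, not the cited implementation: the real systems target other languages and learn attention-weighted aggregations over embedded paths rather than printing them.

```python
# Toy sketch of Code2Vec-style path contexts: every pair of AST leaves is
# connected by the path up to their lowest common ancestor and back down.
import ast
from itertools import combinations

def leaf_label(node):
    # Prefer the concrete token where one exists (identifiers, constants).
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__

def ast_paths(source):
    """Yield (leaf, leaf-to-leaf path of node types, leaf) contexts."""
    chains = []  # root-to-leaf chains of actual AST nodes

    def walk(node, prefix):
        prefix = prefix + [node]
        # Treat identifiers and constants as leaves, as a lexer would.
        stop = isinstance(node, (ast.Name, ast.Constant, ast.arg))
        children = [] if stop else list(ast.iter_child_nodes(node))
        if not children:
            chains.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(ast.parse(source), [])

    for left, right in combinations(chains, 2):
        # Longest shared prefix ends at the lowest common ancestor.
        i = 0
        while i < min(len(left), len(right)) and left[i] is right[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(left[i:])]
        down = [type(n).__name__ for n in right[i:]]
        ancestor = type(left[i - 1]).__name__
        yield leaf_label(left[-1]), up + [ancestor] + down, leaf_label(right[-1])

for start, path, end in ast_paths("def add(a, b):\n    return a + b\n"):
    print(f"{start} -[{'|'.join(path)}]-> {end}")
    # e.g. a -[arg|arguments|arg]-> b
```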