Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.449

A Transformer-based Approach for Source Code Summarization

Abstract: Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model, which uses a self-attention mechanism and has been shown to be effective in capturing long-range dependencies. In this work, we show that despite its simplicity, the approach …
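The quoted abstract centers on modeling pairwise relationships between code tokens with self-attention. The sketch below only illustrates that mechanism and is not the paper's implementation; the tensor sizes and weight names are made up for the example.

```python
# Minimal sketch (not the authors' code): scaled dot-product self-attention
# over a sequence of code-token embeddings, illustrating how every token
# attends to every other token, i.e. pairwise long-range dependencies.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) code-token embeddings; w_*: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to queries/keys/values
    scores = q @ k.T / (k.shape[-1] ** 0.5)      # pairwise token-token similarity
    weights = F.softmax(scores, dim=-1)          # attention distribution per token
    return weights @ v                           # context-aware token representations

# Toy usage: 6 code tokens, model width 16 (values chosen only for illustration).
d_model, d_k = 16, 16
x = torch.randn(6, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # shape: (6, 16)
```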

Cited by 278 publications (225 citation statements)
References 24 publications (26 reference statements)

“…Vast sources of code in open-source repositories and forums make deep learning feasible for SE tasks. Code Summarization (Movshovitz-Attias and Cohen, 2013; Allamanis et al., 2016; Iyer et al., 2016; Alon et al., 2019a; Hu et al., 2018; Harer et al., 2019; Ahmad et al., 2020), Bug Detection (Ray et al., 2016; Li et al., 2018b; Russell et al., 2018), Program Repair (Chen et al., 2019; Lutellier et al., 2020), Code Translation (Chen et al., 2018; Drissi et al., 2018; Xu et al., 2020), Clone Detection (Zhang et al., 2019; Yu et al., 2019), and Code Completion (Li et al., 2018a; Hellendoorn and Devanbu, 2017; Parvez et al., 2018) are some of the tasks that have been addressed with deep neural solutions. While most of the prior approaches use task-specific representation learning, a few works (Alon et al., 2019b; Feng et al., 2020; Lachaux et al., 2020; Clement et al., 2020) attempted to learn transferable representations in an unsupervised fashion.…”
Section: Deep Learning in Software Engineering
Citation type: mentioning, confidence: 99%
“…For this task, we follow the methodology proposed by Ahmad et al. (2020). They used a seq2seq Transformer (Vaswani et al., 2017) (2019) and tokenize the dataset using Character BPE Tokenization (Sennrich et al., 2016) to create a vocabulary of the same size as in previous works.…”
Section: Source Code Summarization
Citation type: mentioning, confidence: 99%
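Character BPE Tokenization here refers to the subword scheme of Sennrich et al. (2016). Below is a hedged sketch of how such a tokenizer might be trained with the Hugging Face tokenizers package; the corpus path is a placeholder and the vocabulary size is only assumed to mirror the 50k mentioned for code in the next statement.

```python
# Hypothetical sketch of training a character-level BPE tokenizer
# (Sennrich et al., 2016) with the Hugging Face `tokenizers` package;
# the file path is a placeholder, not a value from the cited work.
from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer()
tokenizer.train(
    files=["code_corpus.txt"],   # one code snippet per line (placeholder path)
    vocab_size=50_000,           # assumed target vocabulary size
    min_frequency=2,             # ignore merges seen fewer than twice
)

encoding = tokenizer.encode("def add(a, b): return a + b")
print(encoding.tokens)           # subword tokens of the code snippet
```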
“…Training: We train the Transformer model proposed by Ahmad et al. (2020) on the CoDesc-train dataset. We use the Adam optimizer with an initial learning rate of 10⁻⁴, a mini-batch size of 32, a dropout rate of 0.2, and vocabulary sizes of 50k for code and 30k for NL.…”
Section: Source Code Summarization
Citation type: mentioning, confidence: 99%
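The hyperparameters in the statement above can be wired up roughly as follows. This is a sketch that uses PyTorch's generic nn.Transformer as a stand-in for the model of Ahmad et al. (2020); the model width, batch tensors, and loss are chosen only for illustration.

```python
# Minimal sketch of the reported training setup; the learning rate (1e-4),
# batch size (32), and dropout (0.2) come from the quoted statement, while
# everything else is a placeholder rather than the authors' configuration.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=128, nhead=8, dropout=0.2)   # dropout 0.2 as stated; width assumed
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam, initial learning rate 10^-4

# One illustrative optimizer step on random stand-in batches of size 32.
src = torch.randn(50, 32, 128)    # (source length, batch, d_model): code token embeddings
tgt = torch.randn(12, 32, 128)    # (target length, batch, d_model): summary token embeddings
out = model(src, tgt)
loss = out.pow(2).mean()          # placeholder loss, only to demonstrate the update
loss.backward()
optimizer.step()
optimizer.zero_grad()
```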
“…There have been a number of proposals for transformers on trees, including phrase-structure trees and dependency trees for natural languages, and abstract syntax trees for programming languages. One common strategy is to linearize a tree into a sequence (Ahmad et al., 2020; Currey and Heafield, 2019). Another strategy is to recognize that transformers are fundamentally defined not on sequences but on bags; all information about sequential order is contained in the positional encodings, so all that is needed to construct a tree transformer is to define new positional encodings on trees (Shiv and Quirk, 2019; Omote et al., 2019).…”
Section: Introduction
Citation type: mentioning, confidence: 99%
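The "linearize a tree into a sequence" strategy mentioned above can be illustrated with a small example on Python's own ASTs. The bracketed depth-first format below is one simple choice for the sketch, not the exact encoding used by any of the cited papers.

```python
# Illustrative sketch: flatten an abstract syntax tree into a token sequence
# via a depth-first traversal, so a sequence model can consume the tree.
import ast

def linearize(node):
    """Return a bracketed token sequence for the subtree rooted at `node`."""
    tokens = ["(", type(node).__name__]
    for child in ast.iter_child_nodes(node):
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

tree = ast.parse("def add(a, b):\n    return a + b")
print(" ".join(linearize(tree)))
# e.g. "( Module ( FunctionDef ( arguments ( arg ) ( arg ) ) ( Return ( BinOp ... ) ) ) )"
```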