Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.728
PyMT5: multi-mode translation of natural language and Python code with transformers

Abstract: Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PYMT5, the PYTHON method text-to-text transfer transformer, which is trained to translate between all pairs of PYTHON method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present …

Cited by 88 publications (74 citation statements). References 20 publications.
“…The ROUGE-L metrics are dramatically improved, which is not necessarily surprising, as XPyMT5 is conditioned on much more information than PyMT5. The syntax correctness of our fine-tuned models is slightly lower than the 92.1% reported by Clement et al. (2020).…”
Section: Method Completion Evaluation Results (contrasting)
confidence: 68%
“…eWASH yields N total training samples from a file with N total methods and class methods. For docstring completion or code summarization, the source contains the method signature and body, the target contains the desired docstring, and a control code is used to instruct the model which task it is to perform, just like PyMT5 (Clement et al., 2020).…”
Section: Extended Window Access by Syntax Hierarchy (mentioning)
confidence: 99%
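The statement above describes the control-code scheme PyMT5-style models use: a task token prefixed to the source sequence tells a single seq2seq model which translation direction to perform. A minimal sketch of building such a training pair for code summarization — the token `<to_docstring>` and the exact formatting are illustrative assumptions, not the paper's actual vocabulary:

```python
# Hypothetical sketch of control-code sample construction for a
# multi-task code/docstring seq2seq model. The control token
# "<to_docstring>" and the concatenation format are assumptions.

def make_docstring_sample(signature: str, body: str, docstring: str):
    """Build a (source, target) pair for seq2seq training.

    The control code prefixed to the source instructs the model
    which task to perform (here: code -> docstring).
    """
    source = "<to_docstring> " + signature + "\n" + body
    target = docstring
    return source, target


src, tgt = make_docstring_sample(
    "def add(a, b):",
    "    return a + b",
    "Return the sum of a and b.",
)
```

The same model can then be trained on the reverse direction (docstring to method body) simply by swapping the fields and using a different control token, which is what makes one model cover all feature-pair translations.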
“…Code Summarization (Movshovitz-Attias and Cohen, 2013; Allamanis et al., 2016; Iyer et al., 2016; Alon et al., 2019a; Hu et al., 2018; Harer et al., 2019; Ahmad et al., 2020), Bug Detection (Ray et al., 2016; Li et al., 2018b; Russell et al., 2018), Program Repair (Chen et al., 2019; Lutellier et al., 2020), Code Translation (Chen et al., 2018; Drissi et al., 2018; Xu et al., 2020), Clone Detection (Zhang et al., 2019; Yu et al., 2019), and Code Completion (Li et al., 2018a; Hellendoorn and Devanbu, 2017; Parvez et al., 2018) are some of the tasks that are addressed with deep neural solutions. While most of the prior approaches use task-specific representation learning, a few works (Alon et al., 2019b; Feng et al., 2020; Lachaux et al., 2020; Clement et al., 2020) attempted to learn transferable representations in an unsupervised fashion. Most closely related to our work, CodeBERT (Feng et al., 2020) is pre-trained on bimodal data to capture the semantic interaction between the input modalities (i.e., programming and natural languages).…”
Section: Deep Learning in Software Engineering (mentioning)
confidence: 99%