Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. However, there are issues surrounding their application to SE tasks. First, the majority of pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existing pre-training tasks require a bilingual corpus composed of source code and the associated natural language descriptions, which severely limits the amount of data available for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning.
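The abstract names three pre-training tasks but not how the three information sources are fed into a sequence-to-sequence model. Below is a minimal sketch only, assuming a MASS-style span-masking objective over code tokens, a linearized AST as a second view, and the tokenized method name standing in for the natural-language description (which is one plausible way to avoid a bilingual corpus); the helper names `make_pretraining_example` and `split_identifier` and all token conventions here are hypothetical illustrations, not the paper's exact design.

```python
# Illustrative sketch (NOT the authors' exact pipeline): build one
# sequence-to-sequence pre-training example that combines code tokens,
# a linearized AST, and pseudo natural language derived from the
# method name, so no bilingual code/NL corpus is required.
import random
import re

MASK, SEP = "<MASK>", "<SEP>"

def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into word tokens,
    a common trick for recovering pseudo natural language from code."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return spaced.lower().split()

def make_pretraining_example(code_tokens, ast_nodes, method_name, mask_ratio=0.5):
    """Mask a contiguous span of code tokens (MASS-style) for the decoder
    to reconstruct; the encoder also sees the linearized AST and the
    tokenized method name as auxiliary views of the same function."""
    span_len = max(1, int(len(code_tokens) * mask_ratio))
    start = random.randrange(len(code_tokens) - span_len + 1)
    target = code_tokens[start:start + span_len]
    masked = code_tokens[:start] + [MASK] * span_len + code_tokens[start + span_len:]
    source = masked + [SEP] + ast_nodes + [SEP] + split_identifier(method_name)
    return source, target

src, tgt = make_pretraining_example(
    ["def", "get_max", "(", "a", ",", "b", ")", ":",
     "return", "max", "(", "a", ",", "b", ")"],
    ["FunctionDef", "arguments", "Return", "Call"],  # toy linearized AST
    "get_max",
)
print(src)  # masked code + <SEP> + AST nodes + <SEP> + ["get", "max"]
print(tgt)  # the original tokens of the masked span
```

Identifier splitting is the key move in this sketch: because method names are written by developers in near-natural language, they provide an NL training signal that is available for any code corpus, not just code paired with documentation.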
Recent years have seen the successful application of deep learning to software engineering (SE). In particular, the development and use of pre-trained models of source code have enabled state-of-the-art results to be achieved on a wide variety of SE tasks. This paper provides an overview of this rapidly advancing field of research and reflects on future research directions.

The paper categorizes the SE tasks used to evaluate pre-trained models of source code as follows (Und. = understanding, Gen. = generation; the I-O column gives input-output types, e.g., C-C = code-to-code, NL-C = natural language-to-code, C-NL = code-to-natural language):

| Type | I-O | Task | Definition | ID - Dataset | Metrics |
|------|-----|------|------------|--------------|---------|
| Und. | C-V | WB: Wrong Binary Operator | Check if a given piece of code contains any incorrect binary operators. | K1 - Kanade et al. [2020] | Acc |
| Und. | C-V | ET: Exception Type | Predict the precise exception type. | K1 - Kanade et al. [2020] | Acc |
| Und. | C-V | BD: Bug Detection / Defect Detection | Check if a given function contains a defect. | D1 - Devign [2019]; P1 - Pradel et al. [2018] | Acc; Acc |
| Und. | C-V | CD: Clone Detection | Determine whether two code snippets are semantically equivalent. | B1 - BigCloneBench [2014]; C1 - CLCDSA [2019] | F1; P/R/F1 |
| Und. | C-V | CC: Code Classification | Classify the category of a given function. | P2 - POJ-104 [2016] | Acc/MAP@R |
| Und. | C-V | FD: Function-Docstring Mismatch | Determine whether a given function and its docstring correspond to each other. | K1 - Kanade et al. [2020] | Acc |
| Und. | C-C | CR: Code-to-Code Retrieval | Retrieve semantically similar code for a given piece of query code. | C1 - CLCDSA [2019]; P2 - POJ-104 [2016] | Acc/MRR/NDCG; MAP@R |
| Und. | C-C | VM: Variable-Misuse Localization and Repair | Identify the location of a misused variable and return the correct one. | V1 - Vasic et al. [2019] | Acc |
| Und. | C-C | CT: Cloze Test | Predict the masked token from code. | D2 - De Sousa et al. [2021] | Acc |
| Und. | NL-C | CS: Code Search / Text-to-Code Retrieval | Find the most relevant piece of code from a set of candidates for a given natural language description. | C2 - CodeSearchNet [2019]; C3 - AdvText [2021] | MRR; MRR/F1/Acc |
| Gen. | C-C | CP: Code Completion | Predict the missing/following token(s) of a given code context. | S1 - Svyatkovskiy et al. [2020]; L1 - Liu et al. [2020]; A1 - Alon et al. [2020] | RL/EditSim.; Acc; Acc@k |
| Gen. | C-C | TL: Code Translation | Translate code in one programming language into another programming language. | C4 - Chen et al. [2018]; T1 - TransCoder [2020]; C1 - CLCDSA [2019] | BLEU/Acc/CBLEU; Acc; BLEU/RL/CIDER |
| Gen. | C-C | BF: Bug Fixing | Repair buggy code by generating the correct version. | T2 - Tufano et al. [2019b] | BLEU/Acc/CBLEU |
| Gen. | C-C | MG: Mutant Generation | Inject into working code a mutant for a real bug. | T3 - Tufano et al. [2019a] | Acc |
| Gen. | C-C | AG: Assert Generation | Generate a correct unit test assert statement. | W1 - Watson et al. [2020] | Acc@k |
| Gen. | C-NL | SU: Code Summarization / Code Documentation | Generate a textual description of the functionality of a function. | C2 - CodeSearchNet [2019]; H1 - Haque et al. [2020]; H2 - Hu et al. [2018a]; H3 - Hu et al. [2018b]; M1 - Miceli et al. [2017] | BLEU; BLEU/RL; BLEU; BLEU/METEOR; BLEU |
| Gen. | C-NL | MN: Method Naming / Extreme Code Summarization | Predict the function name of a given function body. | | |
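Several retrieval-style tasks in the table (code search on CodeSearchNet, code-to-code retrieval) are scored with MRR. For reference, here is a minimal sketch of how the metric is computed; the function name and toy data are illustrative, not from any of the datasets above.

```python
# Minimal sketch of Mean Reciprocal Rank (MRR): for each query, take the
# reciprocal of the 1-based rank at which the first correct candidate
# appears in the ranked results, then average over all queries.
def mean_reciprocal_rank(ranked_results, relevant):
    """ranked_results: one ranked list of candidate ids per query.
    relevant: the correct candidate id for each query.
    Queries whose correct candidate never appears contribute 0."""
    total = 0.0
    for candidates, gold in zip(ranked_results, relevant):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)
    return total / len(ranked_results)

# Two queries: the correct snippet is ranked 1st for the first query and
# 3rd for the second, so MRR = (1/1 + 1/3) / 2 = 0.666...
print(mean_reciprocal_rank([["c1", "c2"], ["c9", "c4", "c7"]], ["c1", "c7"]))
```

MRR rewards placing the correct snippet near the top of the ranking, which is why it is the standard choice for code search, where a user typically inspects only the first few results.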
While a large number of pre-trained models of source code have been successfully developed and applied to a variety of software engineering (SE) tasks in recent years, our understanding of these pre-trained models is arguably fairly limited. With the goal of advancing this understanding, we perform the first systematic empirical comparison of 19 recently developed pre-trained models of source code on 13 SE tasks. To gain additional insights into these models, we adopt a recently proposed four-dimensional categorization of pre-trained models, and subsequently investigate whether there are correlations between different categories of pre-trained models and their performance on different SE tasks.