SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations

Niu, Changan; Li, Chuanyi; Ng, Vincent; Jin, G.; Huang, LiGuo; Luo, Bin

doi:10.48550/arxiv.2201.01549

Cited by 4 publications

(5 citation statements)

References 40 publications


“…Baselines. While comparing the evaluation results for different tasks, we compare with large scale pre-trained models, including GPT-2 [50], CodeGPT [43], PLBART [5], SPT-Code [45] and CodeT5 [63]. Most of our fine-tuning evaluation is on benchmarked dataset; thus, we report the available results from CodeXGLUE leaderboard [3].…”

Section: Methods (mentioning)

confidence: 99%

“…For code generation tasks, GPT-3 or BART-style models (e.g., Codex, CodeT5, PLBART, SPT-Code, etc. [5,19,45,63]) are popular. The important insight here is that independent of final tasks, when very high capacity models are trained with huge code corpora to learn simple, self-supervised, "busy work", they still learn general syntactic and semantic constraints of writing code.…”

Section: Introduction (mentioning)

confidence: 99%


NatGen: Generative pre-training by "Naturalizing" source code

Chakraborty, Ahmed, Ding, et al. 2022

Preprint


Pre-trained Generative Language models (e.g., PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, "Naturalizing" of source code, exploiting code's bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic preserving transformations to introduce un-natural forms of code, and then force our model to produce more natural original programs written by developers. Learning to generate equivalent, but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model in three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).

CCS Concepts: • Software and its engineering → Language features; • Computing methodologies → Knowledge representation and reasoning.
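The "de-naturalize, then restore" idea in this abstract can be illustrated with a small sketch. This is not NatGen's implementation and is not claimed to be one of its six transformation classes: it assumes a toy transformation (swapping the operands of a simple comparison, so `x > hi` becomes `hi < x`) and Python as the target language. The names `OperandSwap` and `denaturalize` are hypothetical. It requires Python 3.9+ for `ast.unparse`.

```python
import ast

# Mirror image of each ordering operator, so `a < b` can become `b > a`.
SWAP = {ast.Lt: ast.Gt, ast.Gt: ast.Lt, ast.LtE: ast.GtE, ast.GtE: ast.LtE}


class OperandSwap(ast.NodeTransformer):
    """Toy semantic-preserving rewrite (an assumption for illustration only).

    Only comparisons between plain names or constants are rewritten, so the
    changed evaluation order cannot affect side effects.
    """

    def visit_Compare(self, node):
        self.generic_visit(node)
        simple = (ast.Name, ast.Constant)
        if (len(node.ops) == 1
                and type(node.ops[0]) in SWAP
                and isinstance(node.left, simple)
                and isinstance(node.comparators[0], simple)):
            return ast.Compare(
                left=node.comparators[0],
                ops=[SWAP[type(node.ops[0])]()],
                comparators=[node.left],
            )
        return node


def denaturalize(source):
    """Return an equivalent but less conventional version of `source`."""
    tree = OperandSwap().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


natural = "def clamp(x, hi):\n    if x > hi:\n        return hi\n    return x"

# A (model input, training target) pair: the model would be asked to recover
# the developer-written form from the transformed one.
pair = (denaturalize(natural), natural)
```

Pairs like this need no manual labels, because the original developer-written code serves as the target.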


“…Researchers have been passionate about pre-training Transformer models for source code. There are three main architectures for existing models: Encoder-only [6,7,20,30,37,45,75], Decoder-only [4,26,77], and Encoder-decoder [1,11,29,36,62]. Encoder-only models are commonly pretrained with cloze tasks (e.g., masked language model) and sequence understanding tasks (e.g., next statement prediction).…”

Section: Related Work (mentioning)

confidence: 99%

“…Encoder-only models are commonly pretrained with cloze tasks (e.g., masked language model) and sequence understanding tasks (e.g., next statement prediction). Decoder-only models are mostly trained with autoregressive, left-to-right language model (LM). Encoder-Decoder models are pre-trained with different tasks including denoising autoencoding to reconstruct the wrongly permuted tokens [1], predicting missing identifiers [76], recovering method names [62], etc. In recent years, with the rapid development of computing devices, such as GPUs and TPUs, researchers also shed light on the incredible power of extremely large Transformer models (up to hundreds of billions of parameters) for understanding and generating code [4,26,29,32].…”

Section: Related Work (mentioning)

confidence: 99%

CONCORD: Clone-Aware Contrastive Learning for Source Code

Ding, Chakraborty, Buratti, et al. 2023

Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis


Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for learning general-purpose representation. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. In contrast, a deviant clone by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised pre-training strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware pre-training drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.

CCS Concepts: • Software and its engineering → Language features; • Computing methodologies → Knowledge representation and reasoning.
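The goal stated in this abstract, pulling benign clones together while pushing deviants apart, can be sketched with a generic triplet-style margin loss over embedding vectors. This is an illustrative assumption, not CONCORD's actual objective; the function names, the cosine similarity measure, and the `margin` value are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, clone, deviant, margin=0.5):
    """Generic margin loss (an assumption, not the paper's formulation).

    The loss is zero once the anchor is at least `margin` more similar to
    its benign clone than to the deviant, and grows otherwise.
    """
    return max(0.0, margin - cosine(anchor, clone) + cosine(anchor, deviant))


# Embeddings already well separated: no penalty.
good = triplet_loss([1.0, 0.0], [1.0, 0.1], [0.0, 1.0])

# Deviant sits on the anchor while the clone is orthogonal: large penalty.
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Minimizing a loss of this shape over many (program, clone, deviant) triples is one common way to shape a representation space; the paper should be consulted for the objective it actually uses.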


“…We also set the beam size as 250 while CURE's beam is configured to 1000. According to previous work [58,76], a larger training set and beam size may…”

Section: Rq2: What Is the Performance Of A Single Circle Model Compar... (mentioning)

confidence: 99%

CIRCLE: Continual Repair across Programming Languages

Wang, Zhang, He, et al. 2022

Preprint


Automatic Program Repair (APR) aims at fixing buggy source code with less manual debugging efforts, which plays a vital role in improving software reliability and development productivity. Recent APR works have achieved remarkable progress via applying deep learning (DL), particularly neural machine translation (NMT) techniques. However, we observe that existing DL-based APR models suffer from at least two severe drawbacks: (1) Most of them can only generate patches for a single programming language, as a result, to repair multiple languages, we have to build and train many repairing models. (2) Most of them are developed offline. Therefore, they won't function when there are new-coming requirements. To address the above problems, a T5-based APR framework equipped with continual learning ability across multiple programming languages is proposed, namely ContInual Repair aCross Programming LanguagEs (CIRCLE). Specifically, (1) CIRCLE utilizes a prompting function to narrow the gap between natural language processing (NLP) pre-trained tasks and APR. (2) CIRCLE adopts a difficulty-based rehearsal strategy to achieve lifelong learning for APR without access to the full historical data. (3) An elastic regularization method is employed to strengthen CIRCLE's continual learning ability further, preventing it from catastrophic forgetting.
