2021
DOI: 10.48550/arxiv.2102.07492
Preprint
DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Abstract: Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code.
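To make the pre-training objective concrete, the sketch below shows the kind of input/target pair a deobfuscation objective operates on: identifiers are replaced by positional placeholders, and the model must recover the original names. This is a minimal illustration only; the placeholder scheme (here a single VAR_i family) and the tokenization used by DOBF may differ.

```python
# Minimal sketch of building a deobfuscation training pair.
# The renaming scheme below is illustrative, not the authors' exact pipeline.
import re

def obfuscate(code: str, identifiers: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each identifier with a positional placeholder and return
    the obfuscated code plus the placeholder -> original-name mapping."""
    mapping = {}
    for i, name in enumerate(identifiers):
        placeholder = f"VAR_{i}"
        mapping[placeholder] = name
        code = re.sub(rf"\b{re.escape(name)}\b", placeholder, code)
    return code, mapping

snippet = (
    "def cumulative_sum(values):\n"
    "    total = 0\n"
    "    for v in values:\n"
    "        total += v\n"
    "    return total"
)
obfuscated, target = obfuscate(snippet, ["cumulative_sum", "values", "total", "v"])
print(obfuscated)  # model input: code with VAR_0 ... VAR_3
print(target)      # model target: {'VAR_0': 'cumulative_sum', ...}
```

The model is trained to map the obfuscated input back to the original identifier names, which forces it to capture the semantics of the surrounding code rather than rely on surface cues.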

Cited by 4 publications (8 citation statements)
References 22 publications
“…• CodeBERT: CodeBERT (Feng et al. 2020) uses the BERT architecture pre-trained on a source code corpus. • DOBF: DOBF (Roziere et al. 2021) is the model whose weights are used to initialize our model. It is pre-trained on Java and Python.…”
Section: Baseline Methods
confidence: 99%
“…We initialize the model parameters with the pre-trained weights of the DOBF model (Roziere et al. 2021). DOBF is a Transformer-based model trained with masked language modeling (MLM) and code deobfuscation objectives on Python and Java files from the GitHub public dataset available on Google BigQuery.…”
Section: Model Initialization
confidence: 99%
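The initialization step this statement describes amounts to copying matching tensors from a pre-trained checkpoint into the fine-tuning model. Below is a hedged PyTorch sketch of that step; the model class, checkpoint contents, and tensor names are illustrative stand-ins, not the released DOBF checkpoint format.

```python
# Hedged sketch: initializing a fine-tuning model from pre-trained weights.
import torch
import torch.nn as nn

class Seq2SeqModel(nn.Module):
    """Minimal encoder-decoder stand-in for the fine-tuned model."""
    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

model = Seq2SeqModel()

# In practice the checkpoint would come from torch.load("dobf_checkpoint.pt")
# on released weights (path hypothetical); here we simulate one so the
# snippet runs on its own.
pretrained = Seq2SeqModel().state_dict()

# Copy every tensor whose name and shape match; parameters without a match
# keep their fresh initialization.
own_state = model.state_dict()
compatible = {k: v for k, v in pretrained.items()
              if k in own_state and v.shape == own_state[k].shape}
own_state.update(compatible)
model.load_state_dict(own_state)
print(f"Initialized {len(compatible)}/{len(own_state)} tensors from pre-trained weights")
```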
“…Their approach achieves outstanding effectiveness. Later on, they presented DOBF (Rozière et al. 2021a) and TransCoder-ST (Rozière et al. 2021b): the former pre-trains a sequence-to-sequence model to revert the code obfuscation function; the latter uses automatic test generation to select high-quality translation pairs for fine-tuning the pre-trained model. These works use Computational Accuracy (CA), a metric that evaluates translated code as the ratio of test cases on which the input program and its translation produce the same outputs.…”
Section: Code Translation
confidence: 99%
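The Computational Accuracy metric mentioned in this statement reduces to running both programs on the same test inputs and counting agreements. The sketch below assumes the source and translated programs are exposed as Python callables; the function names and example inputs are illustrative.

```python
# Hedged sketch of Computational Accuracy (CA): the fraction of test cases
# on which the translated program produces the same output as the source.
from typing import Any, Callable, Iterable

def computational_accuracy(source_fn: Callable[..., Any],
                           translated_fn: Callable[..., Any],
                           test_inputs: Iterable[tuple]) -> float:
    """Return the ratio of test inputs where both programs agree."""
    inputs = list(test_inputs)
    if not inputs:
        return 0.0
    matches = 0
    for args in inputs:
        try:
            if source_fn(*args) == translated_fn(*args):
                matches += 1
        except Exception:
            # A crash in the translation counts as a failed test case.
            pass
    return matches / len(inputs)

# Example: checking a hand-written "translation" of max-of-three on 3 inputs.
def src_max3(a, b, c): return max(a, b, c)
def tr_max3(a, b, c): return a if a >= b and a >= c else (b if b >= c else c)
print(computational_accuracy(src_max3, tr_max3, [(1, 2, 3), (5, 0, -1), (2, 2, 2)]))  # 1.0
```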