2021
DOI: 10.48550/arxiv.2102.07492
Preprint
DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Abstract: Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code.
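To make the pre-training objective concrete, the sketch below shows the kind of input/target pair a deobfuscation objective operates on: identifiers are replaced by positional placeholders, and the model must recover the original names. This is a minimal illustration only; the placeholder scheme (here a single VAR_i family) and the tokenization used by DOBF may differ.

```python
# Minimal sketch of building a deobfuscation training pair.
# The renaming scheme below is illustrative, not the authors' exact pipeline.
import re

def obfuscate(code: str, identifiers: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each identifier with a positional placeholder and return
    the obfuscated code plus the placeholder -> original-name mapping."""
    mapping = {}
    for i, name in enumerate(identifiers):
        placeholder = f"VAR_{i}"
        mapping[placeholder] = name
        code = re.sub(rf"\b{re.escape(name)}\b", placeholder, code)
    return code, mapping

snippet = (
    "def cumulative_sum(values):\n"
    "    total = 0\n"
    "    for v in values:\n"
    "        total += v\n"
    "    return total"
)
obfuscated, target = obfuscate(snippet, ["cumulative_sum", "values", "total", "v"])
print(obfuscated)  # model input: code with VAR_0 ... VAR_3
print(target)      # model target: {'VAR_0': 'cumulative_sum', ...}
```

The model is trained to map the obfuscated input back to the original identifier names, which forces it to capture the semantics of the surrounding code rather than rely on surface cues.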

Cited by 4 publications (8 citation statements)
References 22 publications
“…• CodeBERT: CodeBERT (Feng et al. 2020) uses the BERT architecture pre-trained on a source code corpus. • DOBF: DOBF (Roziere et al. 2021) is the model whose weights are used to initialize our model. It is pre-trained on Java and Python.…”
Section: Baseline Methods
confidence: 99%
“…We initialize the model parameters with the pre-trained weights of the DOBF model (Roziere et al. 2021). DOBF is a Transformer-based model trained with masked language modeling (MLM) and code deobfuscation objectives on Python and Java files from the GitHub public dataset available on Google BigQuery.…”
Section: Model Initialization
confidence: 99%
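The initialization step this statement describes amounts to copying matching tensors from a pre-trained checkpoint into the fine-tuning model. Below is a hedged PyTorch sketch of that step; the model class, checkpoint contents, and tensor names are illustrative stand-ins, not the released DOBF checkpoint format.

```python
# Hedged sketch: initializing a fine-tuning model from pre-trained weights.
import torch
import torch.nn as nn

class Seq2SeqModel(nn.Module):
    """Minimal encoder-decoder stand-in for the fine-tuned model."""
    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

model = Seq2SeqModel()

# In practice the checkpoint would come from torch.load("dobf_checkpoint.pt")
# on released weights (path hypothetical); here we simulate one so the
# snippet runs on its own.
pretrained = Seq2SeqModel().state_dict()

# Copy every tensor whose name and shape match; parameters without a match
# keep their fresh initialization.
own_state = model.state_dict()
compatible = {k: v for k, v in pretrained.items()
              if k in own_state and v.shape == own_state[k].shape}
own_state.update(compatible)
model.load_state_dict(own_state)
print(f"Initialized {len(compatible)}/{len(own_state)} tensors from pre-trained weights")
```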
“…Their approach achieves outstanding effectiveness. Later on, they presented DOBF (Rozière et al. 2021a) and TransCoder-ST (Rozière et al. 2021b): the former pre-trains a sequence-to-sequence model to revert the code obfuscation function; the latter uses automatic test generation to select high-quality translation pairs for fine-tuning the pre-trained model. These works use Computational Accuracy (CA), a metric that evaluates translated code as the ratio of test cases on which the input program and its translation produce the same outputs.…”
Section: Code Translation
confidence: 99%
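The Computational Accuracy metric mentioned in this statement reduces to running both programs on the same test inputs and counting agreements. The sketch below assumes the source and translated programs are exposed as Python callables; the function names and example inputs are illustrative.

```python
# Hedged sketch of Computational Accuracy (CA): the fraction of test cases
# on which the translated program produces the same output as the source.
from typing import Any, Callable, Iterable

def computational_accuracy(source_fn: Callable[..., Any],
                           translated_fn: Callable[..., Any],
                           test_inputs: Iterable[tuple]) -> float:
    """Return the ratio of test inputs where both programs agree."""
    inputs = list(test_inputs)
    if not inputs:
        return 0.0
    matches = 0
    for args in inputs:
        try:
            if source_fn(*args) == translated_fn(*args):
                matches += 1
        except Exception:
            # A crash in the translation counts as a failed test case.
            pass
    return matches / len(inputs)

# Example: checking a hand-written "translation" of max-of-three on 3 inputs.
def src_max3(a, b, c): return max(a, b, c)
def tr_max3(a, b, c): return a if a >= b and a >= c else (b if b >= c else c)
print(computational_accuracy(src_max3, tr_max3, [(1, 2, 3), (5, 0, -1), (2, 2, 2)]))  # 1.0
```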