2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)
DOI: 10.1109/ase51524.2021.9678927
What do pre-trained code models know about code?

Abstract: Pre-trained models of source code have recently been applied successfully to a wide variety of Software Engineering tasks; they have also seen some practical adoption, e.g., for code completion. Yet, we still know very little about what these pre-trained models learn about source code. In this article, we use probing (simple diagnostic tasks that do not further train the models) to discover to what extent pre-trained models learn about specific aspects of source code. We use an extensible framework to…
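The probing setup described in the abstract can be illustrated with a minimal sketch: freeze a pre-trained code model, extract hidden-state embeddings for code snippets, and fit a lightweight classifier on a diagnostic label. The model name, mean pooling, and toy validity labels below are assumptions made for illustration, not the paper's exact framework.

# Minimal probing sketch (illustrative only, not the paper's framework).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "microsoft/codebert-base"  # assumption: any pre-trained code model would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()  # probing never updates the model's weights

def embed(snippets, layer=-1):
    """Mean-pooled hidden states from one layer, with gradients disabled."""
    with torch.no_grad():
        enc = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
        out = model(**enc, output_hidden_states=True)
        hidden = out.hidden_states[layer]           # (batch, seq_len, dim)
        mask = enc["attention_mask"].unsqueeze(-1)  # ignore padding tokens
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Hypothetical toy labels: 1 = syntactically valid snippet, 0 = corrupted snippet.
train_snippets = ["def add(a, b): return a + b", "def add(a, b) return a + b"]
train_labels = [1, 0]

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(train_snippets), train_labels)
# The probe's held-out accuracy is then read as evidence of how much of the
# target property is linearly recoverable from the frozen embeddings.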

Citations: cited by 47 publications (12 citation statements)
References: 70 publications
“…These pre-trained models have brought breakthrough changes to many downstream code-based tasks [21], including both classification tasks and generation tasks, by fine-tuning them on the datasets of the corresponding tasks. The former makes classification based on the given code snippets (e.g., clone detection [16] and vulnerability prediction [2]), while the latter produces a sequence of information based on code snippets or natural language descriptions (e.g., code completion [3] and code summarization [22]).…”
Section: A. Deep Code Models
Mentioning, confidence: 99%
“…• Design more efficient pre-training tasks to make Code-PTMs learn source code features better [20].…”
Section: Insights and Takeaways
Mentioning, confidence: 99%
“…Initial applications of pre-trained models in SE have primarily involved retraining PTM-NLs on source code [12]- [16]. Nevertheless, employing the resulting retrained models (henceforth PTM-Cs) for SE tasks is not ideal, as there are code-specific characteristics that may not be properly taken into account by these models, such as the syntactic [17], [18] and semantic structures [19] inherent in source code [20]. Consequently, SE researchers have developed a number of pre-trained models of source code (henceforth CodePTMs) that take into account code-specific characteristics in the past few years [21]- [26].…”
Section: Introduction
Mentioning, confidence: 99%
“…RQ5 Design: We employ probing experiments to assess the hidden state embeddings of multiple models and measure their ability to capture fundamental characteristics related to code. We adopt three probing tasks: code length prediction, cyclomatic complexity prediction, and invalid type detection [14]. These tasks correspond to probing surface-level, syntactic and semantic information of source code, respectively.…”
Section: RQ3: How Effective Is Adapter Tuning Over Multilingual…
Mentioning, confidence: 99%
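For context, the three diagnostic labels named in the excerpt above could be derived roughly as follows for Python snippets. The definitions below (token count, an AST-based McCabe-style approximation, and a parse-based stand-in for the type-error probe) are assumptions for illustration and may differ from the cited work's actual label construction.

# Hedged sketch of diagnostic label construction for the three probing tasks.
import ast

def code_length_label(snippet: str) -> int:
    """Surface-level probe target: token count (here, simple whitespace split)."""
    return len(snippet.split())

def cyclomatic_complexity_label(snippet: str) -> int:
    """Syntactic probe target: 1 + number of branching constructs in the AST
    (a rough McCabe-style approximation)."""
    tree = ast.parse(snippet)
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)
    return 1 + sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

def invalid_code_label(snippet: str) -> int:
    """Semantic probe target (toy stand-in): 1 if the snippet fails to parse,
    0 otherwise. The cited task instead perturbs operand types, which needs
    type information and is not reproduced here."""
    try:
        ast.parse(snippet)
        return 0
    except SyntaxError:
        return 1

print(code_length_label("def f(x):\n    return x + 1"))
print(cyclomatic_complexity_label("def f(x):\n    if x > 0:\n        return x\n    return -x"))

In a probing experiment, each label would be predicted from frozen model embeddings (as in the earlier sketch), and probe accuracy across layers indicates where surface, syntactic, and semantic information is encoded.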