2020
DOI: 10.48550/arxiv.2006.12641
Preprint

Exploring Software Naturalness through Neural Language Models

Luca Buratti,
Saurabh Pujar,
Mihaela Bornea
et al.

Abstract: The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language m…
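As a concrete illustration of the raw-source-code setting the abstract describes, the sketch below feeds a C snippet to a BERT-style transformer with no AST features at all. This is a minimal sketch, assuming the Hugging Face transformers API; "bert-base-uncased" is a stand-in checkpoint, since no public name for the paper's C-BERT weights is assumed here.

```python
# Minimal sketch: treat raw source code as text and encode it with a
# BERT-style transformer, with no AST-derived features.
# "bert-base-uncased" is a stand-in checkpoint (assumption), not C-BERT.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

c_snippet = """
int sum(int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}
"""

# Tokenize the raw characters of the program, exactly as one would a sentence.
inputs = tokenizer(c_snippet, return_tensors="pt", truncation=True)
outputs = model(**inputs)

# One contextual embedding per token; the [CLS] vector can serve as a
# snippet-level representation for downstream code-analysis classifiers.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```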

Citations: Cited by 16 publications (29 citation statements)
References: 28 publications
“…Syntax-based Generic Approaches: These approaches encode program snippets, either by dividing the program into strings, lexicalizing them into tokens or parsing the program into a parse tree or abstract syntax tree (AST). Syntax-only generic embedding approaches include Code2Vec [3], Code2Seq [2], CodeBERT [15], C-BERT [7], InferCode [6], CC2Vec [24], AST-based NN [65] and ProgHeteroGraph [59] (see Table 2). Notably, these approaches use neural models for representing code (snippets), e.g., via code vector (e.g., Code2Vec [3]), machine translation (e.g., Code2Seq [2]) or transformers (e.g., CodeBERT [15]).…”
Section: Background 2.1 Generic Code Embedding (mentioning)
confidence: 99%
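The three encoding granularities the statement above lists (raw strings, lexical tokens, and parse/abstract syntax trees) can be made concrete with Python's standard library. This is an illustrative sketch over a Python snippet, not the pipeline of any cited system; those target various languages and use their own lexers and parsers.

```python
# Illustrative only: one snippet encoded at the three granularities the
# citation statement lists, using Python's stdlib lexer and parser.
import ast
import io
import tokenize

snippet = "def add(a, b):\n    return a + b\n"

# 1. The program as a raw string (character-level input).
chars = list(snippet)

# 2. The program lexicalized into tokens.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(snippet).readline)
          if tok.string.strip()]

# 3. The program parsed into an abstract syntax tree (AST).
tree = ast.parse(snippet)
node_types = [type(node).__name__ for node in ast.walk(tree)]

print(tokens)      # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
print(node_types)  # ['Module', 'FunctionDef', 'arguments', ...]
```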
“…It leverages the syntactic structure of programming languages to encode source code by representing code snippets as the set of paths in the program's AST, then uses attention to select the relevant paths while decoding. Besides, CodeBERT [15], C-BERT [7] and CuBERT [29] are BERT-inspired approaches; these methods adopt similar methodologies to learn code representations as BERT [11]. CodeBERT [15] is a bimodal pre-trained model for programming language (PL) and natural language (NL) tasks, which uses a transformer-based neural architecture to encode code snippets.…”
Section: Background 2.1 Generic Code Embedding (mentioning)
confidence: 99%
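The path-based encoding attributed to Code2Vec/Code2Seq above can be sketched as the set of leaf-to-leaf paths through a program's AST. The toy extractor below uses Python's ast module and is an assumption-laden simplification, not the cited implementation: the real systems target other languages and learn attention-weighted aggregations over embedded paths rather than printing them.

```python
# Toy sketch of Code2Vec-style path contexts: every pair of AST leaves is
# connected by the path up to their lowest common ancestor and back down.
import ast
from itertools import combinations

def leaf_label(node):
    # Prefer the concrete token where one exists (identifiers, constants).
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__

def ast_paths(source):
    """Yield (leaf, leaf-to-leaf path of node types, leaf) contexts."""
    chains = []  # root-to-leaf chains of actual AST nodes

    def walk(node, prefix):
        prefix = prefix + [node]
        # Treat identifiers and constants as leaves, as a lexer would.
        stop = isinstance(node, (ast.Name, ast.Constant, ast.arg))
        children = [] if stop else list(ast.iter_child_nodes(node))
        if not children:
            chains.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(ast.parse(source), [])

    for left, right in combinations(chains, 2):
        # Longest shared prefix ends at the lowest common ancestor.
        i = 0
        while i < min(len(left), len(right)) and left[i] is right[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(left[i:])]
        down = [type(n).__name__ for n in right[i:]]
        ancestor = type(left[i - 1]).__name__
        yield leaf_label(left[-1]), up + [ancestor] + down, leaf_label(right[-1])

for start, path, end in ast_paths("def add(a, b):\n    return a + b\n"):
    print(f"{start} -[{'|'.join(path)}]-> {end}")
    # e.g. a -[arg|arguments|arg]-> b
```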