Proceedings of the 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages, 2020
DOI: 10.1145/3416506.3423580
Towards demystifying dimensions of source code embeddings

Abstract: Source code representations are key to applying machine learning techniques for processing and analyzing programs. A popular approach is neural source code embeddings, which represent programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, little is known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the con…

Cited by 14 publications (5 citation statements) | References 33 publications
“…As Rabin et al (2020) observed, a few manually engineered features can perform very close to the higher-dimensional code2vec embeddings. Thus, it is necessary to include handcrafted features as baselines.…”
Section: Introduction
confidence: 55%
“…An alternative to hand-crafting features is to automatically infer helpful features through deep learning (section 6). However, this approach may yield only a slight performance improvement (Rabin et al, 2020) while sacrificing model interpretability. Thus, it is vital to include models trained on manually engineered features as baselines to estimate whether the performance improvement justifies the added model complexity (Allamanis et al, 2018).…”
Section: Classifier Trained on Code Metrics
confidence: 99%
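The handcrafted-feature baseline mentioned above can be sketched as follows. This is a minimal illustration, not the cited papers' feature set: the specific metrics chosen here (lines of code, loop count, branch count, identifier count) are hypothetical stand-ins for whatever engineered features a study would use.

```python
import ast

def handcrafted_features(source: str) -> list:
    """Compute a small vector of classic code metrics for one Python function.

    The metric choices here are illustrative, not the features used in the
    cited work: lines of code, loop count, branch count, identifier count.
    """
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    loc = len(source.splitlines())
    loops = sum(isinstance(n, (ast.For, ast.While)) for n in nodes)
    branches = sum(isinstance(n, ast.If) for n in nodes)
    names = sum(isinstance(n, ast.Name) for n in nodes)
    return [float(loc), float(loops), float(branches), float(names)]

snippet = (
    "def f(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        if x > 0:\n"
    "            total += x\n"
    "    return total\n"
)
print(handcrafted_features(snippet))
```

A vector like this can be fed to any off-the-shelf classifier as the interpretable baseline against which a learned embedding is compared.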
“…Allamanis et al [26] showed that adding features that capture global context can increase the performance of a model. Rabin et al [27] found that code complexity features can improve the classification performance for some labels by up to about 7%. While that work focused on extracting a set of handcrafted features for better transparency, we study how feature enrichment affects the model's training behavior.…”
Section: Related Work
confidence: 99%
“…While this work focused on extracting a set of handcrafted features for better transparency, we study how feature enrichment affects the model's training behavior. Recent studies have shown that state-of-the-art models heavily rely on variables [13,28], specific tokens [29], and even structures [30]. Chen et al [31] focus on semantic representations of program variables, and study how well models can learn similarity between variables that have similar meaning (e.g., minimum and minimal).…”
Section: Related Work
confidence: 99%
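The kind of variable-name similarity discussed in the Chen et al citation can be illustrated with a cosine similarity over a crude lexical vector. This sketch deliberately substitutes character-bigram counts for a learned embedding (which the cited work would actually use), just to show the comparison mechanic; the pairs tested are the ones named in the quote plus a hypothetical unrelated name.

```python
from collections import Counter
import math

def bigram_vector(name: str) -> Counter:
    # Character bigrams as a crude lexical stand-in for a learned
    # variable embedding; real studies use trained model vectors.
    return Counter(name[i:i + 2] for i in range(len(name) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sim_related = cosine(bigram_vector("minimum"), bigram_vector("minimal"))
sim_unrelated = cosine(bigram_vector("minimum"), bigram_vector("counter"))
print(f"minimum~minimal: {sim_related:.2f}, minimum~counter: {sim_unrelated:.2f}")
```

With a real embedding, the interesting cases are semantically related names that share no surface text; the lexical proxy here only captures the easy case where related names also look alike.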
“…Rabin et al [17] evaluated the use of code2vec embeddings compared to handcrafted features for machine learning tasks, finding that code2vec embeddings offered a more even distribution of information gains and exhibited resilience to dimension removal compared to handcrafted feature vectors.…”
confidence: 99%
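The "resilience to dimension removal" experiment in the last citation can be sketched as: classify with all embedding dimensions, then again with a random subset zeroed out, and compare accuracy. This is a toy reconstruction under stated assumptions, not the cited paper's setup: the "embeddings" are synthetic Gaussian clusters, the classifier is nearest-centroid, and the dimension count and noise level are arbitrary.

```python
import random

random.seed(0)
DIM = 32  # hypothetical embedding width, not code2vec's actual size

def make_point(center, noise=0.5):
    return [c + random.gauss(0, noise) for c in center]

# Two synthetic "embedding" clusters standing in for two program labels.
centers = {0: [1.0] * DIM, 1: [-1.0] * DIM}
data = [(make_point(centers[y]), y) for y in (0, 1) for _ in range(50)]

def nearest_centroid_accuracy(points, dropped):
    # Classify by squared distance to each class center,
    # ignoring the removed dimensions.
    keep = [i for i in range(DIM) if i not in dropped]
    correct = 0
    for x, y in points:
        dists = {
            label: sum((x[i] - c[i]) ** 2 for i in keep)
            for label, c in centers.items()
        }
        correct += min(dists, key=dists.get) == y
    return correct / len(points)

full = nearest_centroid_accuracy(data, set())
half = nearest_centroid_accuracy(data, set(random.sample(range(DIM), DIM // 2)))
print(f"all dims: {full:.2f}, half removed: {half:.2f}")
```

Because the class signal here is spread evenly across all dimensions, accuracy barely moves when half of them are removed; a handcrafted feature vector that concentrates its information in a few dimensions would degrade much faster under the same ablation.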