Proceedings of the 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages, 2020
DOI: 10.1145/3416506.3423580
Towards demystifying dimensions of source code embeddings

Abstract: Source code representations are key to applying machine learning techniques for processing and analyzing programs. A popular approach is neural source code embeddings, which represent programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, little is known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the con…

Cited by 14 publications (5 citation statements) | References 33 publications
“…As Rabin et al (2020) observed, a few manually engineered features can perform very close to the higher-dimensional code2vec embeddings. Thus, it is necessary to include handcrafted features as baselines.…”
Section: Introduction
confidence: 55%
“…An alternative to hand-crafting features is to automatically infer helpful features through deep learning (section 6). However, this approach may yield only a slight performance improvement (Rabin et al, 2020) while sacrificing model interpretability. Thus, it is vital to include models trained on manually engineered features as baselines to estimate whether the performance improvement justifies the added model complexity (Allamanis et al, 2018).…”
Section: Classifier Trained on Code Metrics
confidence: 99%
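The handcrafted-feature baseline mentioned above can be sketched as follows. This is a minimal illustration, not the cited papers' feature set: the specific metrics chosen here (lines of code, loop count, branch count, identifier count) are hypothetical stand-ins for whatever engineered features a study would use.

```python
import ast

def handcrafted_features(source: str) -> list:
    """Compute a small vector of classic code metrics for one Python function.

    The metric choices here are illustrative, not the features used in the
    cited work: lines of code, loop count, branch count, identifier count.
    """
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    loc = len(source.splitlines())
    loops = sum(isinstance(n, (ast.For, ast.While)) for n in nodes)
    branches = sum(isinstance(n, ast.If) for n in nodes)
    names = sum(isinstance(n, ast.Name) for n in nodes)
    return [float(loc), float(loops), float(branches), float(names)]

snippet = (
    "def f(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        if x > 0:\n"
    "            total += x\n"
    "    return total\n"
)
print(handcrafted_features(snippet))
```

A vector like this can be fed to any off-the-shelf classifier as the interpretable baseline against which a learned embedding is compared.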
“…Allamanis et al [26] showed that adding features that capture global context can increase the performance of a model. Rabin et al [27] found that code complexity features can improve the classification performance for some labels by up to about 7%. While that work focused on extracting a set of handcrafted features for better transparency, we study how feature enrichment affects the model's training behavior.…”
Section: Related Work
confidence: 99%
“…While this work focused on extracting a set of handcrafted features for better transparency, we study how feature enrichment affects the model's training behavior. Recent studies have shown that state-of-the-art models heavily rely on variables [13,28], specific tokens [29], and even structures [30]. Chen et al [31] focus on semantic representations of program variables, and study how well models can learn similarity between variables that have similar meaning (e.g., minimum and minimal).…”
Section: Related Work
confidence: 99%
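The kind of variable-name similarity discussed in the Chen et al citation can be illustrated with a cosine similarity over a crude lexical vector. This sketch deliberately substitutes character-bigram counts for a learned embedding (which the cited work would actually use), just to show the comparison mechanic; the pairs tested are the ones named in the quote plus a hypothetical unrelated name.

```python
from collections import Counter
import math

def bigram_vector(name: str) -> Counter:
    # Character bigrams as a crude lexical stand-in for a learned
    # variable embedding; real studies use trained model vectors.
    return Counter(name[i:i + 2] for i in range(len(name) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sim_related = cosine(bigram_vector("minimum"), bigram_vector("minimal"))
sim_unrelated = cosine(bigram_vector("minimum"), bigram_vector("counter"))
print(f"minimum~minimal: {sim_related:.2f}, minimum~counter: {sim_unrelated:.2f}")
```

With a real embedding, the interesting cases are semantically related names that share no surface text; the lexical proxy here only captures the easy case where related names also look alike.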
“…Rabin et al [17] evaluated the use of code2vec embeddings compared to handcrafted features for machine learning tasks, finding that code2vec embeddings offered a more even distribution of information gains and exhibited resilience to dimension removal compared to handcrafted feature vectors.…”
confidence: 99%
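The "resilience to dimension removal" experiment in the last citation can be sketched as: classify with all embedding dimensions, then again with a random subset zeroed out, and compare accuracy. This is a toy reconstruction under stated assumptions, not the cited paper's setup: the "embeddings" are synthetic Gaussian clusters, the classifier is nearest-centroid, and the dimension count and noise level are arbitrary.

```python
import random

random.seed(0)
DIM = 32  # hypothetical embedding width, not code2vec's actual size

def make_point(center, noise=0.5):
    return [c + random.gauss(0, noise) for c in center]

# Two synthetic "embedding" clusters standing in for two program labels.
centers = {0: [1.0] * DIM, 1: [-1.0] * DIM}
data = [(make_point(centers[y]), y) for y in (0, 1) for _ in range(50)]

def nearest_centroid_accuracy(points, dropped):
    # Classify by squared distance to each class center,
    # ignoring the removed dimensions.
    keep = [i for i in range(DIM) if i not in dropped]
    correct = 0
    for x, y in points:
        dists = {
            label: sum((x[i] - c[i]) ** 2 for i in keep)
            for label, c in centers.items()
        }
        correct += min(dists, key=dists.get) == y
    return correct / len(points)

full = nearest_centroid_accuracy(data, set())
half = nearest_centroid_accuracy(data, set(random.sample(range(DIM), DIM // 2)))
print(f"all dims: {full:.2f}, half removed: {half:.2f}")
```

Because the class signal here is spread evenly across all dimensions, accuracy barely moves when half of them are removed; a handcrafted feature vector that concentrates its information in a few dimensions would degrade much faster under the same ablation.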