2022
DOI: 10.1007/s10664-022-10122-9
Software system comparison with semantic source code embeddings

Cited by 4 publications (2 citation statements)
References 59 publications
“…Kovačević et al [46] conducted experiments with the Code2Vec, Code2Seq, and CuBERT models to represent Java methods or classes as code embeddings, facilitating machine-learning-based detection of two code smells, i.e., long method and god class, while Ma et al [26] leveraged the CodeT5, CodeGPT, and CodeBERT models to detect the feature envy code smell. To compare software systems, Karakatič et al [47] utilized a pre-trained Code2Vec model to embed Java methods. The work of Fatima et al [27] employed the CodeBERT model to represent Java test cases, assisting in the prediction of flaky (i.e., non-deterministic) test cases.…”
Section: Pre-trained Models in Code-related Tasks (mentioning)
Confidence: 99%
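The statement above describes using pre-trained code embeddings as features for machine-learning-based code smell detection. The following is a minimal sketch of that idea, assuming method-level embedding vectors (e.g., 384-dimensional code2vec-style vectors) have already been extracted; the array shapes, labels, and classifier choice are illustrative assumptions, not details from the cited studies.

```python
# Sketch: pre-computed code embeddings as features for code smell detection.
# Assumes each Java method is already embedded into a fixed-size vector
# (hypothetical 384-dim vectors stand in for real code2vec output).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))    # one embedding per Java method (placeholder data)
y = rng.integers(0, 2, size=1000)   # 1 = smelly (e.g., long method), 0 = clean (placeholder labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("F1 on held-out methods:", f1_score(y_test, clf.predict(X_test)))
```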
“…Karakatic et al [92] introduced a novel method for comparing software systems by computing the robust Hausdorff distance between semantic source code embeddings of each program component. The authors utilized a pre-trained neural network model, code2vec, to generate source code vector representations from various open-source libraries.…”
Section: Duplicate Code Detection and Similarity (mentioning)
Confidence: 99%
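To make the comparison step concrete, here is a minimal sketch of measuring the distance between two software systems represented as sets of method embeddings. The cited work uses a robust Hausdorff distance; the averaged (modified) variant below is shown as one common robust alternative to the max-based definition and may differ from the authors' exact formulation. The embedding dimensions and system sizes are hypothetical.

```python
# Sketch: comparing two systems via a Hausdorff-style distance between
# their sets of method embeddings (e.g., produced by code2vec).
import numpy as np
from scipy.spatial.distance import cdist

def modified_hausdorff(A: np.ndarray, B: np.ndarray) -> float:
    """A: (n, d) embeddings of system A's methods; B: (m, d) embeddings of system B's methods."""
    D = cdist(A, B)               # pairwise Euclidean distances between all method pairs
    d_ab = D.min(axis=1).mean()   # mean distance from each A-method to its nearest B-method
    d_ba = D.min(axis=0).mean()   # mean distance from each B-method to its nearest A-method
    return max(d_ab, d_ba)        # symmetrize by taking the larger directed distance

# Hypothetical embedding sets for two open-source libraries
rng = np.random.default_rng(1)
system_a = rng.normal(size=(120, 384))
system_b = rng.normal(size=(150, 384))
print("system-to-system distance:", modified_hausdorff(system_a, system_b))
```

Averaging over nearest-neighbor distances (instead of taking the maximum, as in the classical Hausdorff distance) reduces the influence of a single outlier method, which is the usual motivation for robust variants in this setting.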