2021
DOI: 10.1007/s42979-021-00566-z

Commit2Vec: Learning Distributed Representations of Code Changes

Abstract: Deep learning methods have found successful applications in fields like image classification and natural language processing. They have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach for source code representation, which uses information about its syntactic structure, and we extend it to represent source code changes (i.e., commits). We use t…

Citations: cited by 26 publications (7 citation statements)
References: 23 publications
“…Notably, the authors successfully fixed 66 inconsistent method names in a live study on projects in the wild. Cabrera Lozoya et al [100] extended a state-of-the-art approach for representing source code to also include changes in the source code (commits). Transfer learning was then applied to classify security-relevant commits.…”
Section: Code Refinement (mentioning)
confidence: 99%
“…Keshav Ram (2020) used a linear SVM on TF-IDF features of Java tokens extracted from code changes, Code2Vec (Alon et al, 2019), convolutional neural network on code changes with surrounding code as context, and BiLSTM (Bi-directional Long Short-Term Memory) on code changes without context to classify commits, remarking that the Code2Vec approach yielded poor results. (Lozoya et al, 2021) proposed a method inspired by Code2Vec to compute vectors from code changes, trained on the pretext task of classifying commits' Jira Ticket Priorities before classifying security-fix commits. Many aforementioned works were conducted under the fully-supervised setting - a setting that can be a stretch for new categories where labeled data is scarce.…”
Section: Related Work (mentioning)
confidence: 99%
“…Building a semantically rich representations of changesets is relevant to other software engineering applications beyond bug localization, i.e., just-in-time defect prediction, recommendation of a code reviewer for a patch, tangled change prediction. Approaches that define novel changeset embeddings (vector representations of changeset), including CC2Vec [15] and Commit2Vec [28], leverage the difference between added and removed lines of code, among other changeset characteristics. Corley et al [7] studied how including different types of lines from a changeset affects the performance of Latent Dirichlet Allocation-based feature location, observing that including context, additions, and log messages, but excluding removed lines, achieves the best performance.…”
Section: Changeset Representation (mentioning)
confidence: 99%