IdBench: Evaluating Semantic Representations of Identifier Names in Source Code

Wainakh, Yaza; Rauf, Moiz; Pradel, Michael

doi:10.1109/icse43902.2021.00059

Cited by 17 publications

(4 citation statements)

References 51 publications

“…Relevance or Similarity: Several studies define relevance as to how relevant is the model's output to the reference text or code [2,5,11,16]. Others asked developers to rate the similarity, relatedness, and contextual or semantic similarity between outputs and reference texts [9,10,15].…”

Section: Evaluation of NLP-based Models (mentioning)

confidence: 99%

On the Evaluation of NLP-based Models for Software Engineering

Izadi, Ahmadabadi

2022

Preprint

NLP-based models have been increasingly incorporated to address SE problems. These models are either employed in the SE domain with little to no change, or they are greatly tailored to source code and its unique characteristics. Many of these approaches are considered to be outperforming or complementing existing solutions. However, an important question arises here: Are these models evaluated fairly and consistently in the SE community? To answer this question, we reviewed how NLP-based models for SE problems are being evaluated by researchers. The findings indicate that currently there is no consistent and widely-accepted protocol for the evaluation of these models. While different aspects of the same task are being assessed in different studies, metrics are defined based on custom choices, rather than a system, and finally, answers are collected and interpreted case by case. Consequently, there is a dire need to provide a methodological way of evaluating NLP-based models to have a consistent assessment and preserve the possibility of fair and efficient comparison.

“…To this end, the approach represents names and values as vectors that preserve their meaning. To represent identifier names, we build on learned token embeddings [13], which map each name into a vector while preserving the semantic similarities of names [54]. For example, the vector of probability will be close to the vectors of names probab and likelihood, because these names refer to similar concepts.…”

Section: Overview (mentioning)

confidence: 99%
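The statement above relies on the idea that names with similar meaning map to nearby vectors. A minimal sketch of how such closeness is commonly measured, using cosine similarity, follows. The three-dimensional vectors in `toy_embeddings` are made-up illustrative values, not learned embeddings, and the name `filename` is a hypothetical unrelated identifier added only for contrast.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Similarity with a zero vector is undefined; report 0.0 by convention.
        return 0.0
    return dot / (norm_u * norm_v)


# Assumption: hand-written toy vectors, chosen only to illustrate the idea.
toy_embeddings = {
    "probability": [0.9, 0.1, 0.0],
    "likelihood": [0.8, 0.2, 0.1],
    "filename": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_embeddings["probability"], toy_embeddings["likelihood"])
unrelated = cosine_similarity(toy_embeddings["probability"], toy_embeddings["filename"])
print(f"probability vs likelihood: {related:.3f}")
print(f"probability vs filename:   {unrelated:.3f}")
```

With real embeddings the vectors would come from a trained model; only the similarity computation carries over unchanged.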

“…We build upon FastText [13], a neural word embedding known to represent the semantics of identifiers more accurately than other popular embeddings [54]. An additional key benefit of FastText is to avoid the out-of-vocabulary problem that other embeddings, e.g., Word2vec [36] suffer from, by splitting each token into n-grams and by computing a separate vector representation for each n-gram.…”

Section: Representation as Vectors (mentioning)

confidence: 99%
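The n-gram splitting described in the statement above can be sketched as follows. This is an illustration of the idea rather than FastText's actual implementation: it wraps a token in the boundary markers `<` and `>` and enumerates character n-grams, with the 3 to 6 length range taken as an assumed default. The helper names `char_ngrams` and `shared_ngrams` are hypothetical.

```python
def char_ngrams(token, min_n=3, max_n=6):
    """List the character n-grams of a token wrapped in boundary markers."""
    padded = f"<{token}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the padded token.
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def shared_ngrams(a, b, min_n=3, max_n=6):
    """Return the n-grams that two tokens have in common."""
    return set(char_ngrams(a, min_n, max_n)) & set(char_ngrams(b, min_n, max_n))


print(char_ngrams("prob", 3, 3))
# ['<pr', 'pro', 'rob', 'ob>']
print(sorted(shared_ngrams("probability", "probab", 3, 3)))
# ['<pr', 'bab', 'oba', 'pro', 'rob']
```

Because an unseen token such as `probab` shares n-grams with `probability`, a vector can still be composed for it from the n-gram vectors, which is why the out-of-vocabulary problem does not arise.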

Nalin: Learning from Runtime Behavior to Find Name-Value Inconsistencies in Jupyter Notebooks

Patra, Pradel

2021

Preprint

Self Cite

Variable names are important to understand and maintain code. If a variable name and the value stored in the variable do not match, then the program suffers from a name-value inconsistency, which is due to one of two situations that developers may want to fix: Either a correct value is referred to through a misleading name, which negatively affects code understandability and maintainability, or the correct name is bound to a wrong value, which may cause unexpected runtime behavior. Finding name-value inconsistencies is hard because it requires an understanding of the meaning of names and knowledge about the values assigned to a variable at runtime. This paper presents Nalin, a technique to automatically detect name-value inconsistencies. The approach combines a dynamic analysis that tracks assignments of values to names with a neural machine learning model that predicts whether a name and a value fit together. To the best of our knowledge, this is the first work to formulate the problem of finding coding issues as a classification problem over names and runtime values. We apply Nalin to 106,652 real-world Python programs, where meaningful names are particularly important due to the absence of statically declared types. Our results show that the classifier detects name-value inconsistencies with high accuracy, that the warnings reported by Nalin have a precision of 80% and a recall of 76% w.r.t. a ground truth created in a user study, and that our approach complements existing techniques for finding coding issues.

CCS Concepts: Software and its engineering → Software maintenance tools; Software post-development issues.

“…As stated by Host and Ostvold (2007), even though naming is part of daily life for programmers, it entails a great deal of time and thought: names should convey to others the purpose of the code (Martin, 2008) and reflect the meaning of domain concepts (Marcus et al, 2004). Meaningful identifier names are key to bridging the gap between intention and implementation (Wainakh et al, 2021). Therefore, given that poorly chosen identifier names might hinder source code comprehension (Schankin et al, 2018), using meaningful identifier names is a recommended practice present in several coding style guides and conventions.…”

Section: Introduction (mentioning)

confidence: 99%

Naming Practices in Object-oriented Programming: An Empirical Study

Gresta, Durelli, Cirilo

2023

JSERD

Currently, research indicates that comprehending code takes up far more developer time than writing code. Given that most modern programming languages place little to no limitations on identifier names, and so developers are allowed to choose identifier names at their own discretion, one key aspect of code comprehension is the naming of identifiers. Research in naming identifiers shows that informative names are crucial to improving the readability and maintainability of programs: essentially, intention-revealing names make code easier to understand and act as a basic form of documentation. Poorly named identifiers tend to hurt the comprehensibility and maintainability of software systems. However, most computer science curricula emphasize programming concepts and language syntax over naming guidelines and conventions. Consequently, programmers lack knowledge about naming practices. This article is an extension of our previous study on naming practices. Previously, we set out to explore naming practices of Java programmers. To this end, we analyzed 1,421,607 identifier names (i.e., attributes, parameters, and variables names) from 40 open-source Java projects and categorized these names into eight naming practices. As a follow-up study to further investigate naming practices, we examined 40 open-source C++ projects and categorized 1,181,774 identifier names according to the previously mentioned eight naming practices. We examined the occurrence and prevalence of these categories across C++ and Java projects and our results also highlight in which contexts identifiers following each naming practice tend to appear more regularly. Finally, we also conducted an online survey questionnaire with 52 software developers to gain insight from the industry. All in all, we believe the results based on the analysis of 2,603,381 identifier names can be helpful to enhance programmers’ awareness and contribute to improving educational materials and code review methods.
