Abstract: Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.

Index Terms: source code, neural networks, embeddings, identifiers, benchmark

I. Introduction

Identifier names play an important role in writing, understanding, and maintaining high-quality source code [1]. Because they convey information about the meaning of variables, functions, classes, and other program elements, developers often rely on identifiers to understand code written by themselves and others. Beyond developers, various automated techniques analyze, use, and improve identifier names. For example, identifiers have been used to find programming errors [2]-[5], to mine specifications [6], to infer types [7], [8], to predict the name of a method [9], or to complete partial code using a learned language model [10]. Techniques for …
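To make the distinction that IdBench probes concrete, the following minimal sketch (not part of the paper; all embedding vectors are invented toy values) contrasts the two kinds of representations discussed in the abstract: a learned embedding compares identifiers via the cosine similarity of their vectors, whereas a lexical string distance compares only their spellings and can therefore rate identifiers with opposing meanings, such as minIndex and maxIndex, as highly similar.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def levenshtein(s, t):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def lexical_similarity(s, t):
    # Edit distance normalized to [0, 1], where 1 means identical strings.
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

# Toy embedding vectors, invented for illustration; a real system
# would learn these from a large code corpus.
emb = {
    "len":      [0.9, 0.1, 0.0],
    "size":     [0.8, 0.2, 0.1],
    "minIndex": [0.1, 0.9, 0.3],
    "maxIndex": [0.1, 0.2, 0.9],
}

# len/size are lexically unrelated but semantically similar:
print(lexical_similarity("len", "size"))           # 0.0  (low)
print(cosine_similarity(emb["len"], emb["size"]))  # ~0.98 (high)

# minIndex/maxIndex are lexically close but mean opposite things:
print(lexical_similarity("minIndex", "maxIndex"))  # 0.75 (high, misleading)
```

Running the sketch shows the lexical measure ranking the opposing pair minIndex/maxIndex far above the synonymous pair len/size, exactly the failure mode the abstract warns about for downstream tools such as refactoring assistants.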