Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering 2014
DOI: 10.1145/2635868.2635875

On the localness of software

Abstract: The n-gram language model, which has its roots in statistical natural language processing, has been shown to successfully capture the repetitive and predictable regularities ("naturalness") of source code, and help with tasks such as code suggestion, porting, and designing assistive coding devices. However, we show in this paper that this natural-language-based model fails to exploit a special property of source code: localness. We find that human-written programs are localized: they have useful local regulari…

Cited by 232 publications (203 citation statements) · References 51 publications
“…Following this work, language models have been used to good effect in code suggestion [22,48,53,15], cross-language porting [38,37,39,24], coding standards [2], idiom mining [3], and code deobfuscation [47]. Since language models are useful in these tasks, …” [Footnote: Baishakhi Ray and Vincent Hellendoorn are both first authors and contributed equally to the work.]
Section: Introduction
confidence: 99%
“…Tu et al. [28], however, argue that code tokenization is enough for n-gram language models. They also argue that n-gram models will not be useful when a particular context is not present in the source code corpora used to train the model.…”
Section: Methods Call Recommenders
confidence: 99%
“…Tu et al. [3] sought to confirm that software is localized. Building on the finding that software is natural, they sought to show that there are "local regularities [in software] that can be captured and exploited." They found, empirically, that this is the case.…”
Section: Prior Work
confidence: 99%
“…Since Gamboge uses a simple n-gram model, extending its prediction backend with the cache language model developed by Tu et al. [3] may be beneficial. Since they have already shown that, in single-token contexts, it surpasses a bare n-gram language model in suggestion performance, it may also improve suggestion performance for multi-token prediction.…”
Section: Combination With Other Code Suggestion Engines
confidence: 99%
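The citation statements above revolve around one technique: interpolating a global n-gram model with a local cache of recently seen tokens, so that locally repeated identifiers receive boosted probability even when their context is absent from the training corpus. The sketch below illustrates that idea only; the class name, the interpolation weight `lam`, and the fixed-size cache are illustrative assumptions, not the authors' actual implementation.

```python
from collections import Counter, deque


class CacheNgramModel:
    """Minimal sketch of a cache-augmented n-gram model (illustrative,
    not the implementation from the paper): the final probability is a
    linear interpolation of a global n-gram estimate and a local cache
    estimate over the most recently observed tokens."""

    def __init__(self, n=3, cache_size=100, lam=0.5):
        self.n = n
        self.lam = lam             # interpolation weight for the cache (assumed)
        self.ngrams = Counter()    # counts of (context, token) pairs
        self.contexts = Counter()  # counts of contexts
        self.cache = deque(maxlen=cache_size)  # recent local tokens

    def train(self, tokens):
        """Count n-grams over a training token stream."""
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            tok = tokens[i + self.n - 1]
            self.ngrams[(ctx, tok)] += 1
            self.contexts[ctx] += 1

    def observe(self, tok):
        """Record a locally seen token in the cache."""
        self.cache.append(tok)

    def prob(self, ctx, tok):
        """Interpolated probability: (1 - lam) * global + lam * cache."""
        ctx = tuple(ctx)
        p_global = (self.ngrams[(ctx, tok)] / self.contexts[ctx]
                    if self.contexts[ctx] else 0.0)
        p_cache = (self.cache.count(tok) / len(self.cache)
                   if self.cache else 0.0)
        return (1 - self.lam) * p_global + self.lam * p_cache
```

With an empty cache the model falls back to the (down-weighted) global estimate; once a token recurs locally, its cache component raises the interpolated probability, which is the "local regularity" effect the statements describe.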