Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities of probabilistic source code models (Section 5).¹ Finally, we mention a few overlapping research areas (Section 7), and we discuss challenges and interesting future directions (Section 6).

¹ It may be worth pointing out that deep learning and probabilistic modeling are not mutually exclusive. Indeed, many of the currently most effective methods for language modeling, for example, are based on deep learning.

Related Reviews and Other Resources. There have been short reviews summarizing the progress and the vision of the research area, from both the software engineering [52] and programming languages [28, 195] perspectives. However, none of these articles can be considered an extensive literature review, which is the purpose of this work. Ernst [57] discusses promising areas of applying natural language processing to software development, including error messages, variable names, code comments, and user questions. Some resources, datasets, and code can be found at http://learnbigcode.github.io/. An online version of the work reviewed here, which we will keep up to date by accepting external contributions, can be found at https://ml4code.github.io.
THE NATURALNESS HYPOTHESIS

Many aspects of code, such as names, formatting, and the lexical order of methods, have no impact on program semantics. This is precisely why we abstract them away in most program analyses. But then, why should statistical properties of code matter at all? To explain this, we recently suggested a hypothesis, called the naturalness hypothesis. The inspiration for the naturalness hypothesis can be traced back to the "literate programming" concept of D. Knuth, which draws from the insight that programming is a form of human communication: "Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do..." [105] The naturalness hypothesis, then, holds that:

The naturalness hypothesis. Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools.

The exploitation of the statistics of human communication is a mature and effective technology, with numerous applications ...
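To make the statistical claim concrete, the sketch below measures the predictability of a token stream with a smoothed bigram language model, the kind of n-gram cross-entropy measurement commonly used to test the naturalness hypothesis empirically. This is our own minimal illustration, not an implementation from the surveyed work; the tokenizer, the toy training corpus, and all function names are hypothetical simplifications.

```python
import math
import re
from collections import Counter

def tokenize(code):
    """Crude lexer (illustrative only): identifiers, numbers, and
    single punctuation characters; real systems use a proper lexer."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def train_bigram(tokens):
    """Count unigram and bigram frequencies over a token stream."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def cross_entropy(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    """Average bits per token under an add-alpha smoothed bigram model.
    Lower cross-entropy means the corpus is more predictable ("natural")."""
    bits = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Smoothing keeps the probability nonzero for unseen pairs.
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        bits -= math.log2(p)
    return bits / (len(tokens) - 1)

# Toy demonstration: train on a repetitive code snippet, then score
# an unseen but structurally similar snippet.
train_tokens = tokenize("for i in range(n): total = total + x[i]\n" * 100)
unigrams, bigrams = train_bigram(train_tokens)
test_tokens = tokenize("for j in range(n): total = total + x[j]")
print(cross_entropy(test_tokens, unigrams, bigrams, vocab_size=len(unigrams)))
```

On repetitive corpora like the toy example above, even this crude model assigns low per-token cross-entropy to unseen but similar code; the empirical content of the naturalness hypothesis is that real code corpora exhibit exactly this kind of exploitable regularity, to a greater degree than natural language text.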