Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities of probabilistic source code models (Section 5).¹ Finally, we mention a few overlapping research areas (Section 7), and we discuss challenges and interesting future directions (Section 6).

¹ It may be worth pointing out that deep learning and probabilistic modeling are not mutually exclusive. Indeed, many of the currently most effective methods for language modeling, for example, are based on deep learning.

Related Reviews and Other Resources. There have been short reviews summarizing the progress and the vision of the research area, from both the software engineering [52] and programming languages [28, 195] perspectives. However, none of these articles can be considered an extensive literature review, which is the purpose of this work. Ernst [57] discusses promising areas of applying natural language processing to software development, including error messages, variable names, code comments, and user questions. Some resources, datasets, and code can be found at http://learnbigcode.github.io/. An online version of the work reviewed here, which we will keep up to date by accepting external contributions, can be found at https://ml4code.github.io.
THE NATURALNESS HYPOTHESIS

Many aspects of code, such as names, formatting, and the lexical order of methods, have no impact on program semantics. This is precisely why we abstract them away in most program analyses. But then, why should statistical properties of code matter at all? To explain this, we recently suggested a hypothesis, called the naturalness hypothesis. The inspiration for the naturalness hypothesis can be traced back to the "literate programming" concept of D. Knuth, which draws from the insight that programming is a form of human communication: "Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do..." [105] The naturalness hypothesis, then, holds that:

The naturalness hypothesis. Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools.

The exploitation of the statistics of human communication is a mature and effective technology, with numerous applications ...
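To make the statistical claim concrete, the sketch below measures the predictability of a token stream with a smoothed bigram language model, the kind of n-gram cross-entropy measurement commonly used to test the naturalness hypothesis empirically. This is our own minimal illustration, not an implementation from the surveyed work; the tokenizer, the toy training corpus, and all function names are hypothetical simplifications.

```python
import math
import re
from collections import Counter

def tokenize(code):
    """Crude lexer (illustrative only): identifiers, numbers, and
    single punctuation characters; real systems use a proper lexer."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def train_bigram(tokens):
    """Count unigram and bigram frequencies over a token stream."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def cross_entropy(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    """Average bits per token under an add-alpha smoothed bigram model.
    Lower cross-entropy means the corpus is more predictable ("natural")."""
    bits = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Smoothing keeps the probability nonzero for unseen pairs.
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        bits -= math.log2(p)
    return bits / (len(tokens) - 1)

# Toy demonstration: train on a repetitive code snippet, then score
# an unseen but structurally similar snippet.
train_tokens = tokenize("for i in range(n): total = total + x[i]\n" * 100)
unigrams, bigrams = train_bigram(train_tokens)
test_tokens = tokenize("for j in range(n): total = total + x[j]")
print(cross_entropy(test_tokens, unigrams, bigrams, vocab_size=len(unigrams)))
```

On repetitive corpora like the toy example above, even this crude model assigns low per-token cross-entropy to unseen but similar code; the empirical content of the naturalness hypothesis is that real code corpora exhibit exactly this kind of exploitable regularity, to a greater degree than natural language text.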