Toward Deep Learning Software Repositories

White, Martin; Vendome, Christopher; Linares‐Vásquez, Mario; Poshyvanyk, Denys

doi:10.1109/msr.2015.38

“…White et al (White et al, 2015) trained RNNs on source code and showed their practicality in code completion. Similarly, Raychev et al (Raychev et al, 2014) used RNNs in code completion to synthesize method call chains in Java code.…”

Section: Prior Work (mentioning)

confidence: 99%

“…For this task, we employed long short-term memory (LSTM) recurrent neural networks, as they were successfully used by prior work in predicting tokens from source code (Raychev et al, 2014;White et al, 2015). Unlike the prior work, we have trained two models-the forwards model, given a prefix context and returning the distribution of the next token; and the backwards model, given a suffix context and returning the distribution of the previous token.…”

Section: Training the LSTMs (mentioning)

confidence: 99%
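The forwards/backwards setup quoted above can be illustrated with a minimal sketch of how training pairs might be cut from a token stream. The function names, the `context` parameter, and the toy token list are assumptions made for illustration; they are not taken from the cited paper's code.

```python
from typing import List, Tuple

Pair = Tuple[List[str], str]

def forward_pairs(tokens: List[str], context: int) -> List[Pair]:
    """(prefix context, next token) pairs, as a forwards model would see them."""
    return [(tokens[i - context:i], tokens[i])
            for i in range(context, len(tokens))]

def backward_pairs(tokens: List[str], context: int) -> List[Pair]:
    """(suffix context, previous token) pairs, as a backwards model would see them."""
    return [(tokens[i + 1:i + 1 + context], tokens[i])
            for i in range(len(tokens) - context)]

# Hypothetical toy input; a real corpus would be tokenized source files.
tokens = ["var", "x", "=", "1", ";"]
fwd = forward_pairs(tokens, context=2)
bwd = backward_pairs(tokens, context=2)
```

Each pair holds `context` tokens plus one target token, so a context of 20 tokens corresponds to windows of 21 tokens.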

“…The mental model of the programmer may be something like a language model for speech, but rather applied to code. Language models are typically applied to natural human utterances but they have also been successfully applied to software (Hindle et al, 2012;Raychev et al, 2014;White et al, 2015), and can be used to discover unexpected segments of tokens in source code (Campbell et al, 2014).…”

Section: Introduction (mentioning)

confidence: 99%

“…Thus GrammarGuru uses language models to capture code regularity or naturalness and then looks for irregular code (Campbell et al, 2014). Once the location of a potential error is found, code completion techniques that exploit language models (Hindle et al, 2012;Raychev et al, 2014;White et al, 2015) can be used to suggest possible fixes. Traditional parsers do not rely upon such information.…”

Section: Introduction (mentioning)

confidence: 99%


Finding and correcting syntax errors using recurrent neural networks

Santos, Campbell, Hindle et al. 2017. Preprint.

Minor syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of intuition that help them resolve these tiny errors. Standard LR parsers typically report syntax errors and their precise location poorly. We propose a methodology that not only helps locate where syntax errors occur, but also suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by checking if two language models "agree" on each token. If the models disagree, it indicates a possible syntax error; the methodology tries to suggest a fix by finding an alternative token sequence obtained from the models. We trained two LSTM (long short-term memory) language models on a large corpus of JavaScript code collected from GitHub. The dual LSTM neural network model predicts the correct location of the syntax error 54.74% of the time in its top 4 suggestions and produces an exact fix up to 35.50% of the time. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.
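The idea of two models "agreeing" on each token can be sketched with a toy stand-in. The example below swaps the LSTMs for simple bigram count tables and treats "disagreement" as the two models' top predictions for a position differing; both choices, along with every name in the code, are assumptions for illustration and do not reproduce the paper's actual criterion.

```python
from collections import Counter, defaultdict

def train(sequences):
    """Build toy forward and backward bigram tables (stand-ins for the LSTMs)."""
    fwd = defaultdict(Counter)  # previous token -> counts of the next token
    bwd = defaultdict(Counter)  # next token -> counts of the previous token
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            fwd[prev][nxt] += 1
            bwd[nxt][prev] += 1
    return fwd, bwd

def best(table, ctx):
    """Most frequent prediction for a context, or None if the context is unseen."""
    counts = table.get(ctx)
    return counts.most_common(1)[0][0] if counts else None

def disagreements(tokens, fwd, bwd):
    """Interior positions where the two models predict different tokens."""
    flagged = []
    for i in range(1, len(tokens) - 1):
        from_left = best(fwd, tokens[i - 1])   # forward model: uses the prefix
        from_right = best(bwd, tokens[i + 1])  # backward model: uses the suffix
        if from_left != from_right:            # two unseen contexts count as agreement
            flagged.append(i)
    return flagged

# Hypothetical toy corpus: one well-formed token sequence repeated.
corpus = [["if", "(", "x", ")", "{", "}"]] * 3
fwd, bwd = train(corpus)

valid = ["if", "(", "x", ")", "{", "}"]
broken = ["if", "(", "x", "{", "}"]  # the ")" token is missing
ok_flags = disagreements(valid, fwd, bwd)
bad_flags = disagreements(broken, fwd, bwd)
```

On the broken sequence the flagged positions sit on either side of the missing token, which is the kind of localization signal the abstract describes; a real system would rank candidate locations and then search for a token edit that restores agreement.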


“…Based on empirical work done by White et al (White et al, 2015) we chose a context length τ of 20 tokens. This corresponds to an n-gram length of 21 tokens, as an n-gram traditionally includes both the context and the adjacent token.…”

Section: Training the LSTMs (mentioning)

confidence: 99%

Finding and correcting syntax errors using recurrent neural networks

Santos, Campbell, Hindle et al. 2017. Preprint.


Deep learning the semantics of change sequences for query expansion

Huang, Yang, Cheng. 2019.

The overexpansion problem negatively affects the quality of query expansion. To improve the quality of queries for searching code, this paper proposes a DBN-based algorithm for effective query expansion. The deep belief network (DBN) model is trained on code sequences and their change sequences, with the aim of capturing the meaningful terms that emerge during the evolution of source code. In contrast to previous studies, the proposed model not only extracts relevant terms to expand a query but also excludes irrelevant terms from the query. It addresses two problems in query expansion: the overexpansion of the original query and the negative influence of the changed terms in the target source code. Experiments on both artificial queries and real queries show that the proposed algorithm outperforms several query expansion algorithms for code search.

Keywords: change sequence, code search, deep learning, query expansion, semantics

1. A query q(A 1 ) contains only one term of A, the search engine does not return m AB or m AC as the query contains too few relevant terms.


Toward Deep Learning Software Repositories

Cited by 214 publications

References 49 publications

