2020
DOI: 10.48550/arxiv.2009.08603
Preprint

Towards Full-line Code Completion with Neural Language Models

Cited by 7 publications (12 citation statements)
References 17 publications
“…We use the ETH PY150 python dataset (the standard code completion benchmark) provided by Raychev et al [13] to ensure a fair comparison with prior studies [6], [7], [9], [34]. The dataset is collected from open-source software projects in GitHub repositories with non-viral licenses (e.g.…”
Section: Dataset (mentioning)
confidence: 99%
“…Threats to external validity relate to the degree to which our approach can be generalized across other context. We evaluate our PyCoder with 50,000 python files from PY150 dataset which is the dataset used in many literature [3], [6], [7], [9], [11], [12], [34]. We also evaluate the model with the code completion benchmark in CodeXGLUE [3].…”
Section: Threats To Validity (mentioning)
confidence: 99%
“…Their setting differs from ours in assuming access to a candidate provider of reasonable quality upfront. On the other hand, Svyatkovskiy et al [67] and Wang et al [74] respectively study multilingual and whole line completion using Transformers. Our work is, in every respect, orthogonal to the two aforementioned, as the idea of leveraging fine-tuned relations for better completion is applicable to both settings. Lastly, we comment that ML4Code researchers have borrowed successful ideas in the NLP community such as pretraining large Transformers on heterogeneous datasets for transfer learning and multi-task learning [24,26,23,35].…”
Section: Related Work (mentioning)
confidence: 99%
“…While early research focused mostly on narrow API-level completion [5,15,32], modern language models based on neural networks vary from fine-grained, using every possible lexical token type (delimiters, operators, white spaces, keywords, etc.) [14], to coarse-grained, predicting entire lines of code [8,45].…”
Section: Introduction (mentioning)
confidence: 99%
“…In practice, over time, effective token-level code completion can save the users a lot of effort. However, our approach is easy to extend to other types of completion, and we leave applying the usage of logs for the full-line version of code completion [45] for subsequent work.…”
Section: Introduction (mentioning)
confidence: 99%