Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, readability improvement, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open-vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code reported to date. All datasets, code, and trained models used in this work are publicly available.
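An open-vocabulary code NLM typically rests on subword segmentation: rare identifiers decompose into frequent pieces instead of becoming out-of-vocabulary tokens, which keeps the vocabulary small and closed. The following is a minimal sketch of that idea using a BPE tokenizer from the Hugging Face `tokenizers` library; the corpus file, vocabulary size, and example identifier are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: train a BPE subword tokenizer on a code corpus so that
# unseen identifiers split into known subword units instead of <UNK>.
# "java_corpus.txt" and the 10k vocabulary size are placeholder choices.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=10_000, special_tokens=["[UNK]"])
tokenizer.train(files=["java_corpus.txt"], trainer=trainer)

# A rare identifier is segmented into frequent subwords, e.g. something
# like ["get", "File", "Handle", "For", "Path"], depending on the merges
# learned from the corpus.
print(tokenizer.encode("getFileHandleForPath").tokens)
```

Because every character sequence can be expressed as some composition of subwords, the model's effective vocabulary is open even though its token inventory is fixed.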
Automated Program Repair (APR) is undergoing a transition to data-driven techniques, in particular deep neural networks. This entails training on hundreds of thousands or even millions of non-executable code fragments. We would like to bring more attention to an aspect of code often neglected in Neural Program Repair (NPR), namely its execution. Code execution has several significant advantages: it allows for test-based evaluation of candidate fixes and can provide valuable information to aid repair. In this work, we present a fully executable dataset of 450,000 small buggy/fixed program pairs, written in eight different programming languages and originally submitted to programming competition websites. Along with the dataset, we provide infrastructure to compile, safely execute, and test programs, as well as fine-grained bug-type labels. To give a point of reference, we provide basic evaluation results for two baselines, one based on a generate-and-validate approach and one on deep learning. With this dataset we pursue several goals: we want to lift Neural Program Repair beyond fully static code representations, foster the use of execution-based features, and, by including several different languages, counterbalance the predominance of Java in the current landscape of APR datasets and benchmarks.

Keywords: automated program repair, data-driven software engineering, fault localization

Recently, more and more APR research has built on deep learning, so much so that this sub-field has been given its own name: Neural Program Repair (NPR). NPR systems are trained on up to millions of buggy/fixed code fragment pairs. So far, there is a strong focus on static code features, in particular textual features [13, 47, 32] and, less commonly, tree or graph representations [43, 17]. Because NPR systems are data-hungry and manually collecting and isolating bugs is infeasible at large scale, copious amounts of bug data are mined from open-source code repositories (e.g., GitHub) [71].
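The generate-and-validate baseline mentioned above follows a simple loop: produce candidate fixes, compile each one, and accept the first that passes the tests. Below is a minimal sketch of such a loop for C programs; the candidate generator, file layout, and (stdin, expected-output) test format are illustrative assumptions, not the dataset's actual infrastructure.

```python
# Minimal generate-and-validate sketch: compile each candidate fix and
# keep the first one that passes all I/O tests. Timeouts guard against
# non-terminating candidates; names and test format are hypothetical.
import subprocess
import tempfile
from pathlib import Path

def passes_tests(binary: Path, tests: list[tuple[str, str]]) -> bool:
    """Run the compiled candidate on each (stdin, expected-stdout) pair."""
    for given_input, expected in tests:
        try:
            result = subprocess.run(
                [str(binary)], input=given_input, capture_output=True,
                text=True, timeout=5,  # kill runaway executions
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def first_plausible_fix(candidates: list[str], tests) -> str | None:
    """Return the first candidate C source that compiles and passes all tests."""
    for source in candidates:
        with tempfile.TemporaryDirectory() as tmp:
            src, binary = Path(tmp, "fix.c"), Path(tmp, "fix")
            src.write_text(source)
            compiled = subprocess.run(
                ["gcc", "-o", str(binary), str(src)], capture_output=True)
            if compiled.returncode == 0 and passes_tests(binary, tests):
                return source  # test-passing ("plausible") fix found
    return None
```

Note that a fix accepted by this loop is only plausible, not necessarily correct: it passes the available tests, which is exactly why execution-based evaluation and richer test suites matter.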
Many software engineering studies and tasks rely on categorizing software engineering artifacts. In practice, this is done either by defining simple but often imprecise heuristics, or by manually labelling the artifacts. Unfortunately, errors in these categorizations impact the tasks that rely on them. To improve the precision of these categorizations, we propose to gather heuristics in a collaborative heuristic repository, to which researchers can contribute a large number of diverse heuristics for a variety of tasks on a variety of SE artifacts. These heuristics are then leveraged by state-of-the-art weak supervision techniques to train high-quality classifiers, thus improving the categorizations. We present an initial version of the heuristic repository, which we applied to the concrete task of commit classification.
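In weak supervision, such heuristics are usually expressed as labeling functions: small programs that vote on an artifact's label or abstain. Here is a minimal sketch for commit classification; the label set, keyword lists, and the simple majority-vote combiner are illustrative assumptions, and in practice a probabilistic label model (e.g., Snorkel's) would aggregate the votes instead.

```python
# Minimal sketch of heuristic labeling functions for commit classification.
# Each function votes "bugfix", "feature", or abstains (None); votes are
# combined here by majority, standing in for a proper label model.
from collections import Counter

BUGFIX, FEATURE, ABSTAIN = "bugfix", "feature", None

def lf_fix_keyword(message: str):
    return BUGFIX if any(k in message.lower() for k in ("fix", "bug", "patch")) else ABSTAIN

def lf_feature_keyword(message: str):
    return FEATURE if any(k in message.lower() for k in ("add", "implement", "feature")) else ABSTAIN

def lf_issue_reference(message: str):
    # Hypothetical heuristic: commits that close an issue are often bug fixes.
    return BUGFIX if "closes #" in message.lower() else ABSTAIN

HEURISTICS = [lf_fix_keyword, lf_feature_keyword, lf_issue_reference]

def weak_label(message: str):
    """Combine heuristic votes; abstain when no heuristic fires."""
    votes = [lf(message) for lf in HEURISTICS if lf(message) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(weak_label("Fix off-by-one bug in parser, closes #42"))  # -> bugfix
```

The appeal of the repository idea is that each labeling function can be individually imprecise; the aggregation step estimates and compensates for their accuracies, yielding training labels of higher quality than any single heuristic.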