Public git archive

Markovtsev, Vadim; Long, Waren

doi:10.1145/3196398.3196464

Cited by 31 publications

(2 citation statements)

References 22 publications


“…GitHub [35] is popular for collecting large volumes of code data [23, 36–38]. Unlike proprietary data, open-source code is not reliably high-quality. Open-source data is therefore only included in the training split of the SKILL dataset, not in the evaluation splits.…”

Section: Open-source Skill Data (mentioning)

confidence: 99%

A machine learning approach towards SKILL code autocompletion

Dehaerne, Dey, Meert et al. 2024

DTCO and Computational Patterning III


As Moore's Law continues to increase the complexity of electronic systems, Electronic Design Automation (EDA) must advance to meet global demand. An important example of an EDA technology is SKILL, a scripting language used to customize and extend EDA software. Recently, code generation models using the transformer architecture have achieved impressive results in academic settings and have even been used in commercial developer tools to improve developer productivity. To the best of our knowledge, this study is the first to apply transformers to SKILL code autocompletion towards improving the productivity of hardware design engineers. In this study, a novel, data-efficient methodology for generating SKILL code is proposed and experimentally validated. More specifically, we propose a novel methodology for (i) creating a high-quality SKILL dataset with both unlabeled and labeled data, (ii) a training strategy where T5 models pre-trained on general programming language code are fine-tuned on our custom SKILL dataset using self-supervised and supervised learning, and (iii) evaluating synthesized SKILL code. We show that models trained using the proposed methodology outperform baselines in terms of human-judgment score and BLEU score. A major challenge faced was the extremely small amount of available SKILL code data that can be used to train a transformer model to generate SKILL code. Despite our validated improvements, the extremely small dataset available to us was still not enough to train a model that can reliably autocomplete SKILL code. We discuss this and other limitations as well as future work that could address these limitations.



“…Six more papers mentioned that their dataset did not include user names and email addresses and/or how privacy was ensured. Markovtsev and Long (2018) discuss how their dataset complies with GitHub terms and conditions.…”

Section: Data Showcase (mentioning)

confidence: 99%

Ethics in the mining of software repositories

Gold, Krinke 2021

Empir Software Eng


Research in Mining Software Repositories (MSR) is research involving human subjects, as the repositories usually contain data about developers’ and users’ interactions with the repositories and with each other. The ethics issues raised by such research therefore need to be considered before beginning. This paper presents a discussion of ethics issues that can arise in MSR research, using the mining challenges from the years 2006 to 2021 as a case study to identify the kinds of data used. On the basis of contemporary research ethics frameworks we discuss ethics challenges that may be encountered in creating and using repositories and associated datasets. We also report some results from a small community survey of approaches to ethics in MSR research. In addition, we present four case studies illustrating typical ethics issues one encounters in projects and how ethics considerations can shape projects before they commence. Based on our experience, we present some guidelines and practices that can help in considering potential ethics issues and reducing risks.


Learning Based Methods for Code Runtime Complexity Prediction

Sikka, Satya, Kumar et al. 2020

Lecture Notes in Computer Science


Predicting the runtime complexity of a programming code is an arduous task. In fact, even for humans, it requires a subtle analysis and comprehensive knowledge of algorithms to predict time complexity with high fidelity, given any code. As per Turing's Halting problem proof, estimating code complexity is mathematically impossible. Nevertheless, an approximate solution to such a task can help developers to get real-time feedback for the efficiency of their code. In this work, we model this problem as a machine learning task and check its feasibility with thorough analysis. Due to the lack of any open source dataset for this task, we propose our own annotated dataset CoRCoD: Code Runtime Complexity Dataset, extracted from online judges. We establish baselines using two different approaches: feature engineering and code embeddings, to achieve state of the art results and compare their performances. Such solutions can be widely useful in potential applications like automatically grading coding assignments, IDE-integrated tools for static code analysis, and others.

