One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation-an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information. Building the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code. In this paper, we present PathMiner-an open-source library for mining path-based representations of code. Path-Miner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language.
Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-theart results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.
CCS CONCEPTS• Software and its engineering → Software maintenance tools; Software verification and validation; • Security and privacy → Malware and its mitigation.
In software engineering, it is not enough to simply write code that only works as intended, even if it is free from vulnerabilities and bugs. Every programming language has a style guide and a set of best practices defined by its community, which help practitioners to build solutions that have a clear structure and therefore are easy to read and maintain. To introduce assessment of code quality into the educational process, we developed a tool called Hyperstyle. To make it reflect the needs of the programming community and at the same time be easily extendable, we built it upon several existing professional linters and code checkers. Hyperstyle supports four programming languages (Python, Java, Kotlin, and Javascript) and can be used as a standalone tool or integrated into a MOOC platform. We have integrated the tool into two educational platforms, Stepik and JetBrains Academy, and it has been used to process about one million submissions every week since May 2021.
CCS CONCEPTS• Social and professional topics → Student assessment; • Applied computing → Education; • Human-centered computing → Interactive systems and tools.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.