Yutaro Kashiwa scite author profile

The accuracy of the SZZ algorithm is pivotal for just-in-time defect prediction because most prior studies have used the SZZ algorithm to detect defect-inducing commits to construct and evaluate their defect prediction models. The SZZ algorithm has two phases to detect defect-inducing commits: (1) linking issue reports in an issue-tracking system to possible defect-fixing commits in a version control system by using an issue-link algorithm (ILA); and (2) tracing the modifications of defect-fixing commits back to possible defect-inducing commits. Researchers and practitioners can address the second phase by using existing solutions such as a tool called . In contrast, although various ILAs have been proposed for the first phase, no large-scale studies exist in which such ILAs are evaluated under the same experimental conditions. Hence, we still have no conclusions regarding the best-performing ILA for the first phase. In this paper, we compare 10 ILAs collected from our systematic literature study with regards to the accuracy of detecting defect-fixing commits. In addition, we compare the defect prediction performance of ILAs and their combinations that can detect defect-fixing commits accurately. We conducted experiments on five open-source software projects. We found that all ILAs and their combinations prevented the defect prediction model from being affected by missing defect-fixing commits. In particular, the combination of a natural language text similarity approach, Phantom heuristics, a random forest approach, and a support vector machine approach is the best way to statistically significantly reduced the absolute differences from the ground-truth defect prediction performance. We summarized the guidelines to use ILAs as our recommendations.

show abstract

Hierarchical Clustering of OSS License Statements toward Automatic Generation of License Rules

Higashi

Ohira

Kashiwa

et al. 2019

Journal of Information Processing

View full text Add to dashboard Cite

Reusing open source software (OSS) components for one's own software products has become common in the modern software development. Automated license identification tools have been proposed to help developers identify OSS licenses, since a large number of licenses sometimes must be checked before attempting to reuse. Of the existing tools, Ninka [1] can most correctly identify licenses of each source file by using regular expressions. In case Ninka does not have license identification rules for unknown licenses, Ninka reports these as "unknown licenses" which must be checked by developers manually. Since completely-new or derived OSS licenses appear nearly every year, a license identification tool should be appropriately maintained by adding regular expressions corresponding to the new licenses. The final goal of our study is to construct a method to automatically create candidate license rules to be added to a license identification tool such as Ninka. Toward achieving the goal, files identified as unknown licenses must be classified by license firstly. In this paper, we propose a hierarchical clustering which divides unknown licenses into clusters of files with a single license. We conduct a case study to confirm the usefulness of our clustering method when it is applied for classifying 2,801, 1,230 and 2,446 unknown license statement files for Linux Kernel v4.4.6, FreeBSD v10.3.0 and Debian v7.8.0 respectively. As a result, it is confirmed that our method can create clusters which are suitable as candidates for generating license rules automatically.

show abstract

PyRef: Refactoring Detection in Python Projects

Atwi¹,

Lin²,

Tsantalis³

et al. 2021

Preprint

View full text Add to dashboard Cite

An empirical study on self-admitted technical debt in modern code review

Kashiwa¹,

Nishikawa²,

Kamei³

et al. 2022

Information and Software Technology

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yutaro Kashiwa

A Dataset of High Impact Bugs: Manually-Classified Issue Reports

An empirical study of issue-link algorithms: which issue-link algorithms should we use?

Hierarchical Clustering of OSS License Statements toward Automatic Generation of License Rules

PyRef: Refactoring Detection in Python Projects

An empirical study on self-admitted technical debt in modern code review

Contact Info

Product

Resources

About