Reusing open source software (OSS) components in one's own software products has become common in modern software development. Automated license identification tools have been proposed to help developers identify OSS licenses, since a large number of licenses must sometimes be checked before components can be reused. Of the existing tools, Ninka [1] identifies the license of each source file most accurately, using regular expressions. When Ninka has no identification rule for a license, it reports the license as an "unknown license", which developers must then check manually. Since completely new or derived OSS licenses appear nearly every year, a license identification tool should be maintained by adding regular expressions corresponding to the new licenses. The final goal of our study is to construct a method that automatically creates candidate license rules to be added to a license identification tool such as Ninka. Toward that goal, files identified as having unknown licenses must first be classified by license. In this paper, we propose a hierarchical clustering method that divides unknown licenses into clusters of files with a single license. We conduct a case study to confirm the usefulness of our clustering method by applying it to classify 2,801, 1,230 and 2,446 unknown license statement files from Linux Kernel v4.4.6, FreeBSD v10.3.0 and Debian v7.8.0, respectively. The results confirm that our method can create clusters that are suitable as candidates for generating license rules automatically.
Reusing open source software (OSS) components in one's own software products has become common in modern software development. Automated license identification tools have been proposed to help developers identify OSS licenses, since a large number of licenses must sometimes be checked before components can be reused. Of the existing tools, Ninka [1] identifies the license of each source file most accurately, using regular expressions. When Ninka has no identification rule for a license, it reports the license as an "unknown license", which developers must then check manually. Since completely new or derived OSS licenses appear nearly every year, a license identification tool should be maintained by adding regular expressions corresponding to the new licenses. The final goal of our study is to construct a method that automatically creates candidate license rules to be added to a license identification tool such as Ninka. Toward that goal, files identified as having unknown licenses must first be classified by license. In this paper, we propose a hierarchical clustering method that divides unknown licenses into clusters of files with a single license. We conduct a case study to confirm the usefulness of our clustering method by applying it to classify 2,838 unknown license files from Debian v7.8.0. The results confirm that our method can create clusters that are suitable as candidates for generating license rules automatically.
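To make the clustering idea concrete, the following is a minimal sketch of agglomerative (bottom-up hierarchical) clustering over license statements. The abstract does not state which distance measure, linkage criterion, or stopping rule the authors use, so the Jaccard distance on word sets, the average linkage, and the `threshold` parameter below are all assumptions chosen for illustration, as are the function names.

```python
import re
from itertools import combinations


def tokens(text):
    """Return the set of lowercase alphanumeric words in a statement."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster(statements, threshold=0.5):
    """Average-linkage agglomerative clustering (illustrative assumption).

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`. Returns lists of indices into
    `statements`.
    """
    toks = [tokens(s) for s in statements]
    clusters = [[i] for i in range(len(statements))]

    def dist(c1, c2):
        total = sum(jaccard_distance(toks[i], toks[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # combinations() yields index pairs (x, y) with x < y.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        if dist(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


statements = [
    "Permission is hereby granted, free of charge, to any person obtaining a copy",
    "Permission is hereby granted, free of charge, to any person obtaining a copy of this software",
    "Redistribution and use in source and binary forms are permitted",
    "Redistribution and use in source and binary forms, with or without modification, are permitted",
]
groups = cluster(statements)
```

With these four example statements, the two "Permission is hereby granted" variants end up in one cluster and the two "Redistribution and use" variants in another. In practice the threshold would need tuning so that each cluster contains files with a single license.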
Many license identification tools have been proposed to support OSS reuse. License identification tools automatically identify OSS licenses declared in source files. Ninka is one of the most accurate license identification tools. Because OSS licenses are often newly created or derived from existing ones, the rules built into license identification tools need to be created and updated on a regular basis. However, when a large number of unknown licenses are detected in large OSS products, it is not easy to create new rules manually. In our previous studies, we proposed a method for clustering license statements that Ninka determined to be unknown. In this paper, we propose a method to automatically generate license rules from the clustered license statements. Our approach further filters the license statements in the created clusters, extracts sequential patterns from them, and converts the extracted patterns into regular expressions. To confirm the usefulness of our method, we conducted a case study in which it was applied to 1,821, 3,561 and 2,838 unknown license statement files collected from FreeBSD v10.3.0, Linux Kernel v4.4.6, and Debian v7.8.0, respectively. As a result, we confirmed that our method successfully generated license rules that take orthographical variants into consideration and that it efficiently identified licenses with a small number of license rules. Furthermore, we found that adding the license rules generated by our method to Ninka improves the performance of its license rules.
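The step of turning a pattern shared by several license statements into a regular expression can be sketched as follows. This is not the authors' algorithm: the abstract does not describe how sequential patterns are mined, so this sketch substitutes Python's `difflib.SequenceMatcher` to find the token runs two statements have in common, and the helper names `common_blocks` and `blocks_to_regex` are hypothetical. Runs of shared tokens are joined with flexible whitespace, and the positions where the statements differ become a lazy gap that skips any number of tokens.

```python
import re
from difflib import SequenceMatcher

# Whitespace followed by zero or more skipped tokens (lazy).
GAP = r"\s+(?:\S+\s+)*?"


def common_blocks(a, b):
    """Return the runs of whitespace-separated tokens shared by a and b, in order."""
    ta, tb = a.split(), b.split()
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    return [ta[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


def blocks_to_regex(blocks):
    """Compile token runs into one regex; gaps between runs match any tokens."""
    if not blocks:
        raise ValueError("no common tokens to build a rule from")
    parts = [r"\s+".join(re.escape(tok) for tok in block) for block in blocks]
    # The lookarounds keep the match aligned to whole tokens.
    return re.compile(r"(?<!\S)" + GAP.join(parts) + r"(?!\S)")


a = ("This program is free software; you can redistribute it under "
     "the terms of the GNU General Public License version 2")
b = ("This file is free software; you can redistribute it under "
     "the terms of the GNU General Public License version 3")

rule = blocks_to_regex(common_blocks(a, b))
```

The resulting rule matches both input statements and also a statement that replaces "program" with another word or wraps the text across lines, which is one simple way of tolerating variants. A real rule generator would have to combine more than two statements per cluster and decide which differing tokens (such as the version number) should be captured rather than skipped.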