Hierarchical Clustering of OSS License Statements toward Automatic Generation of License Rules

Higashi, Yunosuke; Ohira, Masaichi; Kashiwa, Yutaro; Manabe, Yuki

doi:10.2197/ipsjjip.27.42

Cited by 1 publication

(8 citation statements)

References 11 publications


“…Figure 2 shows an overview of the proposed method consisting of the following 5 major processes. 3.1 First, the clustering method of our previous studies [2], [3] is used to classify the unknown license statements according to the similarity of the word vectors (Bag-of-Words). 3.2 Next, the license statements that cannot be classified due to slight differences in minor versions are filtered out as outliers by the similarity based on the Levenshtein distance.…”

Section: Proposed Methods (citation type: mentioning)

confidence: 99%
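The two similarity measures named in the statement above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the choice of the first statement as the reference, the length-normalized similarity, and the `threshold` value of 0.8 are all assumptions made for this example.

```python
from collections import Counter
import math


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the Bag-of-Words vectors of two statements."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def levenshtein(s: str, t: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_similarity(s: str, t: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def filter_outliers(cluster, threshold=0.8):
    """Split a cluster into (kept, outliers) relative to its first statement.

    Both the reference choice and the threshold are hypothetical.
    """
    reference = cluster[0]
    kept, outliers = [], []
    for statement in cluster:
        if normalized_similarity(reference, statement) >= threshold:
            kept.append(statement)
        else:
            outliers.append(statement)
    return kept, outliers
```

A character-level distance is sensitive to small edits such as a changed version number, which is why it complements the coarser word-vector comparison.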

“…(Step 3) Generating license rules: Tokenize license statements as regular expressions and generate license rules that can be matched to new licenses. In our previous study [2], [3], we proposed a clustering method to classify the license statements of detected unknown licenses to automate Step 1. In this paper, we focus on Step 2 and Step 3 to automatically generate license rules from each cluster created by the clustering method.…”

Section: (Step 1) Grouping Source Files With Unknown Licenses (citation type: mentioning)

confidence: 99%

“…The goal of our study is to automatically generate candidate license rules to support the creation of license rules. In our previous study [3], we proposed a clustering method to classify unknown license statements by license name. In this paper, we extract expression patterns from clusters classified by our clustering method and generate regular expressions.…”

Section: Technical Challenges (citation type: mentioning)

confidence: 99%

“…In our previous study [2], [3], we proposed a clustering method to automate the grouping of license statements in Step (I) described in Section 1. It consists of the three parts as follows.…”

Section: Hierarchical Clustering of OSS License Statements (citation type: mentioning)

confidence: 99%

“…In the previous study [2], [3], we calculated and evaluated the percentage of clusters consisting of a single license. The ratio of the single license clusters was high enough for FreeBSD v10.3.0 (91.7%) and Linux Kernel v4.4.6 (90.7%) respectively, but it was not so high for Debian v7.8.0 (69.1%).…”

Section: H(c P ) > H(s C I ) and H(c (citation type: mentioning)

confidence: 99%
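The single-license cluster ratio quoted above can be computed as in the sketch below. The input layout (one list of license labels per cluster) and the function name `single_license_ratio` are assumptions for illustration; the cited study may define the measure over different inputs.

```python
def single_license_ratio(clusters):
    """Fraction of clusters whose members all carry the same license label.

    `clusters` is a list of lists of labels (hypothetical layout);
    empty clusters are counted in the denominator but never as single-license.
    """
    if not clusters:
        return 0.0
    single = sum(1 for labels in clusters if len(set(labels)) == 1)
    return single / len(clusters)
```

For instance, three clusters of which two contain only one license label give a ratio of 2/3.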


Automating License Rule Generation to Help Maintain Rule-based OSS License Identification Tools

Higashi; Ohira; Manabe. Journal of Information Processing, 2023. (Self-citation)


Many license identification tools have been proposed to support OSS reuse. License identification tools automatically identify OSS licenses declared in source files. Ninka is one of the most accurate license identification tools. Because OSS licenses are often newly created or inherited, rules built into license identification tools need to be created and updated on a regular basis. However, when a large number of unknown licenses are detected in large OSS products, it is not easy to manually create new rules. In our previous studies, we proposed a method for clustering license statements that Ninka determined to be unknown. In this paper, we propose a method to automatically generate license rules from the clustered license statements. Our approach further filters the license statements from the created clusters to extract sequential patterns and converts the extracted patterns into regular expressions. We conducted a case study in which our method is applied to 1,821, 3,561, and 2,838 unknown license statement files collected from FreeBSD v10.3.0, Linux Kernel v4.4.6, and Debian v7.8.0, respectively, to confirm the usefulness of our method. As a result, we confirmed that our method successfully generated license rules that take into consideration the orthographical variants and that our method also efficiently identified licenses with a small number of license rules. Furthermore, we found that adding the license rules generated by our method to Ninka improves the licensing rule performance.
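The idea of turning a cluster of similar statements into a regular expression can be illustrated with the sketch below. It is not the method of the paper: a longest common subsequence (LCS) of word tokens is used here as a simple stand-in for sequential pattern extraction, and the names `lcs`, `tokens`, and `rule_from_cluster` as well as the gap pattern are hypothetical.

```python
import re
from functools import reduce


def tokens(statement: str):
    """Lowercased word tokens; punctuation is dropped."""
    return re.findall(r"\w+", statement.lower())


def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def rule_from_cluster(statements):
    """Build one regex that matches the tokens shared by every statement, in order."""
    common = reduce(lcs, (tokens(s) for s in statements))
    if not common:
        raise ValueError("statements share no common token sequence")
    # Between shared tokens, allow punctuation and any number of extra words.
    gap = r"\W+(?:\w+\W+)*?"
    body = gap.join(rf"\b{re.escape(t)}\b" for t in common)
    return re.compile(body, re.IGNORECASE)
```

Applied to two Apache-style variants, the resulting rule tolerates an inserted word or changed punctuation while still rejecting a statement that lacks one of the shared tokens:

```python
rule = rule_from_cluster([
    "Licensed under the Apache License, Version 2.0",
    "licensed under the Apache Software License version 2.0",
])
print(bool(rule.search("Licensed under the Apache Software License, Version 2.0")))
print(bool(rule.search("Licensed under the MIT License")))
```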

