Automatic Detection of Public Development Projects in Large Open Source Ecosystems: An Exploratory Study on GitHub

Cheng, Can; Li, Bing; Li, Zengyang; Liang, Peng

doi:10.18293/seke2018-085

Cited by 6 publications

(14 citation statements)

References 23 publications

“…It uses greedy strategy to generate decision trees. We selected this method because it has been tested to be effective in selecting PDPs [4]. Logistic Regression (LR) .…”

Section: Methods (mentioning)

confidence: 99%

“…Peril 4 still needs to be solved by researchers manually. In addition, peril 5 cannot be effectively solved by the corresponding strategy because in our previous work [4] we tested this strategy and found that this strategy cannot select PDPs with a high recall, which means that if researchers use the committer number to select project samples, they will miss many PDPs. Hence, if researchers do not want to select projects that are personal or projects that are not built for development, they have to spend considerable human effort to select samples manually.…”

Section: Related Work (mentioning)

confidence: 99%

“…Method 4 is a good choice in selecting a large number of PDPs [4]. Compared with the base line methods, Method 4 uses the J48 decision tree algorithm and the project description features.…”

Section: Sample Selection Process in Studies on GHTorrent (mentioning)

confidence: 99%

“…blogs, projects that store the list of popular websites). That is, this method has a low precision (lower than 0.700) in selecting PDPs [4].…”

Section: Related Work (mentioning)

confidence: 99%

“…In our previous study [4], we proposed a machine learning‐based approach as an initial solution to this problem. Specifically, we first labelled 6369 project samples on whether these projects are PDPs or not; then, we added the basic features (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

An in‐depth study of the effects of methods on the dataset selection of public development projects

Cheng et al., 2021, IET Software (self-citation)

Public development projects (PDPs) and documented public development projects (DPDPs) are two types of projects that can provide valuable information on how developers and users participate in OSS projects. However, it is hard for researchers to effectively select PDPs and DPDPs due to the lack of specific project selection methods for these two types of projects. To address this problem, a standard dataset was labelled, and the baseline methods (i.e. selecting projects according to a single feature such as star number) under 60 configurations and the machine learning methods under 18 configurations were tested to identify the best configurations in precision and F-measure for selecting PDPs and DPDPs. The results show that (1) to select PDPs or DPDPs with a high precision, the baseline method is the best, with a precision of 0.877 (PDPs) and 0.831 (DPDPs); (2) to select PDPs or DPDPs with a high F-measure, the machine learning methods are the best, with an F-measure of 0.817 (PDPs) and 0.789 (DPDPs); (3) existing sample selection strategies can be combined with the machine learning methods, and the precision of selecting PDPs can be increased by 6.39%-41.33% and the precision of selecting DPDPs can be increased by 35.50%-269.02%.
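To make the evaluation terms in the abstract concrete, the following is a minimal sketch of how a single-feature baseline (here, a star-count threshold) could be scored against manually labelled projects using precision, recall, and F-measure. The `Project` class, the threshold of 50 stars, and the six toy projects are hypothetical and purely illustrative; they are not taken from the study's dataset or configurations.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str      # hypothetical project identifier
    stars: int     # single feature used by the baseline
    is_pdp: bool   # manual ground-truth label (PDP or not)


def select_by_stars(projects, min_stars):
    """Baseline: keep every project whose star count reaches the threshold."""
    return [p for p in projects if p.stars >= min_stars]


def precision_recall_f1(projects, selected):
    """Score a selection against the labels of the full project list."""
    selected_names = {p.name for p in selected}
    tp = sum(1 for p in projects if p.is_pdp and p.name in selected_names)
    fp = len(selected_names) - tp
    fn = sum(1 for p in projects if p.is_pdp and p.name not in selected_names)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data, invented for illustration only.
TOY_PROJECTS = [
    Project("a", 120, True),
    Project("b", 80, True),
    Project("c", 3, True),     # a PDP with few stars: missed by the baseline
    Project("d", 200, False),  # a popular non-PDP: selected by mistake
    Project("e", 1, False),
    Project("f", 0, False),
]

if __name__ == "__main__":
    chosen = select_by_stars(TOY_PROJECTS, min_stars=50)
    p, r, f1 = precision_recall_f1(TOY_PROJECTS, chosen)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy setting, raising the threshold tends to trade recall for precision, which is the kind of trade-off the abstract reports between the baseline and machine learning methods.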
