Proceedings of the 17th International Conference on Mining Software Repositories 2020
DOI: 10.1145/3379597.3387496

A Dataset for GitHub Repository Deduplication

Abstract: GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of …
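
The abstract mentions that related projects were calculated as connected components. The following is a minimal sketch of that idea, assuming a hypothetical edge list of repository pairs known to be copies of one another (the repository names and field layout below are illustrative assumptions, not the dataset's actual schema): duplicates are grouped with union-find, and each group receives one representative.

```python
# Sketch: group copied repositories into connected components with union-find.
# The edge list and repository names are hypothetical examples.
from collections import defaultdict


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


# Hypothetical (repository, copy-of-repository) pairs.
clone_edges = [
    ("alice/tool", "bob/tool"),
    ("bob/tool", "carol/tool-fork"),
    ("dave/lib", "erin/lib"),
]

uf = UnionFind()
for a, b in clone_edges:
    uf.union(a, b)

# Collect each repository under its component representative.
components = defaultdict(set)
for repo in {r for edge in clone_edges for r in edge}:
    components[uf.find(repo)].add(repo)

for root, members in components.items():
    print(root, sorted(members))
```

In the published dataset, the representative of each component is the "ultimate parent" chosen by ranking the member projects along six metrics rather than arbitrarily, as the abstract states.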

Cited by 27 publications (13 citation statements); references 35 publications.

“…Finally, our set of projects still contains duplicate repositories that originate from manual clones pushed to a different repository rather than using the fork mechanic recorded in the GHTorrent data. Removing these clones is an important challenge when selecting repositories for analysis, and independent data sets listing duplicate repositories have been developed [50]. Unfortunately, these data sets were not yet available at the time of our analysis.…”
Section: Population Validity (mentioning)
confidence: 99%
“…Other than the stated selection criteria, we did not perform any other manual adjustments to add popular projects or exclude obscure ones. We ensured that our data set did not include duplicated projects by pairing it with a dataset for GitHub repository deduplication (Spinellis, Kotti & Mockus, 2020). From the one duplicate and two triplicate sets we thus found, we retained the repositories with the longer commit history.…”
Section: Programming Language (mentioning)
confidence: 99%
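
The deduplication step this citing study describes can be sketched as follows. This is an illustrative assumption of the workflow, not the study's actual code: the mapping excerpt, file layout, repository names, and commit counts are hypothetical. A project sample is paired with a copy-to-ultimate-parent mapping, and within each group of copies the repository with the longest commit history is retained.

```python
# Sketch: deduplicate a project sample using a copy -> ultimate-parent mapping.
# All names, counts, and the CSV layout are hypothetical.
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of the mapping, one "copy,ultimate_parent" pair per line.
mapping_csv = io.StringIO(
    "bob/tool,alice/tool\n"
    "carol/tool-fork,alice/tool\n"
    "erin/lib,dave/lib\n"
)
parent_of = {copy: parent for copy, parent in csv.reader(mapping_csv)}

# Hypothetical project sample as (repository, number_of_commits) pairs.
sample = [("alice/tool", 420), ("bob/tool", 87), ("erin/lib", 1500), ("dave/lib", 900)]

# Group each sampled repository under its ultimate parent; repositories
# absent from the mapping are treated as their own parent.
groups = defaultdict(list)
for repo, commits in sample:
    groups[parent_of.get(repo, repo)].append((repo, commits))

# Within each group of copies, keep the repository with the most commits.
deduplicated = sorted(max(members, key=lambda rc: rc[1])[0] for members in groups.values())
print(deduplicated)  # ['alice/tool', 'erin/lib']
```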
“…Datasets that only focus on source code also exist, such as Boa, a dataset of queryable Java AST presented by Dyer et al [29]. Spinellis et al [30] focus on identifying duplicated repositories on GitHub.…”
Section: Related Work (mentioning)
confidence: 99%