Proceedings of the 15th International Conference on Mining Software Repositories 2018
DOI: 10.1145/3196398.3196464

Public git archive

Abstract: The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive, a dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel dat…

citations
Cited by 31 publications
(2 citation statements)
references
References 22 publications
“…GitHub [35] is popular for collecting large volumes of code data [23, 36-38]. Unlike proprietary data, open-source code is not reliably high-quality. Open-source data is therefore only included in the training split of the SKILL dataset, not in the evaluation splits.…”
Section: Open-source Skill Data
Citation type: mentioning
confidence: 99%
“…Six more papers mentioned that their dataset did not include user names and email addresses and/or how privacy was ensured. Markovtsev and Long (2018) discuss how their dataset complies with GitHub terms and conditions.…”
Section: Data Showcase
Citation type: mentioning
confidence: 99%