Proceedings of the 17th International Conference on Mining Software Repositories 2020
DOI: 10.1145/3379597.3387496

A Dataset for GitHub Repository Deduplication

Abstract: GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of …
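
The abstract mentions that related projects were calculated as connected components. The following is a minimal sketch of that idea, assuming a hypothetical edge list of repository pairs known to be copies of one another (the repository names and field layout below are illustrative assumptions, not the dataset's actual schema): duplicates are grouped with union-find, and each group receives one representative.

```python
# Sketch: group copied repositories into connected components with union-find.
# The edge list and repository names are hypothetical examples.
from collections import defaultdict


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


# Hypothetical (repository, copy-of-repository) pairs.
clone_edges = [
    ("alice/tool", "bob/tool"),
    ("bob/tool", "carol/tool-fork"),
    ("dave/lib", "erin/lib"),
]

uf = UnionFind()
for a, b in clone_edges:
    uf.union(a, b)

# Collect each repository under its component representative.
components = defaultdict(set)
for repo in {r for edge in clone_edges for r in edge}:
    components[uf.find(repo)].add(repo)

for root, members in components.items():
    print(root, sorted(members))
```

In the published dataset, the representative of each component is the "ultimate parent" chosen by ranking the member projects along six metrics rather than arbitrarily, as the abstract states.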

Cited by 27 publications (13 citation statements); references 35 publications.

“…Finally, our set of projects still contains duplicate repositories that originate from manual clones pushed to a different repository rather than using the fork mechanic recorded in the GHTorrent data. Removing these clones is an important challenge when selecting repositories for analysis, and independent data sets listing duplicate repositories have been developed [50]. Unfortunately, these data sets were not yet available at the time of our analysis.…”
Section: Population Validity (mentioning)
confidence: 99%
“…Other than the stated selection criteria, we did not perform any other manual adjustments to add popular projects or exclude obscure ones. We ensured that our data set did not include duplicated projects by pairing it with a dataset for GitHub repository deduplication (Spinellis, Kotti & Mockus, 2020). From the one duplicate and two triplicate sets we thus found, we retained the repositories with the longer commit history.…”
Section: Programming Language (mentioning)
confidence: 99%
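
The deduplication step this citing study describes can be sketched as follows. This is an illustrative assumption of the workflow, not the study's actual code: the mapping excerpt, file layout, repository names, and commit counts are hypothetical. A project sample is paired with a copy-to-ultimate-parent mapping, and within each group of copies the repository with the longest commit history is retained.

```python
# Sketch: deduplicate a project sample using a copy -> ultimate-parent mapping.
# All names, counts, and the CSV layout are hypothetical.
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of the mapping, one "copy,ultimate_parent" pair per line.
mapping_csv = io.StringIO(
    "bob/tool,alice/tool\n"
    "carol/tool-fork,alice/tool\n"
    "erin/lib,dave/lib\n"
)
parent_of = {copy: parent for copy, parent in csv.reader(mapping_csv)}

# Hypothetical project sample as (repository, number_of_commits) pairs.
sample = [("alice/tool", 420), ("bob/tool", 87), ("erin/lib", 1500), ("dave/lib", 900)]

# Group each sampled repository under its ultimate parent; repositories
# absent from the mapping are treated as their own parent.
groups = defaultdict(list)
for repo, commits in sample:
    groups[parent_of.get(repo, repo)].append((repo, commits))

# Within each group of copies, keep the repository with the most commits.
deduplicated = sorted(max(members, key=lambda rc: rc[1])[0] for members in groups.values())
print(deduplicated)  # ['alice/tool', 'erin/lib']
```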
“…Datasets that only focus on source code also exist, such as Boa, a dataset of queryable Java AST presented by Dyer et al [29]. Spinellis et al [30] focus on identifying duplicated repositories on GitHub.…”
Section: Related Work (mentioning)
confidence: 99%