2008
DOI: 10.1007/s10618-008-0118-x

Sourcerer: mining and searching internet-scale software repositories

Abstract: Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software on an Internet-scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical an…

Citations: cited by 216 publications (118 citation statements)
References: 48 publications
“…However, this work does not quantify how identifier properties vary, since it ignores variable and type names. Search of code at "internet-scale" was introduced by Linstead et al [8]. Another GitHub dataset, GHTorrent [9] has a different goal compared to our corpus, excluding source code and focusing on users, pull requests and all the issues surrounding social coding.…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to improve the Classifier's performance, more intelligent source code classification techniques will be implemented in the future (e.g. [12]). …”
Section: Harvesting and Classifying The Learning Materials (mentioning)
confidence: 99%
“…Topic modeling has recently been used in several research areas of software engineering, such as mining software repositories (MSR) [108,109,188], requirements traceability [7], and software evolution [111]. Linstead et al [109] applied LDA topic modeling technique on the source code of different versions in order to analyze software evolution.…”
Section: Topic Modeling In Software Engineering (mentioning)
confidence: 99%
“…Linstead et al [109] applied LDA topic modeling technique on the source code of different versions in order to analyze software evolution. Linstead and colleagues [108] further used topic modeling on Internet-scale software repositories, and summarized program function and developer activities by extracting topic-word and author-topic distributions. The use of topic modeling over source code has been validated and it has been found that the evolution of source code topics is indeed caused by actual change activities in the code [188].…”
Section: Topic Modeling In Software Engineering (mentioning)
confidence: 99%