2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) 2019
DOI: 10.1109/msr.2019.00022

Predicting Good Configurations for GitHub and Stack Overflow Topic Models

Abstract: Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and the…

Cited by 30 publications
(28 citation statements)
References 54 publications
“…DEOptim package also requires as an input the boundaries it has to take into account while performing optimization. Defining these boundaries can be difficult, and recent research by Treude and Wagner (2019) suggests that very wide scales should be used for the values. However, using wide scales led to a very unusual result, where all of the topics were practically identical.…”
Section: NLP Technique 2: Topic Discovery (mentioning)
confidence: 99%
“…Topic modeling is a text mining and concept extraction method that extracts topics (i.e., coherent word clusters) from large corpora of textual documents to discovery hidden semantic structures in text (Miner et al 2012). An advantage of topic modeling over other techniques is that it helps analyzing long texts (Treude and Wagner 2019; Miner et al 2012), creates clusters as "topics" (rather than individual words) and is unsupervised (Miner et al 2012).…”
Section: Introduction (mentioning)
confidence: 99%
“…Topic modeling techniques accept different types of textual documents and require the configuration of parameters (see Section 2.1). Carefully choosing parameters (such as the number of topics to be generated) is essential for obtaining valuable and reliable topics (Agrawal et al 2018; Treude and Wagner 2019). This RQ aims at analysing types of textual data (e.g., source code), actual documents (e.g., a Java class or an individual Java method) and configured parameters used for topic modeling to address software engineering problems.…”
Section: Introduction (mentioning)
confidence: 99%
“…On the other hand, the strategy on GitHub included all the above, plus removing: (i) code comments, (ii) characters denoting section headers, (iii) vertical and horizontal lines, (iv) characters that represent links or formatting. This was the solution proposed and performed by Treude and Wagner [25].…”
Section: Pre-processing Of Text Data In Software Engineering (mentioning)
confidence: 97%
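
The GitHub pre-processing steps listed in the last statement can be illustrated with a minimal sketch. It is not the code used by Treude and Wagner or by the citing paper: the function name `clean_readme` and every regular expression are assumptions chosen for illustration, and it assumes the input is Markdown-style README text. It covers only items (ii) to (iv) of the quoted list (section-header characters, vertical and horizontal lines, link and formatting characters); item (i), code comments, is left out because the statement does not say how comments were identified.

```python
import re


def clean_readme(text: str) -> str:
    """Hypothetical helper: strip Markdown residue before topic modelling."""
    # (iii) Drop horizontal rules such as '---' or '***'. This also removes
    # the '===' / '---' underlines of setext-style section headers.
    text = re.sub(r"^[ \t]*(?:[-*_=][ \t]*){3,}$", " ", text, flags=re.MULTILINE)
    # (iii) Drop vertical lines, e.g. table column separators.
    text = text.replace("|", " ")
    # (ii) Drop the leading '#' characters that denote section headers.
    text = re.sub(r"^[ \t]*#+[ \t]*", "", text, flags=re.MULTILINE)
    # (iv) Keep the visible link text; drop the brackets and the URL.
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # (iv) Drop emphasis and inline-code markers. Underscores are kept so
    # that identifiers like snake_case survive.
    text = re.sub(r"[*`]", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


sample = "# Title\n\n---\n\nSee [the docs](http://example.com) for **bold** | info\n"
print(clean_readme(sample))
```

The order matters in this sketch: horizontal rules are removed before emphasis markers, so that a line consisting of `***` is treated as a rule and not as stray formatting.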