Predicting Good Configurations for GitHub and Stack Overflow Topic Models

Wagner, Markus

doi:10.1109/msr.2019.00022

Cited by 30 publications

(28 citation statements)

References 54 publications


“…DEOptim package also requires as an input the boundaries it has to take into account while performing optimization. Defining these boundaries can be difficult, and recent research by Treude and Wagner (2019) suggests that very wide scales should be used for the values. However, using wide scales led to a very unusual result, where all of the topics were practically identical.…”

Section: NLP Technique 2: Topic Discovery (mentioning)

confidence: 99%

Predicting technical debt from commit contents: reproduction and extension with automated feature selection

Rantala

Mäntylä

2020

Software Qual J


Self-admitted technical debt refers to sub-optimal development solutions that are expressed in written code comments or commits. We reproduce and improve on a prior work by Yan et al. (2018) on detecting commits that introduce self-admitted technical debt. We use multiple natural language processing methods: Bag-of-Words, topic modeling, and word embedding vectors. We study 5 open-source projects. Our NLP approach uses logistic Lasso regression from Glmnet to automatically select best predictor words. A manually labeled dataset from prior work that identified self-admitted technical debt from code level commits serves as ground truth. Our approach achieves + 0.15 better area under the ROC curve performance than a prior work, when comparing only commit message features, and + 0.03 better result overall when replacing manually selected features with automatically selected words. In both cases, the improvement was statistically significant (p < 0.0001). Our work has four main contributions, which are comparing different NLP techniques for SATD detection, improved results over previous work, showing how to generate generalizable predictor words when using multiple repositories, and producing a list of words correlating with SATD. As a concrete result, we release a list of the predictor words that correlate positively with SATD, as well as our used datasets and scripts to enable replication studies and to aid in the creation of future classifiers.


“…Topic modeling is a text mining and concept extraction method that extracts topics (i.e., coherent word clusters) from large corpora of textual documents to discover hidden semantic structures in text (Miner et al 2012). An advantage of topic modeling over other techniques is that it helps analyzing long texts (Treude and Wagner 2019; Miner et al 2012), creates clusters as "topics" (rather than individual words) and is unsupervised (Miner et al 2012).…”

Section: Introduction (mentioning)

confidence: 99%

“…Topic modeling techniques accept different types of textual documents and require the configuration of parameters (see Section 2.1). Carefully choosing parameters (such as the number of topics to be generated) is essential for obtaining valuable and reliable topics (Agrawal et al 2018; Treude and Wagner 2019). This RQ aims at analysing types of textual data (e.g., source code), actual documents (e.g., a Java class or an individual Java method) and configured parameters used for topic modeling to address software engineering problems.…”

Section: Introduction (mentioning)

confidence: 99%

Topic modeling in software engineering research

2021


Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims at describing how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modelled most, (3) data pre-processing and modeling parameters vary quite a bit and are often vaguely reported, and (4) manual topic naming (such as deducting names based on frequent words in a topic) is common.


“…On the other hand, the strategy on GitHub included all the above, plus removing: (i) code comments, (ii) characters denoting section headers, (iii) vertical and horizontal lines, (iv) characters that represent links or formatting. This was the solution proposed and performed by Treude and Wagner [25].…”

Section: Pre-processing Of Text Data In Software Engineering (mentioning)

confidence: 97%

Understanding and Predicting Software Developer Expertise in Stack Overflow and GitHub

Vadlamani


I, Sri Lakshmi Vadlamani, would like to express my sincere gratitude to my amazing supervisor, Professor Olga Baysal, for her continuous guidance, advice, and friendly discussions. I was able to successfully complete this work solely because of her continuous efforts, her valuable feedback and positive reinforcements at every stage of this project and also throughout my MS journey. I am very thankful to God for giving me two sons, Karthik and Krithik; my children thoroughly supported me during this journey by showing great situational awareness, encouraging me, and keeping me optimistic at every step of this journey and especially in some difficult moments. Also, I am thankful to my parents, my in-laws, my siblings and my husband for their support and motivation. I am thankful to my peers Norbert Eke and Khadija Osman for their patience, optimism, support, and encouragement when I needed it the most.

