2010 10th IEEE Working Conference on Source Code Analysis and Manipulation
DOI: 10.1109/scam.2010.22

Estimating the Optimal Number of Latent Concepts in Source Code Analysis

Abstract: The optimal number of latent topics required to model the most accurate latent substructure for a source code corpus is an open question in source code analysis. Most estimates about the number of latent topics that exist in a software corpus are based on the assumption that the data is similar to natural language, but there is little empirical evidence to support this. In order to help determine the appropriate number of topics needed to accurately represent the source code, we generate a series of Latent Dir…

Citations: cited by 45 publications (49 citation statements)
References: 25 publications
“…A few articles proposed approaches determining the number of topics [28], [29], but they were task-specific. Our paper explore the application of topic modeling in a generic perspective other than a task-driven style, so we need to be groping for other estimating approach.…”
Section: The Number Of Topics | Type: mentioning
confidence: 99%
“…For instance, we used JHotDraw [27], a Java GUI framework, as our learning object. Referring to Grant and Cordy [29], where they think 100 to 200 is the best area for the number of topics of JHotDraw, we tested the number of topics ranging from 50 to 250 in 10 increments and evaluated each result using our Naive Criterion. We found that 80 is the most optimum value for the number of topics of JHotDraw.…”
Section: The Number Of Topics | Type: mentioning
confidence: 99%
“…However, in light of a recent study that showed that source code is exhibiting different characteristics that natural language text (e.g., it is more predictable and more repetitive) [6], we argue that using the same parameter values used in the IR community may not produce optimal results for SE. Although there were some heuristics [15,16] for configuring LDA parameters, these approaches focus only on configuring the number of topics, excluding the other hyper-parameters.…”
Section: B LDA-GA | Type: mentioning
confidence: 99%
“…Such methods can be a) manual, based on a domain expert understanding of the system [7,180], b) experimentally determined, in which LDA parameters are tuned until a configuration that achieves acceptable performance over a certain quality measure is reached [16,22], or c) automatically generated using statistical methods or machine learning approaches [101,202].…”
Section: Latent Dirichlet Allocation | Type: mentioning
confidence: 99%