Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement 2018
DOI: 10.1145/3239235.3267435

Measuring LDA topic stability from clusters of replicated runs

Abstract: Background: Unstructured and textual data is increasing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular data analysis method for it. Past work suggests that instability of LDA topics may lead to systematic errors. Aim: We propose a method that relies on replicated LDA runs and clustering, and provides a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. Then we use K-medoids to cluster the n*k topics to k clust…
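
The replicate-then-cluster step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the n replicated LDA runs are simulated by the hypothetical helper `simulate_runs`, which produces noisy copies of k made-up topic-word distributions; the L1 distance and the choice to initialise the medoids from the first run's topics are assumptions; and the stability metric is left out because the abstract is truncated before describing it.

```python
import random


def simulate_runs(n, k, vocab_size=6, seed=0):
    """Stand-in for n replicated LDA runs: returns n*k noisy topic-word distributions."""
    rng = random.Random(seed)
    topics, truth = [], []
    for _ in range(n):
        order = list(range(k))
        rng.shuffle(order)  # topic order is arbitrary within each run
        for t in order:
            # Hypothetical base topic t puts most of its mass on words 2t and 2t+1.
            weights = [0.4 if w // 2 == t else 0.05 for w in range(vocab_size)]
            weights = [x + rng.uniform(0.0, 0.05) for x in weights]
            total = sum(weights)
            topics.append([x / total for x in weights])
            truth.append(t)
    return topics, truth


def l1_distance(p, q):
    """L1 distance between two word distributions (an assumed choice of metric)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids: assign to the nearest medoid, then re-pick each medoid."""
    n = len(points)
    dist = [[l1_distance(p, q) for q in points] for p in points]

    def assign(medoids):
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = list(range(k))  # assumption: start from the first run's k topics
    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                updated.append(medoids[c])  # keep the old medoid for an empty cluster
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


n_runs, k = 5, 3
topics, truth = simulate_runs(n_runs, k)   # n*k = 15 topics in total
medoids, labels = k_medoids(topics, k)     # cluster the 15 topics into k clusters
sizes = [labels.count(c) for c in range(k)]
print("medoid indices:", medoids)
print("cluster sizes:", sizes)
```

In this simulated setting the topics are well separated, so each cluster is expected to collect one topic from every run. With real LDA output, unevenly sized or mixed clusters would be a sign of unstable topics.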

Cited by 39 publications (25 citation statements)
References 24 publications
“…On the other hand, the other blocks of this column-cluster are considered to be non-meaningful, and have a block effect parameter δ, which is common to all non-meaningful blocks. In the second section, we note, for example, that for h = 4 blocks (1, 4) and (2,4) are meaningful, and share the same block effect δ 4 . This means that terms from column-cluster 4 are specific to 225 documents from row-clusters 1 and 2.…”
Section: An Easy-to-read Structure (mentioning)
confidence: 89%
“…For instance, recently, [4] combines LDA and clustering algorithms to highlight the main topics of their clusters. In [5], the authors analyse scientific literature ent approach.…”
Section: Introduction (mentioning)
confidence: 99%
“…Binkley et al [6] ran the Gibbs sampler multiple times, suggesting that it reduces the probability of getting stuck in local optima. Recently, Mantyla et al [24] performed multiple LDA runs and combined the results of different runs through clustering.…”
Section: Stability Of the Generated Topics (mentioning)
confidence: 99%
“…The motivation of the research conducted by the authors of this paper was the fact that the study of a stable metric for the quality of topics continues. Moreover, the use of cluster analysis is one of the tools for analyzing the stability of topics [29] and the optimal number of topics [30], but it does not consider the benefits of the special training capabilities of the topic model with sequential regularization and dense representation of word-vectors.…”
mentioning
confidence: 99%