Recently, different studies have demonstrated the use of co-clustering, a data mining technique which simultaneously produces row-clusters of observations and column-clusters of features. The present work introduces a novel co-clustering model to easily summarize textual data in a document-term format. In addition to highlighting homogeneous co-clusters, as other existing algorithms do, we also distinguish noisy co-clusters from significant co-clusters, which is particularly useful for sparse document-term matrices. Furthermore, our model proposes a structure among the significant co-clusters, thus providing improved interpretability to users. The proposed approach competes with state-of-the-art methods for document and term clustering and offers user-friendly results. The model relies on the Poisson distribution and on a constrained version of the Latent Block Model, a probabilistic approach to co-clustering. A Stochastic Expectation-Maximization algorithm is proposed for the model's inference, together with a model selection criterion to choose the number of co-clusters. Both simulated and real data sets illustrate the efficiency of this model through its ability to easily identify relevant co-clusters.

[…] related to the field of e-Health. In [6], the authors describe the Biterm Topic Model (BTM). It outperforms LDA on short texts (such as instant messages and tweets), for which LDA performs poorly due to the sparsity of the data. In [7], the authors propose another version of the BTM: they represent the biterms (word pairs) as graphs and use a deep convolutional network to encode word co-relationships.

This work presents the Self-Organised Co-Clustering model (SOCC). It aims at providing a tool to summarize large document-term matrices, whose rows correspond to documents and whose columns correspond to terms.
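To make the generative view of the Poisson Latent Block Model concrete, the following is a minimal simulation sketch: each entry of the document-term matrix is drawn from a Poisson distribution whose mean depends only on the (row-cluster, column-cluster) block of that entry. All numerical settings below (dimensions, number of clusters, block intensities) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Illustrative sketch: simulate a document-term matrix under a Poisson
# Latent Block Model. Sizes and gamma values are hypothetical.
rng = np.random.default_rng(0)

N, J = 60, 40   # documents x terms
G, H = 3, 2     # row-clusters x column-clusters

# gamma[g, h] is the Poisson mean of block (g, h).
gamma = np.array([[5.0, 0.1],
                  [0.1, 5.0],
                  [2.0, 2.0]])

# Draw latent row and column partitions uniformly at random.
z = rng.integers(0, G, size=N)   # row-cluster of each document
w = rng.integers(0, H, size=J)   # column-cluster of each term

# Each count x[i, j] ~ Poisson(gamma[z[i], w[j]]).
x = rng.poisson(gamma[z][:, w])

print(x.shape)  # (60, 40)
```

In this sketch, sparsity arises naturally in blocks with small intensity (e.g. 0.1), which is the regime where separating noisy from significant co-clusters matters.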
The clustering approach, which forms homogeneous groups of observations (documents in this case), is a useful unsupervised technique with proven efficiency in several domains. However, in high-dimensional and sparse contexts, it is sometimes less suitable and difficult to interpret. When considering such data sets, co-clustering, which groups observations and features simultaneously, turns out to be more efficient. It exploits the duality between rows and columns, and the data set is summarized in blocks (the crossing of a row-cluster and a column-cluster). The clusters of documents help in finding similar documents, while the clusters of terms tell us what the clusters of documents are about. In this context, our work helps in finding similar documents and their interactions with term clusters.
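The "blocks" idea above can be sketched in a few lines: given a row partition and a column partition, the matrix is summarized by one statistic (here, the mean) per block. The matrix and partition labels below are arbitrary illustrative values.

```python
import numpy as np

# Minimal sketch: summarize an N x J matrix by the mean of each
# (row-cluster, column-cluster) block, given fixed partitions.
x = np.arange(24, dtype=float).reshape(4, 6)
z = np.array([0, 0, 1, 1])        # row-cluster label of each row
w = np.array([0, 1, 0, 1, 0, 1])  # column-cluster label of each column

G, H = z.max() + 1, w.max() + 1
summary = np.zeros((G, H))
for g in range(G):
    for h in range(H):
        # np.ix_ selects the submatrix (block) at the crossing of
        # row-cluster g and column-cluster h.
        summary[g, h] = x[np.ix_(z == g, w == h)].mean()

print(summary)  # [[ 5.  6.] [17. 18.]]
```

The 4 x 6 matrix is thus compressed into a 2 x 2 block summary, which is how co-clustering makes a large document-term matrix readable.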
The co-clustering task can be done in several ways. For example, in [8], the authors describe an original approach that uses optimal transport theory to co-cluster continuous data. However, we mostly distinguish between two kinds of co-clustering approaches. Matrix factorization based methods, e.g. [9, 10], consist of factorizing the N × J data matrix x into three matrices a (of size N × G), b (of size G × H) and c (...