Topic models have long been the prevailing approach for discovering latent semantics in long documents. On short texts, however, they generally suffer from data sparsity because word co-occurrences are extremely limited, and thus tend to yield repetitive or trivial topics of low quality. To address this issue, in this paper we propose a novel neural topic model in the autoencoding framework with a new topic distribution quantization approach that generates peakier distributions, which are more appropriate for modeling short texts. Beyond the encoding, to tackle the issue on the decoding side, we further propose a novel negative sampling decoder that learns from negative samples to avoid yielding repetitive topics. We observe that our model can substantially improve short text topic modeling performance. Through extensive experiments on real-world datasets, we demonstrate that our model can outperform both strong traditional and neural baselines under extreme data sparsity, producing high-quality topics.
Topic models are effective in capturing the latent semantics of large-scale textual data, but existing methods are normally designed and evaluated on balanced corpora. This contradicts the fact that real-world corpora are naturally long-tailed, and the long-tailed bias can severely impair topic modeling performance. Therefore, in this paper, we propose a causal inference framework to explain and overcome the issues of topic modeling on long-tailed corpora. In a neat and elegant way, causal intervention is applied during training to remove the influence of the long-tailed bias. Extensive experiments on manually constructed and naturally collected datasets demonstrate that our model can mitigate the bias effect, greatly improve topic quality, and better discover the hidden semantics in the tail.