2020
DOI: 10.1162/tacl_a_00325
Topic Modeling in Embedding Spaces

Abstract: Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word's embedding and an embedding of its assigned topic.
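
In symbols, and only as a minimal sketch (the notation ρ for the word embedding matrix and α_k for a topic embedding follows the standard ETM write-up and is an assumption here, not text from this page), the per-topic word distribution described above is

    p(w_{dn} = v \mid z_{dn} = k) = \mathrm{softmax}(\rho^\top \alpha_k)_v = \frac{\exp(\rho_v^\top \alpha_k)}{\sum_{v'} \exp(\rho_{v'}^\top \alpha_k)},

where ρ_v is the embedding of vocabulary word v and α_k is the embedding of the assigned topic k.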

Cited by 455 publications (314 citation statements)
References 27 publications
“…Using one of the most popular embedding algorithms, a generalization of the original word2vec algorithm (Mikolov, Chen, Corrado, & Dean, 2013), termed GloVe (Pennington, Socher, & Manning, 2014), our selected target word “feeling” produced high similarity scores for “i,” “today,” and “good” when considering the full document. LDA has recently been integrated with embeddings (e.g., Dieng, Ruiz, & Blei, 2019), both in using topics to create embeddings and in identifying latent topics directly from embeddings; consequently, the LDA model can better incorporate the context of each word, moving beyond an unordered representation of words.…”
Section: Statistical Algorithms (mentioning)
confidence: 99%
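
The nearest-neighbour behaviour described in this excerpt can be sketched with pretrained GloVe vectors. The snippet below is illustrative only: it assumes the gensim library and its bundled 100-dimensional GloVe model, not the cited study's actual setup.

    import gensim.downloader as api

    # Pretrained GloVe vectors (Wikipedia + Gigaword); the model name is an assumption.
    glove = api.load("glove-wiki-gigaword-100")

    # Cosine-similarity nearest neighbours of an illustrative target word.
    for word, score in glove.most_similar("feeling", topn=5):
        print(f"{word}\t{score:.3f}")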
“…Xu et al. [30] adopted Wasserstein distance with a distillation mechanism to learn topics and word embeddings jointly. Dieng et al. [31] used the inner product between a word embedding and an embedding of the assigned topic to parameterize the categorical word distribution in topic models.…”
Section: Related Work (mentioning)
confidence: 99%
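
As a rough numerical illustration of the parameterization attributed to Dieng et al. [31] above (a sketch with made-up dimensions, not the authors' code), each topic embedding is turned into a categorical distribution over the vocabulary via a softmax over inner products:

    import numpy as np

    rng = np.random.default_rng(0)
    V, L, K = 5000, 300, 50          # vocabulary size, embedding dim, topics (illustrative)
    rho = rng.normal(size=(V, L))    # word embeddings, one row per vocabulary word
    alpha = rng.normal(size=(K, L))  # one embedding per topic

    # Inner products are the natural parameters of the categorical word distribution;
    # a column-wise softmax turns them into p(word | topic).
    logits = rho @ alpha.T                      # shape (V, K)
    beta = np.exp(logits - logits.max(axis=0))
    beta /= beta.sum(axis=0)                    # column k sums to 1 over the vocabulary

Each column of beta is one topic's word distribution; words whose embeddings lie close to a topic's embedding receive high probability under that topic.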
“…Recently, a lot of work has harnessed topic modeling (Blei et al. 2003) along with word vectors to learn better word and sentence representations, e.g., LDA (Chen and Liu 2014), weight-BoC (Kim, Kim, and Cho 2017), TWE, NTSG (Liu, Qiu, and Huang 2015), WTM (Fu et al. 2016), w2v-LDA (Nguyen et al. 2015), TV+MeanWV (Li et al. 2016a), LTSG (Law et al. 2017), Gaussian-LDA (Das, Zaheer, and Dyer 2015), Topic2Vec (Niu et al. 2015), ETM (Dieng, Ruiz, and Blei 2019b), LDA2vec (Moody 2016), D-ETM (Dieng, Ruiz, and Blei 2019a), and MvTM. (Kiros et al. 2015) propose skip-thought document embedding vectors, which transformed the idea of abstracting the distributional hypothesis from word to sentence level.…”
Section: Related Work (mentioning)
confidence: 99%