LDA-based document models for ad-hoc retrieval

Wei, Xing; Croft, W. Bruce

doi:10.1145/1148170.1148204

Cited by 825 publications

(606 citation statements)

References 15 publications

Supporting

Mentioning

591

Contrasting

Unclassified

“…It has been shown to perform as well or better than many other popular techniques for machine learning, data mining, and supervised and unsupervised classification of data. Indeed, LDA has been found to have a similar running time for processing as k-means (Wei and Croft 2006), a long-used approach for unsupervised clustering, which lacks LDA's capability to associate documents with a distribution over topics rather than assignment of each document to a single, unique topic. Modifications, extensions, improvements, and additions to LDA are being developed and released at a rapid pace; some relevant extensions are discussed later in this article.…”

Section: Latent Dirichlet Allocation (mentioning)

confidence: 99%

Modeling virtual organizations with Latent Dirichlet Allocation: A case for natural language processing

Groß

Murthy

2014

Neural Networks

This paper explores a variety of methods for applying the Latent Dirichlet Allocation (LDA) automated topic modeling algorithm to the modeling of the structure and behavior of virtual organizations found within modern social media and social networking environments. As the field of Big Data reveals, an increase in the scale of social data available presents new challenges which are not tackled by merely scaling up hardware and software. Rather, they necessitate new methods and, indeed, new areas of expertise. Natural language processing provides one such method. This paper applies LDA to the study of scientific virtual organizations whose members employ social technologies. Because of the vast data footprint in these virtual platforms, we found that natural language processing was needed to 'unlock' and render visible latent, previously unseen conversational connections across large textual corpora (spanning profiles, discussion threads, forums, and other social media incarnations). We introduce variants of LDA and ultimately make the argument that natural language processing is a critical interdisciplinary methodology to make better sense of social 'Big Data' and we were able to successfully model nested discussion topics from forums and blog posts using LDA. Importantly, we found that LDA can move us beyond the state-of-the-art in conventional Social Network Analysis techniques.

“…The explicit assumption about the prior sources of these variables provides complete generative semantics for the model [2][6][16]. Moreover, the mathematical property that the Dirichlet priors of p(θ_d | α) and p(Φ_z | β) are conjugate to their likelihoods (multinomial distributions) p(z | θ_d) and p(w | Φ_z) results in the fact that their posteriors p(θ_d | α, {z_i | for all tokens in doc d}) and p(Φ_z | β, {w_i | for all tokens generated by z}) are also Dirichlet distributions.…”

Section: Fig 1 Latent Dirichlet Relevance Model (mentioning)

confidence: 99%
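The conjugacy property described in the citation statement above can be written out explicitly. This is only a sketch of the standard Dirichlet-multinomial result, not a formula taken from the citing paper. The count symbols n_{d,k} (number of tokens in document d assigned to topic k) and m_{z,v} (number of times vocabulary word v is generated by topic z), along with the topic count K and vocabulary size V, are notation assumed here for illustration.

```latex
\begin{aligned}
p(\theta_d \mid \alpha, \{z_i\})
  &\propto p(\theta_d \mid \alpha) \prod_{i \in d} p(z_i \mid \theta_d)
   \propto \prod_{k=1}^{K} \theta_{d,k}^{\,\alpha_k + n_{d,k} - 1} \\
\Rightarrow\quad \theta_d \mid \alpha, \{z_i\}
  &\sim \mathrm{Dirichlet}(\alpha_1 + n_{d,1}, \dots, \alpha_K + n_{d,K}) \\[6pt]
p(\Phi_z \mid \beta, \{w_i\})
  &\propto p(\Phi_z \mid \beta) \prod_{i:\, z_i = z} p(w_i \mid \Phi_z)
   \propto \prod_{v=1}^{V} \Phi_{z,v}^{\,\beta_v + m_{z,v} - 1} \\
\Rightarrow\quad \Phi_z \mid \beta, \{w_i\}
  &\sim \mathrm{Dirichlet}(\beta_1 + m_{z,1}, \dots, \beta_V + m_{z,V})
\end{aligned}
```

In both cases the posterior keeps the Dirichlet form and simply adds the observed counts to the prior parameters, which is what makes count-based inference schemes such as Gibbs sampling convenient for this family of models.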

“…In LDA, topic proportion of every document is a K-dimensional hidden variable randomly drawn from the same Dirichlet distribution, where K is the number of topics. Thus, generative semantics of LDA are complete [16]. LDA and its variants have been applied in many applications such as finding scientific topics [6], E-community discovery [18], mixed-membership analysis [5] and ad-hoc retrieval for representing document language model [4][16].…”

Section: Related Work (mentioning)

confidence: 99%
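The generative view described in the citation statement above (one K-dimensional topic proportion per document, all drawn from the same Dirichlet) can be illustrated with a short sampling sketch. This is a minimal illustration of the textbook LDA generative process under assumed settings, not code from any of the cited papers. The function name `generate_corpus`, the symmetric prior values, and the fixed document length are all hypothetical choices made for the example.

```python
import numpy as np


def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process.

    alpha and beta are symmetric Dirichlet hyperparameters (assumed values).
    Returns (theta, phi, docs), where docs is a list of word-id arrays.
    """
    rng = np.random.default_rng(seed)

    # phi[k]: word distribution of topic k, drawn from Dirichlet(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # theta[d]: K-dimensional topic proportions of document d,
    # every row drawn from the same Dirichlet(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    docs = []
    for d in range(n_docs):
        # Pick a topic for each token from the document's topic proportions.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Pick each word from the word distribution of its assigned topic.
        words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(words)
    return theta, phi, docs


if __name__ == "__main__":
    theta, phi, docs = generate_corpus(n_docs=3, doc_len=20,
                                       n_topics=4, vocab_size=50)
    print(theta.round(2))  # each row is one document's topic mixture
    print(docs[0])         # word ids of the first sampled document
```

Each row of `theta` is a full distribution over topics, so a document can be about several topics at once rather than being assigned to a single one.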

“…Thus, generative semantics of LDA are complete [16]. LDA and its variants have been applied in many applications such as finding scientific topics [6], E-community discovery [18], mixed-membership analysis [5] and ad-hoc retrieval for representing document language model [4][16]. However, a common problem of both pLSI and LDA is their inability to model the concept of relevance, which is key in information retrieval [7][13][14].…”

Section: Related Work (mentioning)

confidence: 99%

A Latent Dirichlet Framework for Relevance Modeling

Ha-Thuc

Srinivasan

2009

Information Retrieval Technology

Abstract. Relevance-based language models operate by estimating the probabilities of observing words in documents relevant (or pseudo relevant) to a topic. However, these models assume that if a document is relevant to a topic, then all tokens in the document are relevant to that topic. This could limit model robustness and effectiveness. In this study, we propose a Latent Dirichlet relevance model, which relaxes this assumption. Our approach derives from current research on Latent Dirichlet Allocation (LDA) topic models. LDA has been extensively explored, especially for generating a set of topics from a corpus. A key attraction is that in LDA a document may be about several topics. LDA itself, however, has a limitation that is also addressed in our work. Topics generated by LDA from a corpus are synthetic, i.e., they do not necessarily correspond to topics identified by humans for the same corpus. In contrast, our model explicitly considers the relevance relationships between documents and given topics (queries). Thus unlike standard LDA, our model is directly applicable to goals such as relevance feedback for query modification and text classification, where topics (classes and queries) are provided upfront. Thus although the focus of our paper is on improving relevance-based language models, in effect our approach bridges relevance-based language models and LDA addressing limitations of both. Finally, we propose an idea that takes advantage of "bag-of-words" assumption to reduce the complexity of Gibbs sampling based learning algorithm.

“…LDA is an intensively studied model, and the experiments are really impressive compared to other known information retrieval techniques. The applications of LDA include entity resolution [4], fraud detection in telecommunication systems [5], image processing [6,7,8] and ad-hoc retrieval [9].…”

Section: Introduction (mentioning)

confidence: 99%

Latent Dirichlet Allocation for Automatic Document Categorization

Bíró

Szabó

2009

Machine Learning and Knowledge Discovery in Databases

Abstract. In this paper we introduce and evaluate a technique for applying latent Dirichlet allocation to supervised semantic categorization of documents. In our setup, for every category an own collection of topics is assigned, and for a labeled training document only topics from its category are sampled. Thus, compared to the classical LDA that processes the entire corpus in one, we essentially build separate LDA models for each category with the category-specific topics, and then these topic collections are put together to form a unified LDA model. For an unseen document the inferred topic distribution gives an estimation how much the document fits into the category. We use this method for Web document classification. Our key results are 46% decrease in 1-AUC value in classification accuracy over tf.idf with SVM and 43% over the plain LDA baseline with SVM. Using a careful vocabulary selection method and a heuristic which handles the effect that similar topics may arise in distinct categories the improvement is 83% over tf.idf with SVM and 82% over LDA with SVM in 1-AUC.
