David Andrzejewski scite author profile

Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model’s ability to follow and generalize beyond user-specified domain knowledge.


Statistical Debugging Using Latent Topic Models

Andrzejewski

Mulhern

Liblit

et al.


Abstract. Statistical debugging uses machine learning to model program failures and help identify root causes of bugs. We approach this task using a novel Delta-Latent-Dirichlet-Allocation model. We model execution traces attributed to failed runs of a program as being generated by two types of latent topics: normal usage topics and bug topics. Execution traces attributed to successful runs of the same program, however, are modeled by usage topics only. Joint modeling of both kinds of traces allows us to identify weak bug topics that would otherwise remain undetected. We perform model inference with collapsed Gibbs sampling. In quantitative evaluations on four real programs, our model produces bug topics highly correlated to the true bugs, as measured by the Rand index. Qualitative evaluation by domain experts suggests that our model outperforms existing statistical methods for bug cause identification, and may help support other software tasks not addressed by earlier models.
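The abstract above reports agreement between bug topics and true bugs using the Rand index. As a minimal sketch of how that measure is computed for two labelings of the same runs (the function name `rand_index` and the toy labels are illustrative assumptions, not taken from the paper), one could write:

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree.

    A pair "agrees" when both labelings put the two items in the same
    group, or both put them in different groups.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical example: true bug IDs vs. inferred bug-topic IDs for four runs.
true_bugs = [0, 0, 1, 1]
inferred_topics = [1, 1, 0, 0]
print(rand_index(true_bugs, inferred_topics))
```

The index depends only on which items are grouped together, so relabeling the groups (as in the example) does not change the score.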


Latent Dirichlet Allocation with topic-in-set knowledge

Andrzejewski

Zhu

2009

126


Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data. We propose a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling. This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise. Preliminary experiments on text datasets are presented to demonstrate the potential effectiveness of this method.
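The abstract does not spell out the mechanism, so the following is only a sketch under an assumption: that the supervision limits each supervised word's topic assignment to a user-given set of topics during sampling. The function name `restrict_to_allowed_topics` and the example weights are hypothetical.

```python
def restrict_to_allowed_topics(topic_weights, allowed_topics):
    """Zero out topics outside the allowed set and renormalize.

    topic_weights: unnormalized sampling weights, one per topic
        (assumed to come from an ordinary topic-model sampler).
    allowed_topics: set of topic indices this word may be assigned to.
    """
    masked = [
        weight if topic in allowed_topics else 0.0
        for topic, weight in enumerate(topic_weights)
    ]
    total = sum(masked)
    if total <= 0.0:
        raise ValueError("no allowed topic has positive weight")
    return [weight / total for weight in masked]


# Hypothetical example: three topics, the word is limited to topics 0 and 1.
print(restrict_to_allowed_topics([0.2, 0.3, 0.5], {0, 1}))
```

Words without any supervision would simply use the full set of topics, which leaves their weights unchanged apart from normalization.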


Latent topic feedback for information retrieval

Andrzejewski

Buttler

2011


We consider the problem of a user navigating an unfamiliar corpus of text documents where document metadata is limited or unavailable, the domain is specialized, and the user base is small. These challenging conditions may hold, for example, within an organization such as a business or government agency. We propose to augment standard keyword search with user feedback on latent topics. These topics are automatically learned from the corpus in an unsupervised manner and presented alongside search results. User feedback is then used to reformulate the original query, resulting in improved information retrieval performance in our experiments.


Accelerated Gibbs Sampling for Infinite Sparse Factor Analysis

Andrzejewski

2011


The Indian Buffet Process (IBP) gives a probabilistic model of sparse binary matrices with an unbounded number of columns. This construct can be used, for example, to model a fixed number of observed data points (rows) associated with an unknown number of latent features (columns). Markov Chain Monte Carlo (MCMC) methods are often used for IBP inference, and in this technical note, we provide a detailed review of the derivations of collapsed and accelerated Gibbs samplers for the linear-Gaussian infinite latent feature model. We also discuss and explain update equations for hyperparameter resampling in a "full Bayesian" treatment and present a novel slice sampler capable of extending the accelerated Gibbs sampler to the case of infinite sparse factor analysis by allowing the use of real-valued latent features.
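To make the binary-matrix prior concrete, here is a minimal sketch of drawing one matrix from the IBP using its standard sequential "buffet" construction. It illustrates only the prior, not the Gibbs or slice samplers reviewed in the note; the function names, the concentration parameter name `alpha`, and the seed handling are assumptions made for this example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson variate by Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_rows, alpha, seed=0):
    """Return a binary matrix (list of rows) drawn from an IBP prior."""
    rng = random.Random(seed)
    column_counts = []  # how many earlier rows use each feature
    active = []         # per row, the list of feature indices it uses
    for i in range(1, num_rows + 1):
        existing = len(column_counts)
        # Reuse feature k with probability (rows already using k) / i.
        taken = [k for k in range(existing)
                 if rng.random() < column_counts[k] / i]
        for k in taken:
            column_counts[k] += 1
        # Then introduce Poisson(alpha / i) brand-new features.
        new = sample_poisson(alpha / i, rng)
        column_counts.extend([1] * new)
        active.append(taken + list(range(existing, existing + new)))
    num_cols = len(column_counts)
    return [[1 if k in set(row) else 0 for k in range(num_cols)]
            for row in active]


for row in sample_ibp(num_rows=5, alpha=2.0, seed=1):
    print(row)
```

The number of columns is not fixed in advance: it grows as later rows introduce new features, which is the sense in which the column count is unbounded.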

