Axiomatic Analysis of Smoothing Methods in Language Models for Pseudo-Relevance Feedback

Hazimeh, Hussein; Zhai, ChengXiang

doi:10.1145/2808194.2809471

Cited by 15 publications

(8 citation statements)

References 16 publications


“…Finally, α > 0 is the pseudo-count smoothing parameter. Motivated by a Bayesian interpretation of placing a Jeffrey's type Dirichlet prior over multinomial counts, we choose α = 0.5 (Hazimeh and Zhai, 2015; Valcarce et al, 2016; Manning et al, 2008). The quadratic loss is given by the following formula:…”

Section: Classifier Assessment (mentioning)

confidence: 99%
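As a rough illustration of the pseudo-count smoothing this statement refers to, here is a minimal sketch. It assumes the standard additive estimate (c_i + α) / (N + αK) over K categories with total count N. The function name and the example counts are hypothetical and are not taken from any of the cited papers.

```python
from typing import List, Sequence


def additive_smoothing(counts: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Smoothed multinomial estimate: (c_i + alpha) / (N + alpha * K).

    alpha is the pseudo-count added to every category; 0.5 is the value
    mentioned in the statement above. Everything else is an assumption
    made for illustration.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if len(counts) == 0:
        raise ValueError("counts must be non-empty")
    # Denominator: observed total N plus alpha for each of the K categories.
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Hypothetical counts for three categories; the second was never observed.
probs = additive_smoothing([3, 0, 1])
```

With these counts, the unobserved category still receives a small non-zero probability, and the estimates sum to one.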

A Bayesian Mixture Modelling Approach For Spatial Proteomics

Crook

Mulvey

Kirk

et al. 2018

Preprint


Analysis of the spatial sub-cellular distribution of proteins is of vital importance to fully understand context specific protein function. Some proteins can be found with a single location within a cell, but up to half of proteins may reside in multiple locations, can dynamically re-localise, or reside within an unknown functional compartment. These considerations lead to uncertainty in associating a protein to a single location. Currently, mass spectrometry (MS) based spatial proteomics relies on supervised machine learning algorithms to assign proteins to sub-cellular locations based on common gradient profiles. However, such methods fail to quantify uncertainty associated with sub-cellular class assignment. Here we reformulate the framework on which we perform statistical analysis. We propose a Bayesian generative classifier based on Gaussian mixture models to assign proteins probabilistically to sub-cellular niches, thus proteins have a probability distribution over sub-cellular locations, with Bayesian computation performed using the expectation-maximisation (EM) algorithm, as well as Markov-chain Monte-Carlo (MCMC). Our methodology allows proteome-wide uncertainty quantification, thus adding a further layer to the analysis of spatial proteomics. Our framework is flexible, allowing many different systems to be analysed and reveals new modelling opportunities for spatial proteomics. We find our methods perform competitively with current state-of-the-art machine learning methods, whilst simultaneously providing more information. We highlight several examples where classification based on the support vector machine is unable to make any conclusions, while uncertainty quantification using our approach provides biologically intriguing results. To our knowledge this is the first Bayesian model of MS-based spatial proteomics data.

* omc25@cam.ac.uk † lg390@cam.ac.uk

Author summary

Sub-cellular localisation of proteins provides insights into sub-cellular biological processes. For a protein to carry out its intended function it must be localised to the correct sub-cellular environment, whether that be organelles, vesicles or any sub-cellular niche. Correct sub-cellular localisation ensures the biochemical conditions for the protein to carry out its molecular function are met, as well as being near its intended interaction partners. Therefore, mis-localisation of proteins alters cell biochemistry and can disrupt, for example, signalling pathways or inhibit the trafficking of material around the cell. The sub-cellular distribution of proteins is complicated by proteins that can reside in multiple micro-environments, or those that move dynamically within the cell. Methods that predict protein sub-cellular localisation often fail to quantify the uncertainty that arises from the complex and dynamic nature of the sub-cellular environment. Here we present a Bayesian methodology to analyse protein sub-cellular localisation. We ex...


“…The rapid development of language modeling (LM) provides favorable conditions for the development of effective PRF models (for instance, Ponte & Croft, 1998). A wide range of retrieval approaches based on LM have been proposed (for instance, Lavrenko & Croft, 2001; Lv & Zhai, 2009a; Song & Croft, 1999; Hazimeh & Zhai, 2015; Zhai, 2008), in which feedback documents are always exploited to reestimate a more accurate query language model. For example, Zhai and Lafferty (2001) presented a model-based feedback model, in which two different approaches were evaluated for updating a query language model based on feedback documents: one approach based on a generative probabilistic model of feedback documents and the other one based on the minimization of the KL divergence over feedback documents.…”

Section: Related Work (mentioning)

confidence: 99%

“…The other traditional class of models we should mention here is the relevance model (RM) framework, which is a well-known LM-based retrieval framework. It has an intuitive probabilistic interpretation and has been proven to be effective in several empirical studies (for instance, Hazimeh & Zhai, 2015). Two assumptions are adopted in the RM framework: one is that each piece of information related to the topic has an underlying RM, which follows multinomial distribution over words, and the other is that the terms belonging to the query topic and the terms in the feedback documents are randomly sampled according to a distribution R. Generally, RMs could have different forms based on different estimation approaches, and these models do not model the relevant or pseudo-relevant documents in an explicit way.…”

Section: Adaptation Of Traditional Models (mentioning)

confidence: 99%

A simple kernel co‐occurrence‐based enhancement for pseudo‐relevance feedback

Pan

Huang

et al. 2019

Asso for Info Science & Tech


Pseudo-relevance feedback is a well-studied query expansion technique in which it is assumed that the top-ranked documents in an initial set of retrieval results are relevant and expansion terms are then extracted from those documents. When selecting expansion terms, most traditional models do not simultaneously consider term frequency and the co-occurrence relationships between candidate terms and query terms. Intuitively, however, a term that has a higher co-occurrence with a query term is more likely to be related to the query topic. In this article, we propose a kernel co-occurrence-based framework to enhance retrieval performance by integrating term co-occurrence information into the Rocchio model and a relevance language model (RM3). Specifically, a kernel co-occurrence-based Rocchio method (KRoc) and a kernel co-occurrence-based RM3 method (KRM3) are proposed. In our framework, co-occurrence information is incorporated into both the factor of the term discrimination power and the factor of the within-document term weight to boost retrieval performance. The results of a series of experiments show that our proposed methods significantly outperform the corresponding strong baselines over all data sets in terms of the mean average precision and over most data sets in terms of P@10. A direct comparison of standard Text Retrieval Conference data sets indicates that our proposed methods are at least comparable to state-of-the-art approaches.


“…When performing pseudo-relevance feedback in retrieval, an axiomatic analysis of RM1 showed that additive smoothing is a better smoothing method than the others because it does not demote the IDF effect [8]. For collaborative filtering, relevance models work better with Absolute Discounting than with Dirichlet priors or Jelinek-Mercer. However, a posterior axiomatic analysis of RM2 for collaborative filtering showed that the IDF effect is related to item novelty in recommendation advocating for the use of additive smoothing in this setting.…”

Section: Additive Smoothing (A) (mentioning)

confidence: 99%

“…Hazimeh and Zhai studied formally the IDF effect on several state-of-the-art pseudo-relevance feedback techniques based on the language modelling framework (including relevance models) [8]. The IDF effect is a heuristic that emphasizes the selection of documents with highly specific terms. They found that the selection of the smoothing method impacts the IDF effect.…”

Section: Introduction (mentioning)

confidence: 99%

Axiomatic Analysis of Language Modelling of Recommender Systems

Valcarce

Parapar

Barreiro

2017

Int. J. Unc. Fuzz. Knowl. Based Syst.


Language Models constitute an effective framework for text retrieval tasks. Recently, it has been extended to various collaborative filtering tasks. In particular, relevance-based language models can be used for generating highly accurate recommendations using a memory-based approach. On the other hand, the query likelihood model has proven to be a successful strategy for neighbourhood computation. Since relevance-based language models rely on user neighbourhoods for producing recommendations, we propose to use the query likelihood model for computing those neighbourhoods instead of cosine similarity. The combination of both techniques results in a formal probabilistic recommender system which has not been used before in collaborative filtering. A thorough evaluation on three datasets shows that the query likelihood model provides better results than cosine similarity. To understand this improvement, we devise two properties that a good neighbourhood algorithm should satisfy. Our axiomatic analysis shows that the query likelihood model always enforces those constraints while cosine similarity does not.
