Retrieval testing with hypergeometric document models

Wilbur, W. John

doi:10.1002/(sici)1097-4571(199307)44:6<340::aid-asi4>3.0.co;2-z

Cited by 3 publications

(2 citation statements)

References 11 publications

“…Wilbur [39] was the first to use the central hypergeometric distribution in a retrieval setting. The vocabulary overlap of two documents is modeled as a hypergeometric distribution for determining the relevance to each other.…”

Section: Related Work (mentioning)

confidence: 99%

Hypergeometric language models for republished article finding

Tsagkias

Rijke

Weerkamp

2011

Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

Republished article finding is the task of identifying instances of articles that have been published in one source and republished more or less verbatim in another source, which is often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query. Our approach is based on language modeling. We revisit the assumptions underlying the unigram language model taking into account the fact that in our setup queries are as long as complete news articles. We argue that in this case, the underlying generative assumption of sampling words from a document with replacement, i.e., the multinomial modeling of documents, produces less accurate query likelihood estimates. To make up for this discrepancy, we consider distributions that emerge from sampling without replacement: the central and noncentral hypergeometric distributions. We present two retrieval models that build on top of these distributions: a log odds model and a Bayesian model where document parameters are estimated using the Dirichlet compound multinomial distribution. We analyse the behavior of our new models using a corpus of news articles and blog posts and find that for the task of republished article finding, where we deal with queries whose length approaches the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
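The contrast the abstract draws between sampling with and without replacement can be illustrated with a small sketch. This is not the retrieval model of the cited paper; it only compares the central hypergeometric probability with its binomial (with-replacement) counterpart for a single term. The function names and the numbers (a 100-token document containing 10 occurrences of a term) are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k marked tokens) when n tokens are drawn WITHOUT
    replacement from N tokens, K of which are marked."""
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binom_pmf(k, n, p):
    """P(exactly k marked tokens) when n tokens are drawn WITH
    replacement, each marked with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical document: N tokens, K of them are occurrences of one term.
N, K = 100, 10

# Short query (5 tokens): compare the two models' probability of one match.
short_query = (hypergeom_pmf(1, N, K, 5), binom_pmf(1, 5, K / N))

# Query as long as the document: drawing every token without replacement
# must recover all K occurrences, so the hypergeometric model is certain,
# while the with-replacement model still spreads its probability mass.
full_query = (hypergeom_pmf(K, N, K, N), binom_pmf(K, N, K / N))

print("short query:", short_query)
print("full-length query:", full_query)
```

In the full-length case the hypergeometric probability is exactly 1, whereas the binomial probability is well below 1, which is one way to see why the with-replacement assumption becomes less suitable as query length approaches document length.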

“…It is only recently that the hypergeometric distribution was highlighted in connection with IR. Besides an 'older' article by Wilbur [13], which uses the hypergeometric distribution as a tool to describe the probability that a document is related to a query, we only know of the recent articles by Shaw et al [14] and Egghe and Rousseau [15]. As we feel that this distribution is basic to all quantitative approaches to IR, we present the simple argument which leads to its use.…”

Section: Duality and The Hypergeometric Distribution (mentioning)

confidence: 99%

Duality in information retrieval and the hypergeometric distribution

Egghe

Rousseau

1997

Journal of Documentation

Duality is an important topic in informetrics, especially in connection with the classical informetric laws. Yet, this concept is less studied in information retrieval. It deals with the unification or symmetry between queries and documents, retrieval versus indexing, and relevant versus retrieved documents. These ideas are elaborated in this note and the connection with the hypergeometric distribution is highlighted.

A New Term Significance Weighting Approach

Zhang

Nguyen

2005

J Intell Inf Syst
