Intent-based diversification of web search results: metrics and algorithms

Chapelle, Olivier; Ji, Shihao; Liao, Ciya; Velipasaoglu, Emre; Lai, Larry; Wu, Su‐Lin

doi:10.1007/s10791-011-9167-7

Cited by 90 publications

(79 citation statements)

References 21 publications


“…Table 4 provides some information on the evaluation measures that were used in the present study. For the adhoc/news and adhoc/web tasks, we consider the binary Average Precision (AP), Q-measure (Q) (Sakai 2005), normalised Discounted Cumulative Gain (nDCG) (Järvelin and Kekäläinen 2002) and normalised Expected Reciprocal Rank (nERR) (Chapelle et al 2011), all computed using the NTCIREVAL toolkit.…”

Section: Methods (mentioning)

confidence: 99%

“…For the diversity/web task, we consider α-nDCG (Clarke et al 2009) and Intent-Aware nERR (nERR-IA) (Chapelle et al 2011) computed using ndeval, as well as D-nDCG and D♯-nDCG (Sakai and Song 2011) computed using NTCIREVAL. When using NTCIREVAL, the gain value for each LX-relevant document was set to $g(r) = 2^{x} - 1$: for example, the gain for an L3-relevant document is 7, while that for an L1-relevant document is 1.…”

Section: Methods (mentioning)

confidence: 99%
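The gain setting quoted above can be illustrated with a minimal sketch. The function name `gain` and the integer encoding of relevance levels (L1 as 1, L3 as 3) are assumptions made for illustration; this is not code from NTCIREVAL.

```python
def gain(level: int) -> int:
    """Exponential gain for an Lx-relevant document: 2^x - 1.

    `level` is the integer x in "Lx-relevant"; 0 is assumed to mean
    non-relevant, which yields a gain of 0.
    """
    if level < 0:
        raise ValueError("relevance level must be non-negative")
    return 2 ** level - 1


# Gains for relevance levels L0 (non-relevant) through L3.
gains = [gain(x) for x in range(4)]
```

With this mapping each additional relevance level roughly doubles the gain, so highly relevant documents dominate the score.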


Topic set size design

Sakai. 2015. Inf Retrieval J


Traditional pooling-based information retrieval (IR) test collections typically have n = 50-100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata's three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively. These topic set size design methods require topic-by-run score matrices from past test collections for the purpose of estimating the within-system population variance for a particular evaluation measure. While the previous work of Sakai incorrectly used estimates of the total variances, here we use the correct estimates of the within-system variances, which yield slightly smaller topic set sizes than those reported previously by Sakai. Moreover, this study provides a comparison across the three methods. Our conclusions nevertheless echo those of Sakai: as different evaluation measures can have vastly different within-system variances, they require substantially different topic set sizes under the same set of statistical requirements; by analysing the tradeoff between the topic set size and the pool depth for a particular evaluation measure in advance, researchers can build statistically reliable yet highly economical test collections.


“…It is a shame that such expensive collections are not reusable, although we have shown that condensed-list metrics may provide more accurate results for non-contributors than traditional metrics. Given this situation, one useful future direction for diversity evaluation would be to establish a methodology for efficient and economical construction of disposable diversity test collections: instead of explicitly defining a set of possible intents for each topic a priori, would it be possible to automatically extract implicit intents from a given set of systems and rank them by "relative diversity"? Would the relative diversity correlate well with the users' diversity preferences?…”

Section: Discussion (mentioning)

confidence: 99%

“…Clarke et al [9] and Chapelle et al [5] have independently described α-nDCG and ERR-IA in a single framework. What distinguishes α-nDCG and ERR-IA from other diversity metrics is their per-intent diminishing return property [5]: every time a document relevant to an intent is found, the value of the next document found that is relevant to the same intent is discounted. Thus these metrics penalise redundant information for each intent, and thereby encourage diversity across intents.…”

Section: Diversity Evaluation Metrics (mentioning)

confidence: 99%
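The per-intent diminishing return described in the statement above can be made concrete with a small sketch of ERR-IA. It assumes the standard ERR cascade model, in which a document of grade g satisfies the user with probability (2^g - 1) / 2^g_max, and takes ERR-IA to be the intent-probability-weighted average of per-intent ERR. The function names, the data layout, and the toy intents are illustrative assumptions, not taken from ndeval or any other toolkit.

```python
from typing import Dict, Sequence


def stop_probability(grade: int, max_grade: int) -> float:
    """Probability that a document of this grade satisfies the user."""
    return (2 ** grade - 1) / (2 ** max_grade)


def err(grades: Sequence[int], max_grade: int) -> float:
    """Expected Reciprocal Rank for one intent's graded ranking."""
    score = 0.0
    p_continue = 1.0  # probability the user is still unsatisfied at this rank
    for rank, grade in enumerate(grades, start=1):
        r = stop_probability(grade, max_grade)
        score += p_continue * r / rank
        # Diminishing return: later documents relevant to the same intent
        # are discounted by the chance the user has already been satisfied.
        p_continue *= 1.0 - r
    return score


def err_ia(grades_by_intent: Dict[str, Sequence[int]],
           intent_probs: Dict[str, float],
           max_grade: int) -> float:
    """Intent-aware ERR: per-intent ERR weighted by intent probability."""
    return sum(p * err(grades_by_intent[t], max_grade)
               for t, p in intent_probs.items())


# Hypothetical query with two equally likely intents and binary grades.
probs = {"a": 0.5, "b": 0.5}
redundant = err_ia({"a": [1, 1], "b": [0, 0]}, probs, max_grade=1)
diverse = err_ia({"a": [1, 0], "b": [0, 1]}, probs, max_grade=1)
```

In this toy case the ranking that covers both intents scores higher than the one that returns two documents for the same intent, which is the behaviour the quoted passage attributes to these metrics.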

The Reusability of a Diversified Search Test Collection

Sakai, Dou, Song, et al. 2012. Information Retrieval Technology


Traditional "ad hoc" test collections, typically built based on depth-100 pools, are often used a posteriori by non-contributors, i.e., research groups that did not contribute to the pools. The Leave One Out (LOO) test is useful for testing whether the test collections are actually reusable: that is, whether the non-contributors can be evaluated fairly relative to the contributors' official performances. In contrast, at the recent web search result diversification tasks of TREC and NTCIR, diversity test collections have been built using shallow pools: the pool depths lie between 20 and 40. Thus it is unlikely that these diversity test collections are reusable: in fact, the organisers of these diversity tasks never claimed that they are. Nevertheless, these collections are also used a posteriori by non-contributors. In light of this, Sakai et al. [21] demonstrated by means of LOO tests that the NTCIR-9 INTENT-1 Chinese diversity test collection is not reusable, and also showed that condensed-list evaluation metrics generally provide better estimates of the non-contributors' true performances than raw evaluation metrics. This paper generalises and strengthens their findings through LOO tests with the latest TREC 2012 diversity test collection.


“…Rafiei et al [12] modeled the diversity problem as expectation maximization and presented algorithms to estimate the optimization parameters. In [4], documents are selected sequentially according to relevance. The relevance is conditioned on documents having been already selected.…”

Section: Introduction (mentioning)

confidence: 99%
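The idea in the statement above, selecting documents one at a time with relevance conditioned on what has already been selected, can be sketched as a greedy loop. This is only one possible instantiation under assumed inputs (per-intent relevance probabilities and an intent distribution); it is not presented as the algorithm of the cited paper, and `greedy_diversify` and its data layout are hypothetical.

```python
from typing import Dict, List


def greedy_diversify(rel: Dict[str, Dict[str, float]],
                     intent_probs: Dict[str, float],
                     k: int) -> List[str]:
    """Greedily pick up to k documents by conditional expected relevance.

    rel[doc][intent] is an assumed estimate of P(relevant | doc, intent).
    """
    selected: List[str] = []
    # Probability mass of each intent not yet satisfied by selected documents.
    unsatisfied = dict(intent_probs)
    remaining = set(rel)
    while remaining and len(selected) < k:
        # Score each candidate given what has already been selected;
        # sorting first makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda d: sum(unsatisfied[t] * rel[d].get(t, 0.0)
                              for t in unsatisfied),
        )
        selected.append(best)
        remaining.remove(best)
        # Condition on the new selection: intents it covers matter less now.
        for t in unsatisfied:
            unsatisfied[t] *= 1.0 - rel[best].get(t, 0.0)
    return selected


# Hypothetical example: d1 and d2 serve intent "a", d3 serves intent "b".
ranking = greedy_diversify(
    {"d1": {"a": 0.9}, "d2": {"a": 0.8}, "d3": {"b": 0.7}},
    {"a": 0.6, "b": 0.4},
    k=2,
)
```

Here d2 is individually more relevant than d3, but once d1 has covered intent "a" the conditional score of d2 drops, so d3 is chosen second.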

mNIR: Diversifying Search Results Based on a Mixture of Novelty, Intention and Relevance

Hemayati, Dehkordi, Meng. 2012. Web Information Systems Engineering - WISE 2012


Current search engines do not explicitly take different meanings and usages of user queries into consideration when they rank the search results. As a result, they tend to retrieve results that cover the most popular meanings or usages of the query. Consequently, users who want results that cover a rare meaning or usage of the query, or results that cover all different meanings/usages, may have to go through a large number of results in order to find the desired ones. Another problem with current search engines is that they do not adequately take users' intention into consideration. In this paper, we introduce a novel result ranking algorithm (mNIR) that explicitly takes result novelty, user intention-based distribution and result relevancy into consideration and mixes them to achieve better result ranking. We analyze how giving different emphasis to the above three aspects would impact the overall ranking of the results. Our approach builds on our previous method for identifying and ranking possible categories of any user query based on the meanings and usages of the terms and phrases within the query. These categories are also used to generate category queries for retrieving results matching different meanings/usages of the original user query. Our experimental results show that the proposed algorithm can outperform state-of-the-art diversification approaches.

