2008
DOI: 10.1145/1480506.1480527

Low-cost and robust evaluation of information retrieval systems

Abstract: Research in Information Retrieval has progressed against a background of rapidly increasing corpus size and heterogeneity, with every advance in technology quickly followed by a desire to organize and search more unstructured, more heterogeneous, and even bigger corpora. But as retrieval problems get larger and more complicated, evaluating the ranking performance of a retrieval engine gets harder: evaluation requires human judgments of the relevance of documents to queries, and for very large corpora the cost …

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

0
16
0

Year Published

2010
2010
2018
2018

Publication Types

Select...
3
3

Relationship

0
6

Authors

Journals

Cited by 9 publications (16 citation statements); references 68 publications.
“…Carterette showed that the mean and variance for precision at k and average precision have analytical forms [6]. Given a query Q ∈ Q, these analytical forms are:…”
Section: Interval Estimates of Reusability
confidence: 99%
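
The excerpt is cut off before the formulas themselves. As a hedged sketch of what those analytical forms look like, assume (as in the minimal-test-collections setting) that the relevance of the document at rank i is a Bernoulli random variable X_i with P(X_i = 1) = p_i, independent across documents; the notation here is mine, not the citing paper's. Precision at k then has

\mathbb{E}[\mathrm{Prec}@k] = \frac{1}{k}\sum_{i=1}^{k} p_i,
\qquad
\operatorname{Var}[\mathrm{Prec}@k] = \frac{1}{k^{2}}\sum_{i=1}^{k} p_i\,(1 - p_i).

Average precision can likewise be written as a quadratic form in the relevance variables, roughly AP \propto \sum_{i \le j} X_i X_j / \max(i, j) up to normalization by the number of relevant documents, so its mean and variance follow from the first and second moments of the X_i; the exact coefficients and the treatment of the normalizer are given in [6].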
“…Although we primarily focus on precision at k and average precision in this paper, it should be noted that analytical forms for the means and variances of other retrieval metrics exist, including recall and NDCG [6]. Thus, our interval-based reusability measures can be easily applied to these metrics, as well.…”
Section: Confidence Intervals
confidence: 99%
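
A hedged illustration of how an interval estimate could be formed from such a mean and variance (the normal approximation here is my assumption, not necessarily the construction used in the cited work): treat the estimated metric \hat{\mu} as approximately normal and take

\hat{\mu} \pm z_{1-\alpha/2}\,\sqrt{\widehat{\operatorname{Var}}[\hat{\mu}]},
\qquad\text{e.g. } \hat{\mu} \pm 1.96\,\sqrt{\widehat{\operatorname{Var}}[\hat{\mu}]}\ \text{for a 95\% interval.}

Since [6] gives the mean and variance for precision at k, average precision, recall, and NDCG, the same construction carries over to each of those metrics.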
“…This has led researchers to pursue low cost strategies for constructing manual test collections. Two emerging evaluation paradigms are minimal test collections [8,7,6] and crowdsourcing [2]. Both of these strategies are useful for low-cost one-time evaluations.…”
Section: Pseudo Test Collections
confidence: 99%
“…The queries are either sampled from query logs or manually generated. Each query is then issued to one or more retrieval systems, which returns candidate documents that are then judged, either via pooling [21,37], the minimal test collection paradigm [8,7,6], or crowdsourcing [2].…”
Section: Pseudo Test Collections
confidence: 99%
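
As a minimal sketch of the pooling step mentioned above: in depth-k pooling, the documents judged for a query are the union of the top k results from each participating system. The function and variable names below are hypothetical, chosen for illustration rather than taken from the cited papers.

def depth_k_pool(rankings, k=100):
    # rankings: dict mapping a system id to its ranked list of document ids
    # for a single query; returns the set of document ids sent to assessors.
    pool = set()
    for ranked_docs in rankings.values():
        pool.update(ranked_docs[:k])  # each system contributes its top k
    return pool

# Example with two systems and pool depth 2: yields {'d1', 'd2', 'd3'}.
pool = depth_k_pool({"sysA": ["d1", "d2", "d4"], "sysB": ["d2", "d3", "d5"]}, k=2)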
“…However, in more constrained research environments these options are not available, and relevance judgments are usually provided by humans. To reduce the cost of this potentially expensive process, researchers have developed low-cost evaluation strategies, including minimal test collections [2] and crowdsourcing [1]. Despite the usefulness of these strategies, the resulting relevance judgments cannot easily be "ported" to a new or different corpus.…”
Section: Introduction
confidence: 99%