The Reusability of a Diversified Search Test Collection

Sakai, Tetsuya; Dou, Zhicheng; Song, Ruihua; Kando, Noriko

doi:10.1007/978-3-642-35341-3_3

Cited by 11 publications

(14 citation statements)

References 29 publications

“…However, it should be noted that diversity test collections are highly unlikely to be reusable [9,11]: thus, if researchers want to continue improving diversified search, we do require a new diversity test collection. Note also that now a new corpus, ClueWeb12, is available [4].…”

Section: Future Directions

mentioning

confidence: 99%

“…Whereas, all runs from LIA and TUTA1 significantly underperformed THUIR-S-E-4A. Chinese Subtopic Mining (Figure 4): TUTA1-S-C-1A outperformed all other runs in terms of Mean D♯-nDCG, but the six participating teams are statistically indistinguishable from one another. The TREC 2011 and 2012 diversity test collections have graded relevance assessments; all TREC diversity test collections (2009-2012) have the informational and navigational subtopic tags.…”

mentioning

confidence: 97%

Summary of the NTCIR-10 INTENT-2 task

Sakai, Dou, Yamamoto et al. 2013

Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval

Self Cite

The NTCIR INTENT task comprises two subtasks: Subtopic Mining, where systems are required to return a ranked list of subtopic strings for each given query; and Document Ranking, where systems are required to return a diversified web search result for each given query. This paper summarises the novel features of the Second INTENT task at NTCIR-10 and its main findings, and poses some questions for future diversified search evaluation.

“…The exact cutoff z used for each run is referred to as the pool depth. This strategy tends to find most relevant documents for each topic, but provides no guarantees particularly when entirely new systems are evaluated [33,8,24,23].…”

Section: Introduction

mentioning

confidence: 98%

Improving test collection pools with machine learning

Jayasinghe, Webber, Sanderson et al. 2014

Proceedings of the 2014 Australasian Document Computing Symposium

IR experiments typically use test collections for evaluation. Such test collections are formed by judging a pool of documents retrieved by a combination of automatic and manual runs for each topic. The proportion of relevant documents found for each topic depends on the diversity across each of the runs submitted and the depth to which runs are assessed (pool depth). Manual runs are commonly believed to reduce bias in test collections when evaluating new IR systems. In this work, we explore alternative approaches to improving test collection reliability. Using fully automated approaches, we are able to recognise a large portion of relevant documents that would normally only be found through manual runs. Our approach combines simple fusion methods with machine learning. The approach demonstrates the potential to find many more relevant documents than are found using traditional pooling approaches. Our initial results are promising and can be extended in future studies to help test collection curators ensure proper judgment coverage is maintained across the entire document collection.
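The depth-based pooling described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `build_pool` and `unjudged_fraction`, the run identifiers, and the document identifiers are all hypothetical, and the sketch assumes each run is simply a ranked list of document IDs per topic. It shows why a run that did not contribute to the pool may retrieve documents that were never judged, which is the reusability concern raised in the citation statement.

```python
from collections import defaultdict


def build_pool(runs, depth):
    """Union of the top-`depth` documents of every run, per topic.

    runs: {run_id: {topic_id: [doc_id, ...]}} with documents in rank order
          (a hypothetical in-memory layout, assumed for illustration).
    Returns {topic_id: set of pooled doc_ids}.
    """
    pool = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic_id, ranked_docs in ranking_by_topic.items():
            pool[topic_id].update(ranked_docs[:depth])
    return dict(pool)


def unjudged_fraction(ranked_docs, pooled_docs, depth):
    """Share of a run's top-`depth` documents that were never pooled (judged)."""
    top = ranked_docs[:depth]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc not in pooled_docs) / len(top)


# Two contributing runs, pooled to depth 2 for a single topic.
runs = {
    "runA": {"t1": ["d1", "d2", "d3"]},
    "runB": {"t1": ["d2", "d4", "d5"]},
}
pool = build_pool(runs, depth=2)

# A new run that did not contribute to the pool.
new_run = ["d9", "d1", "d4"]
missing = unjudged_fraction(new_run, pool["t1"], depth=3)
```

Here `d9` was retrieved only by the new run, so it has no judgment; how such documents are treated (for example, assumed non-relevant) is a separate design choice that this sketch does not make.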

“…Actually, there exist another option that we can reuse the historical labels in evaluation to save the labeling efforts. Nevertheless, due to the existence of the unlabeled documents, current measures for novelty and diversity are not reusable [26].…”

Section: Introduction

mentioning

confidence: 99%

Towards Robust & Reusable Evaluation for Novelty & Diversity

Hui 2014

Proceedings of the 7th Workshop on Ph.D Students

Existing IR measures for offline evaluation directly bring in the labels into computation, where the labels are on the entire documents. This direct dependency makes the measure highly reliant on the completeness of the labels, consequently the measure values are sensitive towards missing labels, resulting in poor robustness and reusability. To mitigate this, we propose a novel evaluation approach, constructing an intermediate layer between the labels and the measure, improving the robustness and reusability by dampening the direct dependency, as well as considering the content of the document in the measure computation. In particular, we propose to estimate a language model based on a selected relevant document set to construct a ground truth, afterward using the divergence between the search result and this ground truth to compute measures. To further save labeling efforts and to improve efficiency, we select representative documents, query set and topic terms involved in the evaluation separately before computing the measure. Preliminary experiments on the diversity tasks of TREC Web Track 2009-2012, using ClueWeb09-A as a document collection, show that with as little as 30% of judgments our novel approach almost accurately reconstructs the original system rankings determined by α-nDCG, ERR-IA, and NRBP.
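The idea of scoring a result list by its divergence from a language model built on relevant documents can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which divergence or smoothing is used, so KL divergence and additive smoothing are choices made here, and the helper names `unigram_lm` and `kl_divergence` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def unigram_lm(docs, vocab, mu=1.0):
    """Additively smoothed unigram model over `vocab` (smoothing is an assumption).

    docs: list of tokenized documents (lists of strings).
    """
    counts = Counter(tok for doc in docs for tok in doc if tok in vocab)
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Toy "ground truth": relevant documents covering two senses of a query.
relevant = [["jaguar", "car", "speed"], ["jaguar", "cat", "habitat"]]
# Two hypothetical result lists: one covers both senses, one only the first.
diverse_run = [["jaguar", "car"], ["jaguar", "cat"]]
narrow_run = [["jaguar", "car"], ["car", "speed"]]

vocab = {tok for doc in relevant + diverse_run + narrow_run for tok in doc}
truth = unigram_lm(relevant, vocab)

score_diverse = kl_divergence(truth, unigram_lm(diverse_run, vocab))
score_narrow = kl_divergence(truth, unigram_lm(narrow_run, vocab))
```

Under these assumptions a lower divergence means the result list is closer to the ground-truth model, so the run covering both senses scores better than the narrow one without requiring a judgment for every retrieved document.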
