Establishing Strong Baselines For TripClick Health Retrieval

Hofstätter, Sebastian; Althammer, Sophia; Sertkan, Mete; Hanbury, Allan

doi:10.1007/978-3-030-99739-7_17

Cited by 7 publications

(8 citation statements)

References 15 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For TripJudge and TripClick unlabelled documents are considered as irrelevant. In Table 2 the effectiveness metrics as well as the judgements coverage measured as 𝐽 at rank 𝑛 is displayed for various lexical and neural retrieval systems from Hofstätter et al [13]. For TripJudge we see that the coverage measure with J@5 for the runs in the pool (run 1,2,7) is high (around 80%) compared to the coverage of the runs which did not participate in the pooling.…”

Section: System Evaluationmentioning

confidence: 99%

“…For the pool creation we use the runs from Hofstätter et al [13]. In order to have different first stage retrieval methods we use the lexical retrieval run with BM25 [24] (run 1 in Table 2) as well as the SciBERT 𝐷𝑂𝑇 run (run 2 in Table 2) which is based on dense retrieval [3,15].…”

Section: Data and Pool Preparationmentioning

confidence: 99%

“…As the TripClick test sets are based on the clicks of the users, the test sets are biased towards the retrieval model employed by the search engine [34], which remains unknown. In a previous study [13] the test sets were shown to have a low annotation coverage of at most 41% of the Top10 results for lexical and neural retrieval models.…”

Section: Introductionmentioning

confidence: 99%

“…These campaigns follow the Cranfield paradigm [9] to create relevance judgements on the pooled output of the participating systems. Recently there has been a growing interest in evaluating the retrieval performance of retrieval models for domain-specific retrieval tasks [2,13,14,27,36] including the medical domain [22,23,35]. Domain-specific retrieval tasks often lack a reliable test collection with human relevance judgments following the Cranfield paradigm [22,27].…”

Section: Introductionmentioning

confidence: 99%

“…We collect relevance judgements by running an annotation campaign on the test set queries of TripClick. In order to increase the reusability of our test collection [6], we use three participating systems for the pool creation from Hofstätter et al [13] employing lexical and neural retrieval models. To control the quality of the relevance assessments we monitor the annotation time per query, we employ a graded relevance scheme [1,12] and we employ multiple relevance assessments per query-document pair (we aim for three assessments but have at least two).…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

Althammer,

Hofstätter,

Verberne

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

Robust test collections are crucial for Information Retrieval research. Recently there is a growing interest in evaluating retrieval systems for domain-specific retrieval tasks, however these tasks often lack a reliable test collection with human-annotated relevance assessments following the Cranfield paradigm. In the medical domain, the TripClick collection was recently proposed, which contains click log data from the Trip search engine and includes two click-based test sets. However the clicks are biased to the retrieval model used, which remains unknown, and a previous study shows that the test sets have a low judgement coverage for the Top-10 results of lexical and neural retrieval models. In this paper we present the novel, relevance judgement test collection TripJudge for TripClick health retrieval. We collect relevance judgements in an annotation campaign and ensure the quality and reusability of TripJudge by a variety of ranking methods for pool creation, by multiple judgements per query-document pair and by an at least moderate inter-annotator agreement. We compare system evaluation with TripJudge and TripClick and find that that click and judgement-based evaluation can lead to substantially different system rankings. CCS CONCEPTS• Information systems → Test collections.

show abstract

Section: System Evaluationmentioning

confidence: 99%

Section: Data and Pool Preparationmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

Althammer,

Hofstätter,

Verberne

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

Parameter-Efficient Sparse Retrievers and Rerankers Using Adapters

Pal

Lassance²,

Dejean³

et al. 2023

Lecture Notes in Computer Science

View full text Add to dashboard Cite

A Unified Framework for Learned Sparse Retrieval

Nguyen

MacAvaney

Yates

2023

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been recently introduced, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, many LSR methods show substantial differences in effectiveness and efficiency. Differences in the experimental setups and configurations used make it difficult to compare the methods and derive insights. In this work, we analyze existing LSR methods and identify key components to establish an LSR framework that unifies all LSR methods under the same perspective. We then reproduce all prominent methods using a common codebase and re-train them in the same environment, which allows us to quantify how components of the framework affect effectiveness and efficiency. We find that (1) including document term weighting is most important for a method's effectiveness, (2) including query weighting has a small positive impact, and (3) document expansion and query expansion have a cancellation effect. As a result, we show how removing query expansion from a state-of-the-art model can reduce latency significantly while maintaining effectiveness on MSMarco and TripClick benchmarks. Our code is publicly available. 3

show abstract

Establishing Strong Baselines For TripClick Health Retrieval

Cited by 7 publications

References 15 publications

TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

Parameter-Efficient Sparse Retrievers and Rerankers Using Adapters

A Unified Framework for Learned Sparse Retrieval

Contact Info

Product

Resources

About