2021
DOI: 10.1109/access.2021.3116857

Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales

Abstract: Information Retrieval (IR) is a discipline deeply rooted in evaluation since its inception. Indeed, experimentally measuring and statistically validating the performance of IR systems are the only possible ways to compare systems and understand which are better than others and, ultimately, more effective and useful for end-users. Since the seminal paper by Stevens [103], it is known that the properties of the measurement scales determine the operations you should or should not perform with values from those sc…

Cited by 43 publications (13 citation statements)
References 104 publications (123 reference statements)
“…Specifically, we find that across all four datasets, there are significant differences in performance when training with 𝑟 = 1 compared to the full dataset. Despite large absolute differences for RR on DL 19 and 20, the measure is statistically unstable [7] (especially with few queries available for TREC DL), and did not result in significance. In some cases, there are also statistically significant differences between 𝑟 = 5 and the full dataset (Dev RR@10, DL20 AP, and Robust04 nDCG@20).…”
Section: Results
confidence: 83%
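A minimal sketch of the kind of per-query significance testing this excerpt alludes to, assuming two runs' per-query RR@10 scores over the same query set; the paired t-test and all score values below are illustrative assumptions, not the test or the data used in the cited work.

```python
# Illustrative sketch only: paired significance test over per-query scores.
# The score arrays are hypothetical, not taken from the cited study.
import numpy as np
from scipy.stats import ttest_rel

# Per-query RR@10 for a model trained with r = 1 vs. the full training set.
rr_r1   = np.array([1.0, 0.50, 0.00, 0.33, 1.0, 0.25])
rr_full = np.array([1.0, 1.00, 0.50, 0.33, 1.0, 0.50])

# Paired t-test over the same queries; with very few queries (as in TREC DL),
# even large mean differences can fail to reach significance.
t_stat, p_value = ttest_rel(rr_r1, rr_full)
print(f"mean diff = {np.mean(rr_full - rr_r1):.3f}, t = {t_stat:.3f}, p = {p_value:.3f}")
```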
“…Given a minimum rank 𝑟, we filter down the official MS MARCO training triple sequence to only samples where the positive query-document pair appears within the top 𝑟 documents presented to the annotator. 7 Given the high effectiveness and ease of training of the monoT5 model, we select it as a representative ranking model for this experiment. Using both 𝑟 = 1 and 𝑟 = 5, we train a monoT5 over 256k samples, a mini-batch size of 8, and a learning rate of 5 × 10⁻⁵.…”
Section: Methods
confidence: 99%
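A minimal sketch of the filtering step this excerpt describes, under assumed data structures: each training sample is taken to carry the rank at which its positive document was shown to the annotator, and the field names (`pos_rank`, `query`, …) are hypothetical, not the MS MARCO schema or the authors' code. The hyperparameter values are the ones quoted above.

```python
# Illustrative sketch only: filter training triples by the rank of the positive
# document, as described in the excerpt. Data layout and field names are assumed.
all_triples = [
    {"query": "q1", "pos_doc": "d3", "neg_doc": "d9", "pos_rank": 1},
    {"query": "q2", "pos_doc": "d7", "neg_doc": "d2", "pos_rank": 4},
    {"query": "q3", "pos_doc": "d5", "neg_doc": "d1", "pos_rank": 12},
]

def filter_triples(triples, max_rank):
    """Keep only samples whose positive query-document pair appeared within
    the top `max_rank` documents presented to the annotator."""
    return [t for t in triples if t["pos_rank"] <= max_rank]

subset_r1 = filter_triples(all_triples, max_rank=1)   # r = 1
subset_r5 = filter_triples(all_triples, max_rank=5)   # r = 5

# Training configuration quoted in the excerpt; wiring it into a monoT5 trainer
# depends on the toolkit, so this dict is only a summary of the stated values.
monot5_config = {
    "training_samples": 256_000,  # 256k samples
    "batch_size": 8,              # mini-batch size
    "learning_rate": 5e-5,        # 5 × 10⁻⁵
}
```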
“…In this section, we provide only the background required to fully understand the work reported in the current paper. In particular, Craswell et al [11] address concerns raised by Ferrante et al [12], who apply measurement theory to draw attention to important shortcomings of established evaluation measures, such as MRR. Many of these measures are not interval scaled, and therefore many common statistical tests are not permissible; properly, these measures should not even be averaged.…”
Section: MS MARCO
confidence: 99%
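A small numeric illustration of the interval-scale point, as a worked example rather than something taken from either cited paper: reciprocal rank maps equal differences in rank to very unequal differences in score, which is why averaging it is questionable on measurement-theoretic grounds.

```python
# Illustrative only: equal rank differences yield unequal reciprocal-rank
# differences, so RR does not behave like an interval scale under averaging.
rr = lambda rank: 1.0 / rank

print(rr(1) - rr(2))    # 0.5    (rank 1 vs. rank 2)
print(rr(9) - rr(10))   # ~0.011 (rank 9 vs. rank 10)
```

Both pairs differ by exactly one rank position, yet the first gap is roughly 45 times larger in RR terms.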
“…All reported results are averages over 30 independent runs performed under identical circumstances. We report Normalized 𝐷𝐶𝐺@𝐾 (NDCG@K) as our ranking performance metric computed on the held-out test-set of each dataset; following the advice of Ferrante et al [9], we do not use query-normalization but dataset-normalization: we divide the 𝐷𝐶𝐺@𝐾 of a ranking model by the maximum possible 𝐷𝐶𝐺@𝐾 value on the entire test-set of the dataset.…”
Section: Methods
confidence: 99%
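A minimal sketch of one plausible reading of that dataset-level normalization, assuming graded relevance judgments per query; the data, the function names, and the exact way achieved DCG is aggregated are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: normalise total DCG@K by the maximum possible DCG@K
# over the whole test set, instead of per-query ideal-DCG normalisation.
import math

def dcg_at_k(gains, k):
    """Standard DCG@K with a log2 rank discount over a ranked list of gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

K = 20
# Gains of documents in the order a ranking model returned them, per query (hypothetical).
ranked_gains = {"q1": [3, 0, 1, 0], "q2": [0, 2, 0, 0]}
# Graded relevance judgments for every query in the test set (hypothetical).
all_judgments = {"q1": [3, 1, 0, 0], "q2": [2, 0, 0, 0]}

# Maximum possible DCG@K on the entire test set: ideal DCG summed across queries.
max_dcg = sum(dcg_at_k(sorted(g, reverse=True), K) for g in all_judgments.values())

# Dataset-normalised score: achieved DCG@K divided by that single maximum.
model_dcg = sum(dcg_at_k(g, K) for g in ranked_gains.values())
print(model_dcg / max_dcg)
```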