2015
DOI: 10.1007/s10791-015-9275-x

Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation

Abstract: Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users is the possible disagreement …

Cited by 10 publications (11 citation statements); References 32 publications. Citing publications: 2016–2023.

Citation statements (ordered by relevance):
“… Cumulative Gain@k (CG@k) is the sum of the gains associated with the first k recommended items in any sequence. Gain is the score assigned to each recommended item based on its relevancy, and CG is the sum of all recommendation outcomes’ graded relevance scores [113]. The challenge with CG is that it ignores the result set’s rank when calculating its utility: Discounted Cumulative Gain@k (DCG@k) weighs each recommendation score based on its position.…”
Section: Results (mentioning)
confidence: 99%
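The distinction this excerpt draws between CG@k and DCG@k can be made concrete with a small sketch. This is not code from the cited paper; it assumes the common log2 rank discount, and the gain lists in the usage example are purely illustrative.

import math

def cg_at_k(gains, k):
    # Cumulative Gain@k: plain sum of the first k graded relevance scores,
    # ignoring where in the ranking each score occurs.
    return sum(gains[:k])

def dcg_at_k(gains, k):
    # Discounted Cumulative Gain@k: each gain is divided by the log of its
    # rank position, so relevant items ranked lower contribute less.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

# The same graded relevance scores in two different orders give identical
# CG@3 but different DCG@3, which is exactly the rank-insensitivity of CG.
print(cg_at_k([3, 2, 0], 3), dcg_at_k([3, 2, 0], 3))  # 5, ~4.26
print(cg_at_k([0, 2, 3], 3), dcg_at_k([0, 2, 3], 3))  # 5, ~2.76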
“…Disagreement between annotators can signal weaknesses of the annotation scheme, or highlight the inherent ambiguity in what we are trying to measure. Disagreement itself can be meaningful and can be integrated in subsequent analyses (Aroyo and Welty, 2013; Demeester et al., 2016).…”
Section: Operationalization (mentioning)
confidence: 99%
“…The analysis also demonstrates that different effectiveness metrics lead to substantially different within-system variances, and therefore topic set size estimates, highlighting the importance of choosing metrics that are appropriate for the search task being analyzed. Demeester et al. (2016) address the issue of the generalizability of relevance assessments and how this impacts the reliability of retrieval results in their paper, Predicting Relevance based on Assessor Disagreement: Analysis and Practical Applications for Search Evaluation. When test collections are built, it is common for one, or a small number of, people to make relevance assessments of the information objects that have been returned for topics.…”
Section: Overview Of Papers (mentioning)
confidence: 99%
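The link this excerpt draws between within-system (per-topic) score variance and topic set size estimates follows the general logic of statistical power analysis. The sketch below is not the method used in the paper under discussion; it is the textbook paired-difference sample size calculation, and all numbers in the usage example are illustrative.

import math
from statistics import NormalDist

def topics_needed(diff_variance, min_diff, alpha=0.05, power=0.80):
    # Number of topics required so that a paired comparison of two systems
    # can detect a mean score difference of min_diff at significance level
    # alpha with the requested power, given the variance of the per-topic
    # score differences under the chosen effectiveness metric.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    return math.ceil((z_alpha + z_beta) ** 2 * diff_variance / min_diff ** 2)

# A metric with larger per-topic variance needs far more topics to detect
# the same difference, which is why the choice of metric matters.
print(topics_needed(diff_variance=0.04, min_diff=0.05))  # 126
print(topics_needed(diff_variance=0.01, min_diff=0.05))  # 32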
“…However, this is accepted as one of the limitations of test collection-based evaluation. Demeester et al. (2016) introduce the Predicted Relevance Model (PRM), which predicts the relevance of a result for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. The basic idea is that a greater degree of disagreement leads to a more uncertain prediction of relevance.…”
Section: Overview Of Papers (mentioning)
confidence: 99%
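The intuition described in this excerpt can be sketched for the simplest binary case. This is an illustrative simplification of the idea that more disagreement yields a less certain prediction, not the estimator defined in Demeester et al. (2016); the agreement values in the usage example are hypothetical.

def predicted_relevance(observed_relevant, agreement):
    # Probability that a random user would find the result relevant, given
    # one observed binary assessment and the estimated probability that an
    # independent assessor would repeat that label (pairwise agreement).
    # Lower agreement pushes the prediction toward the uninformative 0.5.
    return agreement if observed_relevant else 1.0 - agreement

# With 90% average agreement, a "relevant" judgment predicts 0.90 relevance
# for a random user; with 55% agreement, the same judgment predicts only 0.55.
print(predicted_relevance(True, 0.90))  # 0.90
print(predicted_relevance(True, 0.55))  # 0.55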