Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2020
DOI: 10.1145/3397271.3401212
Sampling Bias Due to Near-Duplicates in Learning to Rank

Cited by 12 publications (4 citation statements) | References 22 publications
“…This allows us to develop and maintain services such as the ChatNoir web search engine [Be18], the argument search engine args.me [Wa17], but also facilitates individual research projects which rely on utilizing the bundled computational power and storage capacity of the capable Hadoop, Kubernetes, and Ceph clusters. A few recent examples are the analysis of massive query logs [Bo20], a detailed inquiry into near-duplicates in web search [Fr20b,Fr20a], the preparation, parsing, and assembly of large NLP corpora [Be20,Ke20], and the exploration of large transformer models for argument retrieval [AP20].…”
Section: Discussion
confidence: 99%

Web Archive Analytics

Völske, Bevendorff, Kiesel et al. 2021
Preprint
Self Cite
“…Prior works have suggested several ways to understand the relationship between retrieval effectiveness and quality of test collections via empirical analyses. For instance, train-test leakage [38], retrievability bias due to query length [68], sampling bias due to near-duplicates [22], or saturated leaderboards unable to distinguish any meaningful improvements [3] were examined. However, prior work has missed out on evaluating the impact of document corpora on retrieval effectiveness, i.e., the potential impact of non-relevant documents present within a corpus on neural models.…”
Section: Background and Related Work
confidence: 99%
“…This leakage of information is not possible during the training of learning to rank models because the train/test partitioning is done per query. Still, not removing near-duplicates during the training of learning to rank models decreases the effectiveness of models and biases the trained models [13].…”
Section: Risk: Training of Learning to Rank Models
confidence: 99%
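The safeguard quoted above (per-query train/test partitioning plus near-duplicate removal before training) can be made concrete with a short sketch. The following Python fragment is a minimal illustration under assumed data structures; none of the names or thresholds come from the cited paper [13].

```python
# Minimal sketch: split a learning-to-rank dataset by query so that no
# query's documents land in both splits, then drop near-duplicate documents
# per query before training. All names are illustrative, not from [13].
import random
from collections import defaultdict

def split_by_query(samples, test_fraction=0.2, seed=42):
    """samples: list of (query_id, doc_id, features, relevance) tuples."""
    by_query = defaultdict(list)
    for s in samples:
        by_query[s[0]].append(s)
    queries = sorted(by_query)
    random.Random(seed).shuffle(queries)
    n_test = int(len(queries) * test_fraction)
    test_queries = set(queries[:n_test])
    train = [s for q in queries[n_test:] for s in by_query[q]]
    test = [s for q in test_queries for s in by_query[q]]
    return train, test

def drop_near_duplicates(samples, fingerprint):
    """Keep one document per (query, fingerprint) pair; `fingerprint` maps a
    doc_id to a hash identifying its (near-)duplicate class."""
    seen, kept = set(), []
    for q, d, x, y in samples:
        key = (q, fingerprint(d))
        if key not in seen:
            seen.add(key)
            kept.append((q, d, x, y))
    return kept
```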
“…A study [13] on the ClueWeb09 with 42 ranking features using popular algorithms finds that near-duplicates in the training data harm the retrieval performance, since the presence of near-duplicates is unaccounted for in the loss-minimization of learning to rank and in subsequent evaluations. Furthermore, by varying the number of Wikipedia near-duplicates in the training set, the study showed that models might be biased towards retrieving near-duplicates at higher positions.…”
Section: Risk: Training of Learning to Rank Models
confidence: 99%
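Varying or removing near-duplicates, as in the study summarized above, presupposes a way to detect them. The excerpts do not state which fingerprinting scheme was used; the sketch below assumes a generic 64-bit SimHash over word 3-grams, a common choice for near-duplicate detection in web corpora, and is not a reconstruction of the cited method.

```python
# Generic 64-bit SimHash over word 3-grams: each gram casts a +/-1 vote per
# bit position; the fingerprint keeps the bits with a positive vote sum.
import hashlib

def _hash64(token):
    # Stable 64-bit hash of a token (blake2b truncated to 8 bytes).
    return int.from_bytes(
        hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")

def simhash(text, n=3):
    words = text.lower().split()
    grams = [" ".join(words[i:i + n])
             for i in range(max(len(words) - n + 1, 1))]
    counts = [0] * 64
    for g in grams:
        h = _hash64(g)
        for b in range(64):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(64) if counts[b] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

# Two documents are treated as near-duplicates if their fingerprints differ
# in at most k bits, e.g. hamming(simhash(d1), simhash(d2)) <= 3.
```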