2020
DOI: 10.1007/978-3-030-45442-5_2
The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines

Abstract: Current best practices for the evaluation of search engines do not take duplicate documents into account. Depending on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Although these negative effects were demonstrated long ago by Bernstein and Zobel [4], we find that this has failed to move the community. In this paper, we reproduce the aforementioned study and extend…
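The inflation effect described in the abstract can be illustrated with a minimal sketch. The run data, document identifiers, and duplicate annotations below are hypothetical; the point is only that crediting near-duplicates of an already-retrieved document raises precision@k compared to discounting them:

```python
# Minimal sketch (illustrative data): counting near-duplicates as relevant
# inflates precision@k compared to discounting them during evaluation.

def precision_at_k(ranking, k, discount_duplicates=False):
    """ranking: list of (doc_id, is_relevant, duplicate_of_or_None)."""
    seen = set()
    hits = 0
    for doc_id, is_relevant, duplicate_of in ranking[:k]:
        if discount_duplicates and duplicate_of in seen:
            continue  # near-duplicate of an already-seen document: no credit
        if is_relevant:
            hits += 1
        seen.add(doc_id)
    return hits / k

# Hypothetical result list: d2 and d3 are near-duplicates of the relevant d1.
run = [("d1", True, None), ("d2", True, "d1"),
       ("d3", True, "d1"), ("d4", False, None), ("d5", True, None)]

print(precision_at_k(run, 5))                            # 0.8 (inflated)
print(precision_at_k(run, 5, discount_duplicates=True))  # 0.4
```

A system that diligently filters d2 and d3 before returning results would score lower under the naive measure, which is exactly the penalty the paper describes.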

Cited by 12 publications (5 citation statements). References 15 publications.
“…This allows us to develop and maintain services such as the ChatNoir web search engine [Be18], the argument search engine args.me [Wa17], but also facilitates individual research projects which rely on utilizing the bundled computational power and storage capacity of the capable Hadoop, Kubernetes, and Ceph clusters. A few recent examples are the analysis of massive query logs [Bo20], a detailed inquiry into near-duplicates in web search [Fr20b,Fr20a], the preparation, parsing, and assembly of large NLP corpora [Be20,Ke20], and the exploration of large transformer models for argument retrieval [AP20].…”
Section: Discussion
confidence: 99%

Web Archive Analytics
Völske, Bevendorff, Kiesel et al. 2021 (Preprint, self-citation)
“…Strategic providers can create duplicates or near-duplicates (possibly hard to identify automatically) of their existing offerings in a ranking system. Since certain fair ranking mechanisms may try to ensure benefits for all listed items, providers with more copies of the same items stand to gain more benefits [57,43]. We give such an example in Appendix Table 4.…”
Section: Strategic Behavior
confidence: 99%
“…on the RS to improve their ranking [175], influencers on social network sites could try to hijack popular trends [65,32]. Providers can even strategically exploit the deployed fair ranking mechanisms to extract more benefits [57,43]. Not factoring in such strategic behavior could impact ranking and recommendation systems, and especially the performance of fair ranking mechanisms.…”
Section: Strategic Behavior
confidence: 99%
“…Therefore, they introduce the so-called novelty principle, which states that a document, though relevant in isolation, is irrelevant if it is a near-duplicate of a document the user has already seen on the search engine result page. Especially on web crawls with many near-duplicates, the novelty principle has a non-negligible impact on evaluating search engines [14]. For example, applying the novelty principle to the runs submitted to the Terabyte track 2004 decreases mean average precision scores by 20% on average [6].…”
Section: Risk: Evaluation of Search Engines
confidence: 99%
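The novelty principle quoted above can be sketched in a few lines. This is a simplified illustration, not the cited papers' implementation: duplicate clusters are assumed to be known in advance, the run data is hypothetical, and the number of relevant documents is held fixed across both conditions. A relevant document earns credit toward average precision only if no member of its content cluster appeared earlier in the ranking:

```python
# Sketch of the novelty principle applied to average precision (AP),
# assuming duplicate clusters are known: a relevant document is credited
# only if it is the first member of its cluster seen in the ranking.

def average_precision(ranking, total_relevant, apply_novelty=False):
    """ranking: list of (cluster_id, is_relevant) pairs, top to bottom."""
    seen_clusters = set()
    hits, ap = 0, 0.0
    for rank, (cluster_id, is_relevant) in enumerate(ranking, start=1):
        novel = cluster_id not in seen_clusters
        seen_clusters.add(cluster_id)
        if is_relevant and (novel or not apply_novelty):
            hits += 1
            ap += hits / rank
    return ap / total_relevant

# Hypothetical run: the second result duplicates the top-ranked document.
run = [("c1", True), ("c1", True), ("c2", False), ("c3", True)]
ap_plain = average_precision(run, total_relevant=3)
ap_novel = average_precision(run, total_relevant=3, apply_novelty=True)
```

Here the duplicate at rank 2 stops contributing under the novelty principle, so the score drops, mirroring the direction of the MAP decrease reported for the Terabyte track runs.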