Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2020
DOI: 10.1145/3397271.3401212
Sampling Bias Due to Near-Duplicates in Learning to Rank

Cited by 12 publications (4 citation statements) | References 22 publications
“…This allows us to develop and maintain services such as the ChatNoir web search engine [Be18], the argument search engine args.me [Wa17], but also facilitates individual research projects which rely on utilizing the bundled computational power and storage capacity of the capable Hadoop, Kubernetes, and Ceph clusters. A few recent examples are the analysis of massive query logs [Bo20], a detailed inquiry into near-duplicates in web search [Fr20b,Fr20a], the preparation, parsing, and assembly of large NLP corpora [Be20,Ke20], and the exploration of large transformer models for argument retrieval [AP20].…”
Section: Discussion
confidence: 99%

Web Archive Analytics

Völske, Bevendorff, Kiesel et al. 2021
Preprint
Self Cite
“…Prior works have suggested several ways to understand the relationship between retrieval effectiveness and quality of test collections via empirical analyses. For instance, train-test leakage [38], retrievability bias due to query length [68], sampling bias due to near-duplicates [22], or saturated leaderboards unable to distinguish any meaningful improvements [3] were examined. However, prior work has missed out on evaluating the impact of document corpora on retrieval effectiveness, i.e., the potential impact of non-relevant documents present within a corpus on neural models.…”
Section: Background and Related Work
confidence: 99%
“…This leakage of information is not possible during the training of learning to rank models because the train/test partitioning is done per query. Still, not removing near-duplicates during the training of learning to rank models decreases the effectiveness of models and biases the trained models [13].…”
Section: Risk: Training of Learning to Rank Models
confidence: 99%
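The safeguard quoted above (per-query train/test partitioning plus near-duplicate removal before training) can be made concrete with a short sketch. The following Python fragment is a minimal illustration under assumed data structures; none of the names or thresholds come from the cited paper [13].

```python
# Minimal sketch: split a learning-to-rank dataset by query so that no
# query's documents land in both splits, then drop near-duplicate documents
# per query before training. All names are illustrative, not from [13].
import random
from collections import defaultdict

def split_by_query(samples, test_fraction=0.2, seed=42):
    """samples: list of (query_id, doc_id, features, relevance) tuples."""
    by_query = defaultdict(list)
    for s in samples:
        by_query[s[0]].append(s)
    queries = sorted(by_query)
    random.Random(seed).shuffle(queries)
    n_test = int(len(queries) * test_fraction)
    test_queries = set(queries[:n_test])
    train = [s for q in queries[n_test:] for s in by_query[q]]
    test = [s for q in test_queries for s in by_query[q]]
    return train, test

def drop_near_duplicates(samples, fingerprint):
    """Keep one document per (query, fingerprint) pair; `fingerprint` maps a
    doc_id to a hash identifying its (near-)duplicate class."""
    seen, kept = set(), []
    for q, d, x, y in samples:
        key = (q, fingerprint(d))
        if key not in seen:
            seen.add(key)
            kept.append((q, d, x, y))
    return kept
```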
“…A study [13] on the ClueWeb09 with 42 ranking features using popular algorithms finds that near-duplicates in the training data harm the retrieval performance, since the presence of near-duplicates is unaccounted for in the loss-minimization of learning to rank and in subsequent evaluations. Furthermore, by varying the number of Wikipedia near-duplicates in the training set, the study showed that models might be biased towards retrieving near-duplicates at higher positions.…”
Section: Risk: Training of Learning to Rank Models
confidence: 99%
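Varying or removing near-duplicates, as in the study summarized above, presupposes a way to detect them. The excerpts do not state which fingerprinting scheme was used; the sketch below assumes a generic 64-bit SimHash over word 3-grams, a common choice for near-duplicate detection in web corpora, and is not a reconstruction of the cited method.

```python
# Generic 64-bit SimHash over word 3-grams: each gram casts a +/-1 vote per
# bit position; the fingerprint keeps the bits with a positive vote sum.
import hashlib

def _hash64(token):
    # Stable 64-bit hash of a token (blake2b truncated to 8 bytes).
    return int.from_bytes(
        hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")

def simhash(text, n=3):
    words = text.lower().split()
    grams = [" ".join(words[i:i + n])
             for i in range(max(len(words) - n + 1, 1))]
    counts = [0] * 64
    for g in grams:
        h = _hash64(g)
        for b in range(64):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(64) if counts[b] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

# Two documents are treated as near-duplicates if their fingerprints differ
# in at most k bits, e.g. hamming(simhash(d1), simhash(d2)) <= 3.
```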