Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
DOI: 10.1145/3404835.3463254

Simplified Data Wrangling with ir_datasets

Abstract: Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR…


Cited by 61 publications (29 citation statements) | References 42 publications
“…Finally, we ask the question: what if we can afford to translate MS MARCO so that we can use a translate-train model? To investigate, we utilize the Chinese translation of the MSMARCO-v1 training triples from ColBERT-X [32], which can also be accessed via ir_datasets [30] with the dataset key neumarco/zh. Figure 2 shows that without C3, the ColBERT model improves from 0.352 to 0.421, which is still worse than zero-shot transfer models trained with C3 for CLIR, suggesting allocating effort to C3 rather than training a translation model when computational resources are limited.…”
Section: Results and Analysis
confidence: 99%
“…For TREC graded relevance (0 = non-relevant to 3 = perfect), we use the recommended binarization point of 2 for the recall metric. For out-of-domain experiments we refer to the ir_datasets catalogue [37] for collection-specific information, as we utilized the standardized test sets for the collections.…”
Section: Passage Collection and Query Sets
confidence: 99%
“…Methodology. We selected seven datasets from the ir_datasets catalogue [37]: Bio medical (TREC Covid [50,52], TripClick [40], NFCorpus [4]), Entity centric (DBPedia Entity [14]), informal language (Antique [13], TREC Podcast [23]), news cables (TREC Robust 04 [49]). The datasets are not based on web collections, have at least 50 queries, and importantly contain judgements from both relevant and non-relevant categories.…”
Section: Out-of-domain Robustness
confidence: 99%
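The selection criteria in the passage (at least 50 queries, judgments in both relevant and non-relevant categories) can be checked mechanically; a sketch with a hypothetical helper fed from `qrels_iter()`:

```python
def meets_criteria(qrels, min_queries=50):
    """Check the passage's selection criteria on (query_id, relevance) pairs:
    at least `min_queries` distinct queries, and judgments in both the
    relevant (grade > 0) and non-relevant (grade == 0) categories."""
    query_ids, categories = set(), set()
    for query_id, relevance in qrels:
        query_ids.add(query_id)
        categories.add(relevance > 0)
    return len(query_ids) >= min_queries and categories == {True, False}

# With ir_datasets (the catalogue key here is an assumption):
# import ir_datasets
# ds = ir_datasets.load("antique/test")
# print(meets_criteria((q.query_id, q.relevance) for q in ds.qrels_iter()))
```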
“…It can be used as a command line tool and as a Python package that can be integrated with other tools. DiffIR is model-agnostic; in its most basic setting, it simply accepts TREC-formatted run files and an ir_datasets [32] dataset identifier to generate an HTML output. Metrics are calculated using pytrec_eval [39] via the ir_measures package.…”
Section: Implementation Details
confidence: 99%
“…DiffIR can be run locally using the command: In the above command, run_1 and run_2 are files that contain the document rankings for each query and use the standard TREC run format. The user must specify a dataset name supported by ir_datasets [32]. In the sample command above, DiffIR would select the top ten queries whose mean average precision varies the most between the two run files and render the content as HTML.…”
Section: Demonstration
confidence: 99%
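The "standard TREC run format" referenced above is six whitespace-separated columns per ranked document; a minimal stdlib parser (the helper name is our own):

```python
def parse_trec_run(lines):
    """Parse TREC run lines of the form:
        query_id Q0 doc_id rank score run_tag
    into {query_id: {doc_id: score}}."""
    run = {}
    for line in lines:
        query_id, _q0, doc_id, _rank, score, _tag = line.split()
        run.setdefault(query_id, {})[doc_id] = float(score)
    return run

sample = [
    "q1 Q0 d7 1 14.2 my_system",
    "q1 Q0 d3 2 12.9 my_system",
]
print(parse_trec_run(sample))  # {'q1': {'d7': 14.2, 'd3': 12.9}}
```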