Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022
DOI: 10.1145/3477495.3536321
ClueWeb22: 10 Billion Web Documents with Rich Information

Abstract: ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercia…
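Large web corpora like the one described above are typically consumed as compressed, record-per-line files. As a minimal sketch only: the helper below assumes a gzip-compressed JSON-Lines layout with illustrative field names ("url", "text") that are not taken from the ClueWeb22 documentation, and it builds its own tiny sample file so it runs stand-alone.

```python
import gzip
import json

def iter_documents(path):
    """Yield one document dict per line of a gzipped JSON-Lines file.

    Hypothetical sketch: the schema (field names, file layout) is an
    assumption for illustration, not ClueWeb22's documented format.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Build a tiny sample file so the sketch is self-contained.
sample = [{"url": "https://example.com", "text": "hello web"}]
with gzip.open("sample.jsonl.gz", "wt", encoding="utf-8") as fh:
    for doc in sample:
        fh.write(json.dumps(doc) + "\n")

docs = list(iter_documents("sample.jsonl.gz"))
print(docs[0]["url"])  # https://example.com
```

Streaming line by line rather than loading whole files keeps memory flat, which matters at the 10-billion-document scale the corpus targets.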

Cited by 22 publications (7 citation statements) · References 8 publications
“…Information Retrieval (IR) has a long and robust history in research, with numerous publicly accessible datasets developed to support its advancement. Some of the most well-known and widely-used ones are MS-MARCO [2], TREC [35,38,5], Common Crawl [32], and ClueWeb22 [24]. In response to the challenges of IR, various models and methods have been proposed, utilizing both classic ones such as Vector Space Model (VSM) [31], Latent Semantic Indexing (LSI) [6] and BM25, as well as more modern transformer-based models, such as RoBERTa [19], BERT [7], and T5 [27].…”
Section: Related Work
confidence: 99%
“…BERTimbau used brWac (Wagner Filho et al, 2018), a 2.7B token dataset obtained from crawling PT-BR websites, while Albertina used a PT-PT filtered version of OS-CAR, together with PT-PT transcripts datasets from the Portuguese and the EU Parliaments (Hajlaoui et al, 2014;Koehn, 2005). Sabiá uses the Portuguese subset of ClueWeb22 (Overwijk et al, 2022). Leveraging filtered massive web crawls such as ClueWeb22 (Overwijk et al, 2022) and OSCAR (Abadji et al, 2022), the Portuguese webarchive (Gomes et al, 2008) (Arquivo.pt), encyclopedic and dialog data, we assemble and contribute with a large and highly-diverse pre-training PT-PT corpus.…”
Section: Large-scale PT Text Data
confidence: 99%
“…The datasets identified tend to adopt Common Crawl licenses. It is worth mentioning the case of the Portuguese LLM Glória (Lopes et al, 2024), where the usage of the clueweb22 dataset (Overwijk et al, 2022) as part of the training corpus required Glória's authors to adopt this license as well.…”
Section: NLP Licensing System
confidence: 99%