Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022
DOI: 10.1145/3477495.3536321
ClueWeb22: 10 Billion Web Documents with Rich Information

Abstract: ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercia…
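Large web corpora like the one described above are typically consumed as compressed, record-per-line files. As a minimal sketch only: the helper below assumes a gzip-compressed JSON-Lines layout with illustrative field names ("url", "text") that are not taken from the ClueWeb22 documentation, and it builds its own tiny sample file so it runs stand-alone.

```python
import gzip
import json

def iter_documents(path):
    """Yield one document dict per line of a gzipped JSON-Lines file.

    Hypothetical sketch: the schema (field names, file layout) is an
    assumption for illustration, not ClueWeb22's documented format.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Build a tiny sample file so the sketch is self-contained.
sample = [{"url": "https://example.com", "text": "hello web"}]
with gzip.open("sample.jsonl.gz", "wt", encoding="utf-8") as fh:
    for doc in sample:
        fh.write(json.dumps(doc) + "\n")

docs = list(iter_documents("sample.jsonl.gz"))
print(docs[0]["url"])  # https://example.com
```

Streaming line by line rather than loading whole files keeps memory flat, which matters at the 10-billion-document scale the corpus targets.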

Cited by 22 publications (7 citation statements) · References 8 publications
“…Information Retrieval (IR) has a long and robust history in research, with numerous publicly accessible datasets developed to support its advancement. Some of the most well-known and widely-used ones are MS-MARCO [2], TREC [35,38,5], Common Crawl [32], and ClueWeb22 [24]. In response to the challenges of IR, various models and methods have been proposed, utilizing both classic ones such as Vector Space Model (VSM) [31], Latent Semantic Indexing (LSI) [6] and BM25, as well as more modern transformer-based models, such as RoBERTa [19], BERT [7], and T5 [27].…”
Section: Related Work
confidence: 99%
“…BERTimbau used brWac (Wagner Filho et al, 2018), a 2.7B token dataset obtained from crawling PT-BR websites, while Albertina used a PT-PT filtered version of OS-CAR, together with PT-PT transcripts datasets from the Portuguese and the EU Parliaments (Hajlaoui et al, 2014;Koehn, 2005). Sabiá uses the Portuguese subset of ClueWeb22 (Overwijk et al, 2022). Leveraging filtered massive web crawls such as ClueWeb22 (Overwijk et al, 2022) and OSCAR (Abadji et al, 2022), the Portuguese webarchive (Gomes et al, 2008) (Arquivo.pt), encyclopedic and dialog data, we assemble and contribute with a large and highly-diverse pre-training PT-PT corpus.…”
Section: Large-scale PT Text Data
confidence: 99%
“…The datasets identified tend to adopt Common Crawl licenses. It is worth mentioning the case of the Portuguese LLM Glória (Lopes et al, 2024), where the usage of the clueweb22 dataset (Overwijk et al, 2022) as part of the training corpus required Glória's authors to adopt this license as well.…”
Section: NLP Licensing System
confidence: 99%