Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020
DOI: 10.1145/3340531.3412762

CC-News-En

Abstract: We describe a static, open-access news corpus using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times […]
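The corpus is derived from the Common Crawl news crawl (CC-NEWS), which is distributed as gzipped WARC files. As a minimal sketch of how such data can be inspected (this is not the authors' CC-News-En pipeline), the snippet below streams one CC-NEWS WARC file and iterates over its HTTP response records. It assumes the third-party `requests` and `warcio` packages; the file name in the URL is a placeholder, since real file names are listed in each month's `warc.paths.gz` index.

```python
# Minimal sketch: stream a CC-NEWS WARC file and walk its response records.
# Assumptions: `requests` and `warcio` are installed, and the WARC file name
# below is a placeholder (real names appear in the monthly warc.paths.gz index).
import requests
from warcio.archiveiterator import ArchiveIterator

WARC_URL = (
    "https://data.commoncrawl.org/crawl-data/CC-NEWS/"
    "2016/09/CC-NEWS-20160926211809-00000.warc.gz"  # placeholder file name
)

def iter_news_records(url):
    """Yield (target_uri, raw_html_bytes) for each HTTP response record."""
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        # ArchiveIterator transparently decompresses the gzipped WARC stream.
        for record in ArchiveIterator(resp.raw):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            yield uri, record.content_stream().read()

if __name__ == "__main__":
    for uri, html in iter_news_records(WARC_URL):
        print(uri, len(html))
        break  # stop after the first record for a quick check
```

The English-language filtering, deduplication, and the September 2016 to March 2018 collection window described in the abstract would be applied on top of a loop like this one.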

Cited by 37 publications (4 citation statements) · References 41 publications
“…Unlike BERT, RoBERTa underwent pretraining using an expanded dataset comprising five English-language corpora that totaled over 160 GB of uncompressed text. These corpora include BOOKCORPUS [29], WIKIPEDIA, CC-NEWS [21], OPENWEBTEXT [11], STORIES [25].…”
Section: Discussion (mentioning)
confidence: 99%
“…Tiedemann [22] presented OPUS, an extensive freely available parallel corpus encompassing over 200 languages with tools for exploration and integration, enhancing research and development in linguistic studies. Initiatives such as Mackenzie et al's [24] creation of the CC-News-En corpus from the Common Crawl Foundation data mitigated the shortage of journalism corpora to an extent. To clarify corpus evaluation, Lefer's [25] chapter on Parallel Corpora in "A Practical Handbook of Corpus Linguistics" outlined the main features of parallel corpora.…”
Section: Corpus Construction (mentioning)
confidence: 99%
“…We pre-train our models with a combination of publicly available text corpora, viz. BookCorpus (BookC) (Zhu et al, 2015), Wikipedia English (Wiki), OpenWebText (OWT) (Gokaslan & Cohen, 2019), and CC-News (CCN) (Mackenzie et al, 2020). We borrow most training hyperparameters from RoBERTa.…”
Section: B2 Model Training (mentioning)
confidence: 99%