CC-News-En

Mackenzie, Joel; Benham, Rodger; Petri, Matthias; Trippas, Johanne R.; Culpepper, J. Shane; Moffat, Alistair

doi:10.1145/3340531.3412762

Cited by 37 publications

(4 citation statements)

References 41 publications


“…Unlike BERT, RoBERTa underwent pretraining using an expanded dataset, comprising of five English-language corpora that totaled over 160 GB of uncompressed text. These corpora include BOOKCORPUS [29], WIKIPEDIA, CC-NEWS [21], OPENWEBTEXT [11], STORIES [25].…”

Section: Discussion (mentioning)

confidence: 99%

Using Natural Language Processing as a Scalable Mental Status Evaluation Technique

Wagner, Jagayat, Kumar et al. 2023

Preprint


Mental health is in a state of crisis with demand for mental health services significantly surpassing available care. As such, building scalable and objective measurement tools for mental health evaluation is of primary concern. Given the usage of spoken language in diagnostics and treatment, it stands out as potential methodology. Here a model is built for mental health status evaluation using natural language processing. Specifically, a RoBERTa-based model is fine-tuned on text from psychotherapy sessions to predict mental health status with prediction accuracy on par with clinical evaluations at 74%.


“…Tiedemann [22] presented OPUS, an extensive freely available parallel corpus encompassing over 200 languages with tools for exploration and integration, enhancing research and development in linguistic studies. Initiatives such as Mackenzie et al's [24] creation of the CC-News-En corpus from the Common Crawl Foundation data mitigated the shortage of journalism corpora to an extent. To clarify corpus evaluation, Lefer's [25] chapter on Parallel Corpora in "A Practical Handbook of Corpus Linguistics" outlined the main features of parallel corpora.…”

Section: Corpus Construction (mentioning)

confidence: 99%

WCC-EC 2.0: Enhancing Neural Machine Translation with a 1.6M+ Web-Crawled English-Chinese Parallel Corpus

Zhang, Su, Tian et al. 2024

Electronics


This research introduces WCC-EC 2.0 (Web-Crawled Corpus—English and Chinese), a comprehensive parallel corpus designed for enhancing Neural Machine Translation (NMT), featuring over 1.6 million English-Chinese sentence pairs meticulously gathered via web crawling. This corpus, extracted through an advanced web crawler, showcases the vast linguistic diversity and richness of English and Chinese, uniquely spanning the rarely covered news and music domains. Our methodical approach in web crawling and corpus assembly, coupled with rigorous experiments and manual evaluations, demonstrated its superiority by achieving high BLEU scores, marking significant strides in translation accuracy and model resilience. Its inclusion of these specific areas adds significant value, providing a unique dataset that enriches the scope for NMT research and development. With the rise of NMT technology, WCC-EC 2.0 emerges not only as an invaluable resource for researchers and developers, but also as a pivotal tool for improving translation accuracy, training more resilient models, and promoting interlingual communication.


“…We pre-train our models with a combination of publicly available text corpora, viz. BookCorpus (BookC) (Zhu et al, 2015), Wikipedia English (Wiki), OpenWebText (OWT) (Gokaslan & Cohen, 2019), and CC-News (CCN) (Mackenzie et al, 2020). We borrow most training hyperparameters from RoBERTa.…”

Section: B2 Model Training (mentioning)

confidence: 99%

FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?

Tuli, Dedhia, Tuli et al. 2023

JAIR


The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. However, training such models and exploring their hyperparameter space is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, such works limit analysis to homogeneous models that use fixed dimensionality throughout the network. This leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization, to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves 8.9% higher GLUE score. A FlexiBERT model with equivalent performance as the best homogeneous model has 2.6× smaller size. FlexiBERT-Large, another proposed model, attains state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.
