Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have been trained either on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.
Abstract. The Earth System Model Evaluation Tool (ESMValTool) is a community
diagnostics and performance metrics tool designed to improve comprehensive
and routine evaluation of Earth system models (ESMs) participating in the
Coupled Model Intercomparison Project (CMIP). It has undergone rapid
development since the first release in 2016 and is now a well-tested tool
that provides end-to-end provenance tracking to ensure reproducibility. It
consists of (1) an easy-to-install, well-documented Python package providing the
core functionalities (ESMValCore) that performs common preprocessing
operations and (2) a diagnostic part that includes tailored diagnostics and
performance metrics for specific scientific applications. Here we describe
large-scale diagnostics of the second major release of the tool that
supports the evaluation of ESMs participating in CMIP Phase 6 (CMIP6).
ESMValTool v2.0 includes a large collection of diagnostics and performance
metrics for atmospheric, oceanic, and terrestrial variables for the mean
state, trends, and variability. ESMValTool v2.0 also successfully reproduces
figures from the evaluation and projections chapters of the
Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report
(AR5) and incorporates updates from targeted analysis packages, such as the
NCAR Climate Variability Diagnostics Package for the evaluation of modes of
variability, the Thermodynamic Diagnostic Tool (TheDiaTo) to evaluate the
energetics of the climate system, as well as parts of AutoAssess, which
contains a mix of top-down performance metrics. The tool has been fully
integrated into the Earth System Grid Federation (ESGF) infrastructure at
the Deutsches Klimarechenzentrum (DKRZ) to provide evaluation results from
CMIP6 model simulations shortly after the output is published to the CMIP
archive. A result browser has been implemented that enables advanced
monitoring of the evaluation results by a broad user community on much
shorter timescales than was possible in CMIP5.