German's Next Language Model
2020 · Preprint
DOI: 10.48550/arxiv.2010.10906

Abstract: In this work we present the experiments which led to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation driven approach in training these models and our results indicate that both adding more data and …
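As a rough illustration of how released checkpoints of this kind are typically used downstream, here is a minimal sketch of loading a German BERT model for a classification task. The model id deepset/gbert-base and the two-label setup are assumptions for the example, not details taken from the abstract.

```python
# Hedged sketch: loading a released German BERT checkpoint for fine-tuning.
# "deepset/gbert-base" is an assumed Hugging Face model id; substitute the
# checkpoint you actually intend to use.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "deepset/gbert-base"  # assumption for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels); fine-tuning on labeled data would follow
```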

Cited by 11 publications (13 citation statements) · References 15 publications

Citation statements (ordered by relevance):
“…In addition, we list the best results that were achieved without a Transformer architecture (Riedl and Padó 2018). While our approach outperforms the BiLSTM results of Riedl and Padó (2018), we were not able to reach the results of Chan, Schweter, and Möller (2020). One of the reasons is that our BERTs are smaller and also pre-trained on a much smaller text corpus.…”
Section: Table 10 (mentioning)
confidence: 91%
“…This means that only either all tokens belonging to a word are masked or none of the tokens. Recent work of Chan, Schweter, and Möller (2020) and Cui et al (2019) already showed the positive effect of WWM in pre-training on the performance of the downstream task. In this article, we also examine the differences between the original MLM task and the MLM task with WWM.…”
Section: Whole-Word Masking (WWM) (mentioning)
confidence: 99%
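For readers unfamiliar with WWM, the following is a minimal sketch of the idea described in that statement (either all WordPiece sub-tokens of a word are masked or none are). It is an illustration of the technique, not the authors' pre-training code; the masking probability and [MASK] token name are the usual BERT conventions.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Mask WordPiece tokens so that every sub-token of a chosen word is
    replaced by [MASK], or none of them is (whole-word masking sketch)."""
    # Group sub-token indices into whole words: in BERT's WordPiece
    # convention a token starting with "##" continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# Example: "Sprach" + "##modell" is masked as a unit or left intact as a unit.
print(whole_word_mask(["Das", "Sprach", "##modell", "ist", "gut"], mask_prob=0.5))
```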
“…For experiments with fine-tuning, we use language-specific BERT models for German (Chan et al, 2020), Spanish (Canete et al, 2020), Dutch (de Vries et al, 2019), Finnish (Virtanen et al, 2019), Danish, Croatian (Ulčar and Robnik-Šikonja, 2020), while we use mBERT (Devlin et al, 2019) for Afrikaans.…”
Section: Experimental Protocol (mentioning)
confidence: 99%
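A sketch of the kind of setup that statement describes, i.e. picking a language-specific BERT checkpoint and fine-tuning it for a token-level task such as NER, is shown below. The model ids in the dictionary and the label count are assumptions for illustration, not taken from the cited paper.

```python
# Hedged sketch: selecting a language-specific BERT checkpoint for token
# classification (e.g. NER). Model ids here are assumed examples.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoints = {
    "de": "deepset/gbert-base",           # German (assumed id)
    "nl": "GroNLP/bert-base-dutch-cased", # Dutch (assumed id)
}

lang = "de"
tokenizer = AutoTokenizer.from_pretrained(checkpoints[lang])
model = AutoModelForTokenClassification.from_pretrained(checkpoints[lang], num_labels=9)

enc = tokenizer("Angela Merkel besuchte Berlin.", return_tensors="pt")
logits = model(**enc).logits  # one row of label scores per sub-token
print(logits.shape)
```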