Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 2021
DOI: 10.18653/v1/2021.eacl-srw.24

Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers

Abstract: We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are among the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish. Specifically, we sh…

Cited by 8 publications (20 citation statements)
References 22 publications
“…The micro-averaged F1-score was a solid 68% on average. Similar to previous studies (Repo et al., 2021; Biber and Egbert, 2016), we observed a large variation among classes, ranging from an F1-score of 44% (Informational persuasion) to 81% (Lyrical). Our method was able to extract stable keywords for all the classes except for Spoken, where no keyword candidate passed the selection frequency threshold.…”
Section: Predictive Performance (supporting)
confidence: 90%
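As a note on the metric in the excerpt above: the following is a minimal sketch of how a micro-averaged F1-score is computed for multilabel register predictions using scikit-learn. The label matrices are invented for illustration and are not the cited study's data.

```python
# Micro-averaged F1 for multilabel register classification (illustrative data).
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical binary label matrices: rows = documents, columns = register
# classes (e.g. Narrative, Informational persuasion, Lyrical, Spoken).
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 0],
                   [1, 0, 1, 0],
                   [0, 0, 1, 1]])

# Micro averaging pools true/false positives across all classes before
# computing F1, so frequent classes dominate the aggregate score.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Per-class scores expose the large between-class variation the study reports.
per_class_f1 = f1_score(y_true, y_pred, average=None)
print(micro_f1, per_class_f1)
```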
“…As a classifier, we use the XLM-R deep language model (Conneau et al., 2020) because of its strong ability to model multiple languages, both in monolingual and cross-lingual settings. We opt for the base size rather than the large one, due to its relatively frugal resource use and comparable predictive performance on CORE (Repo et al., 2021). The task is modeled as a multilabel classification task using a sequence classification head, binary cross-entropy loss over sigmoid outputs, and a fixed prediction threshold.…”
Section: Classifier and Attribution Methods (mentioning)
confidence: 99%
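The setup described in this excerpt maps naturally onto the Hugging Face transformers API. Below is a minimal sketch under stated assumptions, not the authors' code: the register count and threshold value are placeholders, and problem_type="multi_label_classification" is used to obtain binary cross-entropy over sigmoid outputs.

```python
# Sketch of an XLM-R base multilabel register classifier (assumptions noted).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_REGISTERS = 8   # assumed number of register labels, not from the paper
THRESHOLD = 0.5     # fixed prediction threshold; the value is an assumption

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=NUM_REGISTERS,
    # This setting makes the model use BCEWithLogitsLoss, i.e. binary
    # cross-entropy over sigmoid outputs, matching the excerpt's description.
    problem_type="multi_label_classification",
)

batch = tokenizer(["Un exemple de document web ..."],
                  return_tensors="pt", truncation=True)
labels = torch.zeros((1, NUM_REGISTERS))
labels[0, 2] = 1.0  # hypothetical gold register for the example document

outputs = model(**batch, labels=labels)
loss = outputs.loss  # binary cross-entropy with sigmoid

# At inference time, apply sigmoid and the fixed threshold to obtain a
# (possibly multi-)label set per document.
probs = torch.sigmoid(outputs.logits)
predicted = (probs > THRESHOLD).int()
```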
“…In this regard, further studies are needed to develop resources for Web registers in languages other than English. Indeed, recent studies suggest that cross-lingual modeling of Web registers is achievable (Repo et al., 2021; Rönnqvist et al., 2021), but it is unclear to what extent cultural differences affect the representation of Web registers and even the very existence of specific registers. Therefore, in the future, extending the research of Web registers to a widely multilingual setting would be greatly beneficial.…”
Section: General Discussion and Conclusion (mentioning)
confidence: 99%
“…For instance, Multilingual BERT (Devlin et al., 2018) and XLM-R (Conneau et al., 2020) provide cross-lingual language models that can be fine-tuned to model data in multilingual settings. These have also been applied successfully to register identification (Repo et al., 2021; Rönnqvist et al., 2021). Another line of work has focused on smaller and faster models, such as DistilBERT, created to overcome the computational challenges of the original models (Sanh et al., 2019).…”
Section: Pretrained Language Models and Transfer Learning (mentioning)
confidence: 99%
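To make the size trade-off behind distilled models concrete, here is a short sketch that loads both checkpoints and counts their parameters. The model names are public Hugging Face identifiers; the comparison itself is illustrative and not taken from the cited work.

```python
# Compare parameter counts of a full cross-lingual model and a distilled one.
from transformers import AutoModel

for name in ["xlm-roberta-base", "distilbert-base-multilingual-cased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```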