2021
DOI: 10.48550/arxiv.2104.09243
Preprint

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

Nikola Ljubešić, Davor Lauc

Abstract: In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation we introduce COPA-HR, a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian.
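As a usage note, the sketch below shows one way to load the released model and obtain contextual embeddings with the Hugging Face transformers library. The checkpoint identifier classla/bcms-bertic and the example sentence are assumptions made for illustration, not details taken from the abstract.

from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name for the released BERTić model; verify before use.
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModel.from_pretrained("classla/bcms-bertic")

# Encode a short Croatian sentence and inspect the contextual embeddings.
inputs = tokenizer("Zagreb je glavni grad Hrvatske.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)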

Cited by 4 publications (5 citation statements)
References 8 publications
“…This same motivation has also led to a transformer model that exclusively uses languages of the Slavic genus. The BERTić language model (Ljubešić and Lauc, 2021) was trained from scratch for Bosnian, Croatian, Montenegrin and Serbian. Whereas the CroSloEngual model uses more distant languages, BERTić selected these languages because they are very closely related, are mutually intelligible and because they are considered part of the same Serbo-Croatian macro language (according to the ISO 639-3 Macrolanguage Mappings).…”
Section: Related Research
confidence: 99%
“…The first, and, at the time of writing of this paper, the only transformer-based language model specifically trained for Serbian, Croatian, Bosnian, and Montenegrin is BERTić (Ljubešić and Lauc, 2021). BERTić is trained using the ELECTRA approach (Clark et al., 2020) for training transformer models.…”
Section: Language Models
confidence: 99%
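As a concrete illustration of the ELECTRA training approach mentioned in the statement above, the following is a minimal sketch of replaced-token detection, the objective an ELECTRA discriminator learns, using the Hugging Face transformers library. The checkpoint identifier classla/bcms-bertic and the example sentence are assumptions for illustration, not details taken from the citing papers.

import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Assumed checkpoint name for the released BERTić discriminator.
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
discriminator = ElectraForPreTraining.from_pretrained("classla/bcms-bertic")

# Original sentence: "Zagreb je glavni grad Hrvatske."; we corrupt one
# token ("grad" -> "pas") and ask which tokens look replaced.
corrupted = "Zagreb je glavni pas Hrvatske."
inputs = tokenizer(corrupted, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits  # one score per input token

# Positive logits mark tokens the discriminator believes were replaced.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
flags = (logits[0] > 0).long().tolist()
print(list(zip(tokens, flags)))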
“…The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin, or Serbian. The collection was used to train the BERTić transformer model (Ljubešić and Lauc, 2021). The Wikipedia dumps of the Bosnian, Croatian, Macedonian, Montenegrin, Serbian, Serbo-Croatian, and Slovenian Wikipedias were collected in the comparable corpus CLASSLA-Wikipedia (CLASSLA-Wiki, Table 2).…”
Section: Multilingual Corpora
confidence: 99%