2021
DOI: 10.48550/arxiv.2104.09243
Preprint

BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

Nikola Ljubešić, Davor Lauc

Abstract: In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense reasoning evaluation we introduce COPA-HR, a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian.
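As a usage note, the sketch below shows one way to load the released model and obtain contextual embeddings with the Hugging Face transformers library. The checkpoint identifier classla/bcms-bertic and the example sentence are assumptions made for illustration, not details taken from the abstract.

from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name for the released BERTić model; verify before use.
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModel.from_pretrained("classla/bcms-bertic")

# Encode a short Croatian sentence and inspect the contextual embeddings.
inputs = tokenizer("Zagreb je glavni grad Hrvatske.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)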

Cited by 4 publications (5 citation statements)
References 8 publications
“…This same motivation has also led to a transformer model that exclusively uses languages of the Slavic genus. The BERTić language model (Ljubešić and Lauc, 2021) was trained from scratch for Bosnian, Croatian, Montenegrin and Serbian. Whereas the CroSloEngual model uses more distant languages, BERTić selected these languages because they are very closely related, are mutually intelligible and because they are considered part of the same Serbo-Croatian macro language (according to the ISO 639-3 Macrolanguage Mappings).…”
Section: Related Research
confidence: 99%
“…The first, and, at the time of writing of this paper, the only transformer-based language model specifically trained for Serbian, Croatian, Bosnian, and Montenegrin is BERTić (Ljubešić and Lauc, 2021). BERTić is trained using the ELECTRA approach (Clark et al., 2020) for training transformer models.…”
Section: Language Models
confidence: 99%
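As a concrete illustration of the ELECTRA training approach mentioned in the statement above, the following is a minimal sketch of replaced-token detection, the objective an ELECTRA discriminator learns, using the Hugging Face transformers library. The checkpoint identifier classla/bcms-bertic and the example sentence are assumptions for illustration, not details taken from the citing papers.

import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Assumed checkpoint name for the released BERTić discriminator.
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
discriminator = ElectraForPreTraining.from_pretrained("classla/bcms-bertic")

# Original sentence: "Zagreb je glavni grad Hrvatske."; we corrupt one
# token ("grad" -> "pas") and ask which tokens look replaced.
corrupted = "Zagreb je glavni pas Hrvatske."
inputs = tokenizer(corrupted, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits  # one score per input token

# Positive logits mark tokens the discriminator believes were replaced.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
flags = (logits[0] > 0).long().tolist()
print(list(zip(tokens, flags)))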
“…The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin, or Serbian. The collection was used to train the BERTić transformer model (Ljubešić and Lauc, 2021). The Wikipedia dumps of the Bosnian, Croatian, Macedonian, Montenegrin, Serbian, Serbo-Croatian, and Slovenian Wikipedias were collected in the comparable corpus CLASSLA-Wikipedia (CLASSLA-Wiki, Table 2).…”
Section: Multilingual Corpora
confidence: 99%