Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.23
SwahBERT: Language Model of Swahili

Abstract: The rapid development of social networks, electronic commerce, mobile Internet, and other technologies has influenced the growth of Web data. Social media and Internet forums are valuable sources of citizens' opinions, which can be analyzed for community development and user behavior analysis. Unfortunately, the scarcity of resources (i.e., datasets or language models) has become a barrier to the development of natural language processing applications in low-resource languages. Thanks to the recent growth of o…

Cited by 11 publications (3 citation statements); references 22 publications.
“…Tanvir et al. (2021) similarly show that an Estonian-specific BERT outperforms multilingual variants in five out of seven tasks. Likewise, Martin et al. (2022) find that a BERT variant trained from the ground up on a Swahili dataset outperforms multilingual models. BERTić, a variant trained on Bosnian, Croatian, Montenegrin, and Serbian, also outperforms both mBERT and a trilingual Croatian, Slovene, and English BERT in nearly every task (Ljubešić and Lauc, 2021).…”
Section: Introduction
confidence: 78%
“…In addition to supporting the creation of NMT models (discussed in the following section), our datasets have the potential to serve as a foundation for many other NLP tasks beyond translation. We believe that these datasets will be a valuable resource for the study of South African government communication, and that they can be used directly for multilingual document categorisation/classification (Schwenk and Li, 2018), simplification (Siddharthan, 2014; Martin et al., 2022), entity extraction (Tedeschi et al., 2021; Chen et al., 2018; Pappu et al., 2017; Emelyanov and Artemova, 2019), and other NLP tasks. To further extend the datasets' usefulness, we recommend looking at work such as the Parallel Meaning Bank (Abzianidze et al., 2017), which can act as an inspiration for transferring knowledge from one language to another and provide new benchmarks that may be helpful for Southern African languages beyond South Africa.…”
Section: Creation of ZA-gov-multilingual
confidence: 99%
“…Recently, researchers have developed corpora for question answering and emotion classification in Swahili (Martin et al., 2022). In addition, a transformer model for Swahili has been developed (Martin et al., 2022).…”
Section: Related Work: NLP Approaches for Swahili
confidence: 99%