Text classification dataset and analysis for Uzbek language

Kuriyozov, Elmurod; Salaev, Ulugbek; Matlatipov, Sanatbek; Matlatipov, Gayrat

doi:10.48550/arxiv.2302.14494

Cited by 2 publications

(2 citation statements)

References 0 publications

“…English-based news datasets such as 20 Newsgroups, Reuters-21578, and RCV1 comprise thousands of articles collected from several news websites, news magazines, and newsletters (newspapers) to create the different news corpora. Similarly, news datasets in Arabic (ALJ-news dataset [41]), Uzbek [42], Urdu (Urdu-news [43]), and South African (Setswana and Sepedi-news [28]) languages were also collected from a cross-section of news portals. For our study, the Ewe news dataset was collected from popular news portals, which include Ghana News, Voice of Africa, Togo First, Punch News, BBC-Africa, My Joy News, and Citi News.…”

Section: Data Collection (mentioning)

confidence: 99%

“…The portals were selected to represent various categories, such as politics, coronavirus, sports, business, entertainment, and local news. The news articles were automatically extracted using the open-source Python library Beautiful Soup, as in [29,42]. Eight native speakers of the Ewe language are invited to label the dataset simultaneously.…”

Section: Data Collection (mentioning)

confidence: 99%

Pre-Trained Transformer-Based Models for Text Classification Using Low-Resourced Ewe Language

Agbesi, Chen, Yussif et al. 2023. Systems

Despite a few attempts to automatically crawl Ewe text from online news portals and magazines, the African Ewe language remains underdeveloped despite its rich morphology and complex "unique" structure. This is due to the poor quality, unbalanced, and religious-based nature of the crawled Ewe texts, thus making it challenging to preprocess and perform any NLP task with current transformer-based language models. In this study, we present a well-preprocessed Ewe dataset for low-resource text classification to the research community. Additionally, we have developed an Ewe-based word embedding to leverage the low-resource semantic representation. Finally, we have fine-tuned seven transformer-based models, namely BERT-based (cased and uncased), DistilBERT-based (cased and uncased), RoBERTa, DistilRoBERTa, and DeBERTa, using the preprocessed Ewe dataset that we have proposed. Extensive experiments indicate that the fine-tuned BERT-base-cased model outperforms all baseline models with an accuracy of 0.972, precision of 0.969, recall of 0.970, loss score of 0.021, and an F1-score of 0.970. This performance demonstrates the model’s ability to comprehend the low-resourced Ewe semantic representation compared to all other models, thus setting the fine-tuned BERT-based model as the benchmark for the proposed Ewe dataset.
