Noel Masasi scite author profile

Noel Masasi

4Publications

2Citation Statements Received

5Citation Statements Given

How they've been cited

How they cite others

Affiliations

University of Dar es Salaam

Publications

Order By: Most citations

Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words

Masua

Masasi

2020

Data in Brief

View full text Add to dashboard Cite

Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, Swahili language have been disadvantaged and is classified as low resource language because of inadequate data for NLP especially basic textual datasets that are useful during pre-processing stage. In this article we develop and contribute common Swahili Stop-words, common Swahili Slangs and common Swahili Typos datasets. The main source for these datasets were short Swahili messages collected from Tanzanian platform that is used by young people to convey their opinions on things that matters to them. Therefore, we derive list of common Swahili stop-words by reviewing most frequent words that are generated with Python script from our corpus, review common slang with help of Swahili experts with their corresponding proper words, and generate common Swahili typos by analysing least frequent words generated by a Python script from corpus. The datasets were exported into files for easy access and reuse. These datasets can be reused in natural language processing as resources in pre-processing phase for Swahili textual data.

show abstract

Common Swahili Stop-words

Masasi¹,

Masua²

2020

View full text Add to dashboard Cite

Common Swahili Slangs

Masasi¹,

Masua²

2020

View full text Add to dashboard Cite

Common Swahili Typos

Masasi¹,

Masua²

2020

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Noel Masasi

Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words

Common Swahili Stop-words

Common Swahili Slangs

Common Swahili Typos

Contact Info

Product

Resources

About