Common Swahili Slangs

Masasi, Noel; Masua, Bernard

doi:10.17632/b8tc96xf3h.1

Cited by 1 publication

(2 citation statements)

References 0 publications

“…Swahili Stop Words dataset containing 254 unique words [3], common Swahili Slangs dataset containing 234 words for slang and their respective Swahili proper words [4] and common Swahili Typos dataset containing 431 misspelled words their respective Swahili proper words [5].…”

Section: Research Developed and Contributed Common (mentioning)

confidence: 99%

The Impact of Applying Different Pre-Processing Techniques on Swahili Textual Data Using Doc2Vec

Masua, Masasi, Maziku et al. 2023, NLPRE

Data pre-processing is an important step in machine learning text classification as it improves data quality and hence improves performance of trained algorithms. We experimentally compare the following pre-processing techniques: punctuation removal, lowercasing, typos replacement, slang replacement and stop-word removal on a Swahili short message service (SMS) dataset for topic classification. Different machine learning algorithms are applied such as Random Forest, Stochastic Gradient Descent, RNN LSTM Unidirectional, RNN LSTM Bidirectional and Support Vector Machine. We analyze the impact of the pre-processing techniques on classification accuracy and f1-score. Our experiments show that all pre-processing steps, when applied separately, have a positive impact on the performance of all evaluated classification algorithms. Among all experimented pre-processing steps, stop-word removal has the highest impact on performance of both accuracy and f1-score metrics. Also, of all evaluated algorithms, Random Forest and Stochastic Gradient Descent are the most positively affected with pre-processing steps.
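
The five pre-processing techniques named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the lookup tables `TYPOS`, `SLANG`, and `STOP_WORDS` are tiny hypothetical stand-ins for the published datasets, the function name `preprocess` is invented here, and the steps are chained for brevity even though the study evaluates them separately.

```python
import string

# Hypothetical, illustrative lookup tables. They are NOT entries copied
# from the Swahili typos, slang, or stop-word datasets.
TYPOS = {"habri": "habari"}
SLANG = {"poa": "nzuri"}
STOP_WORDS = {"na", "ya", "wa"}


def preprocess(text):
    """Apply the five steps in sequence and return a list of tokens."""
    # 1. Punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercasing, then whitespace tokenization
    tokens = text.lower().split()
    # 3. Typo replacement (unknown tokens pass through unchanged)
    tokens = [TYPOS.get(tok, tok) for tok in tokens]
    # 4. Slang replacement
    tokens = [SLANG.get(tok, tok) for tok in tokens]
    # 5. Stop-word removal
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("Habri ya leo, Poa!"))
```

Dictionary lookups with a pass-through default keep each step independent, so any single step can be switched off to measure its effect in isolation.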

“…Performing replacement of slang means that words with similar meaning are merged to have clean token information and reduce dimensionality. This study applies a slang dataset from [4] to replace slang with proper words.…”

Section: Replacing Common Swahili Slang (mentioning)

confidence: 99%
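
To illustrate why merging slang into proper words reduces dimensionality, here is a small sketch under stated assumptions: the pairs in `slang_map` are invented for the example and are not taken from the Common Swahili Slangs dataset, and `replace_slang` is a hypothetical helper, not a function from the cited study.

```python
def replace_slang(tokens, slang_map):
    """Map each slang token to its proper word; leave other tokens as-is."""
    return [slang_map.get(tok, tok) for tok in tokens]


# Illustrative pairs only (assumption), in the slang -> proper word
# shape that the dataset description suggests.
slang_map = {"poa": "nzuri"}

# Two toy documents that differ only in slang versus proper wording.
corpus = [["chakula", "poa"], ["chakula", "nzuri"]]

vocab_before = {tok for doc in corpus for tok in doc}
cleaned = [replace_slang(doc, slang_map) for doc in corpus]
vocab_after = {tok for doc in cleaned for tok in doc}

# The vocabulary shrinks because "poa" and "nzuri" now share one token.
print(len(vocab_before), len(vocab_after))
```

After replacement the two documents become identical, which is the merging effect the citation statement describes.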