Stemming Malay Text and Its Application in Automatic Text Categorization

Yasukawa, Michiko; Lim, Hui Tian; Yokoo, Hidetoshi

doi:10.1587/transinf.e92.d.2351

Cited by 11 publications

(1 citation statement)

References 9 publications

“…Othman [17] and Ahmad [18] algorithms are the most pioneering rule-based Malay stemmers. Even though there are plenty of rule-based stemming approaches for Malay that have been improved by the previous researchers since then, they still suffer from affixation errors, including over-stemming, understemming, unchanged, and spelling exceptions [19], [20], [21]. The major causes of this stemming error are the affix removal method, the similarity of the root word with the affixation word, and exception rules in prefixation and confixation [22], [23].…”

Section: Introduction

A Comparative Study of Stemming Techniques on the Malay Text

Mohemad, Muhait, Noor et al. 2023

IJACSA

Text stemming, an essential preprocessing step in Natural Language Processing (NLP) applications, transforms the various forms of a word into its root word. Stemming reduces the volume of text and thereby improves the efficiency of computational tasks such as information retrieval, text classification, and text clustering. Stemming is commonly rule-based, and it frequently suffers from affixation errors that result in under-stemming, over-stemming, or both, as well as unstemmed words or spelling exceptions. Stemming techniques differ from language to language, and among the most well-known Malay stemming algorithms are the Othman and Ahmad algorithms. This study therefore compares the stemming errors of the Othman and Ahmad algorithms on Malay text from two domains: course summaries from the education domain and housebreaking crime reports from the crime domain. The Othman algorithm uses a set of 121 stemming rules (Set A), while the Ahmad algorithm proposes two distinct sets of stemming rules, comprising 432 (Set B) and 561 rules (Set C), respectively. In the experiment with 100 course summaries, the Ahmad algorithm (Set B) obtained the highest accuracy, 93.61%, followed by the Ahmad algorithm (Set C) with 93.53%; the Othman algorithm achieved the lowest accuracy, 86.04%. The experiment with 100 housebreaking crime reports shows similar results, with the Ahmad algorithm (Set C) achieving the highest stemming accuracy of approximately 93.80% and the Othman algorithm the lowest (83.09%). The results indicate that stemming accuracy is consistent across different types of datasets.
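To make the error types named in the abstract concrete, here is a minimal Python sketch of a dictionary-checked affix-stripping stemmer with a simple accuracy evaluation. It is an illustration only: the affix lists, the root dictionary, the gold pairs, and the function names `stem`, `classify`, and `evaluate` are all assumptions chosen for the example. They are not the Othman or Ahmad rule sets, and the error labels are a rough length-based heuristic rather than the paper's evaluation procedure.

```python
from collections import Counter

# Hypothetical toy resources; NOT the Othman (Set A) or Ahmad (Set B/C) rules.
PREFIXES = ("mem", "me", "ber", "di", "pe")
SUFFIXES = ("kan", "an", "i")
ROOTS = {"makan", "baca", "main", "ajar", "reka"}


def stem(word, roots=ROOTS):
    """Strip a prefix, a suffix, or both, accepting the first dictionary hit."""
    if word in roots:
        return word
    candidates = []
    for p in PREFIXES:
        if word.startswith(p):
            candidates.append(word[len(p):])
    for s in SUFFIXES:
        if word.endswith(s):
            candidates.append(word[:-len(s)])
    for p in PREFIXES:
        for s in SUFFIXES:
            if (word.startswith(p) and word.endswith(s)
                    and len(word) > len(p) + len(s)):
                candidates.append(word[len(p):-len(s)])
    for candidate in candidates:
        if candidate in roots:
            return candidate
    return word  # no rule produced a known root: leave the word as it is


def classify(word, predicted, gold):
    """Label one stemming outcome with a rough, length-based heuristic."""
    if predicted == gold:
        return "correct"
    if predicted == word:
        return "unchanged"
    if len(predicted) < len(gold):
        return "over-stemming"
    if len(predicted) > len(gold):
        return "under-stemming"
    return "other"


def evaluate(pairs):
    """Return (accuracy, Counter of outcome labels) for (word, gold) pairs."""
    counts = Counter(classify(w, stem(w), g) for w, g in pairs)
    return counts["correct"] / len(pairs), counts


# Hypothetical gold standard: (surface word, expected root).
GOLD = [
    ("makanan", "makan"),
    ("membaca", "baca"),
    ("dimakan", "makan"),
    ("bermain", "main"),
    ("mereka", "mereka"),   # a root that merely begins like the prefix "me"
    ("pelajar", "ajar"),    # needs an exception rule that this toy set lacks
]

if __name__ == "__main__":
    accuracy, counts = evaluate(GOLD)
    print(f"accuracy = {accuracy:.2%}")
    print(dict(counts))
```

In this toy setup, "mereka" is cut down to "reka" because the root happens to start with a prefix string, and "pelajar" is left unstemmed because no exception rule covers it. These are the kinds of affixation errors the abstract describes; a larger rule set or a fuller root dictionary would change which words fall into which category.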