Kurdish stemmer pre-processing steps for improving information retrieval

Mustafa, Arazo M.; Rashid, Tarik A.

doi:10.1177/0165551516683617

Cited by 20 publications

(10 citation statements)

References 17 publications

“…The Sanad of hadith has been ignored, and we focused on the preprocessing of text. The first step of text classification is to convert the text into clear words format, then into a vector [26], [28], After that, identifying the most common words and the most informative features in the dataset of hadith.…”

Section: B Hadith Text Pre-processing (mentioning)

confidence: 99%

Classification of Hadith According to Its Content Based on Supervised Learning Algorithms

2019


The Prophet's Hadith is of great importance to Muslims all over the world: it is the second source of Islam after the Qur'an and the fundamental resource of legislation in the Islamic community. This study focuses on automatically classifying hadith into categories according to content, based on the Hadith text. Its objective is to build a classifier model that can differentiate hadith categories and predict a hadith's topic, such as prayer, fasting, or zakat, using data mining and machine learning techniques. Many supervised learning algorithms, plus combination methods such as the stacking algorithm, were used to improve classification accuracy. The three best classifiers were evaluated in detail: Decision Tree (DT), Random Forest (RF), and Naïve Bayes (NB), which reached accuracies of up to 0.965, 0.956, and 0.951, respectively. Binary (Boolean) and TF-IDF term weighting were applied to weight each word in the hadith text, and the most significant features in the training dataset were identified using Information Gain (IG) and Chi-square (CHI). The experimental results showed that retraining these classifiers after applying IG and CHI as feature selection gave better accuracy than the previous results. In addition, the best classifier was DT: it achieved the highest accuracy in most test cases, with both Boolean and TF-IDF weighting, because it can deal with missing values and identify the most essential features in the training dataset, which is known as feature engineering.
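The two term-weighting schemes named in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tokenised toy documents, and the particular TF-IDF variant (term frequency normalised by document length, unsmoothed logarithmic IDF) are all assumptions made for illustration.

```python
import math
from collections import Counter


def binary_weights(docs):
    """Boolean weighting: 1 if the term occurs in the document, else 0."""
    vocab = sorted({t for d in docs for t in d})
    rows = []
    for d in docs:
        present = set(d)
        rows.append([1 if t in present else 0 for t in vocab])
    return vocab, rows


def tfidf_weights(docs):
    """TF-IDF weighting: (count / doc length) * log(N / document frequency)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    rows = []
    for d in docs:
        tf = Counter(d)
        if not d:
            rows.append([0.0] * len(vocab))
            continue
        rows.append([tf[t] / len(d) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


# Toy, already-tokenised documents (hypothetical data, not from the study).
docs = [["prayer", "fasting", "prayer"], ["zakat", "prayer"]]
vocab, boolean_rows = binary_weights(docs)
_, tfidf_rows = tfidf_weights(docs)
print(vocab)
print(boolean_rows)
print(tfidf_rows)
```

In this variant a term that occurs in every document gets a TF-IDF weight of zero, which is why such schemes are often paired with a separate feature-selection step such as IG or CHI.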

“…Significant cost reductions were also made by the system throughout the documentation and final approval of the reports in the imaging department. Kurdish stemmer pre-processing for improving information retrieval conducted by researcher in [13]. This article introduces the Kurdish stemming-step method.…”

Section: Related Work (mentioning)

confidence: 99%

“…Several studies have been done related to common languages such English [5], [6], Arabic [7]- [9], and Persian [10]- [12]. Moreover, there are few studies which are consummated regarding Kurdish language [13], [14], despite it, a huge gap can be seen in the case of Kurdish Kurmanji dialect; therefore, this study has been aimed to serve this gap due to Kurmanji dialect in the case of creating lemmatization and spell-checker with spell-correction system. Hence, in the future, this study can be used in several applications that include data translation, sentence retrieval, document retrieval, and also can be extend and upgrade to more powerful similar systems.…”

Section: Introduction (mentioning)

confidence: 99%

Kurdish Kurmanji Lemmatization and Spell-checker with Spell-correction

Mustafa, Nabi. 2023. UHD J SCI TECH


Many studies address lemmatization and spell-checking with spell-correction for English, Arabic, and Persian, but only a few exist for low-resource languages such as Kurdish, and more specifically for the Kurmanji dialect, which increases the need for such systems. Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form, whereas spell-checkers determine whether a word is correctly spelled and spell-correctors correct a range of spelling errors. This research presents a lemmatization system and a word-level error correction system for the Kurdish Kurmanji dialect, which are, to the best of our knowledge, the first tools for this dialect. The proposed lemmatization approach is built on morphological rules, and a hybrid approach that relies on an n-gram language model and the Jaccard coefficient similarity algorithm was applied to spell-checking and spell-correction. As detailed in this article, lemmatization achieves accuracy rates of 97.7% and 99.3% for nouns and verbs, respectively. For spell-checking and spell-correction, accuracy rates of 100% and 90.77%, respectively, are attained.
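As a rough illustration of how a Jaccard coefficient can rank spelling candidates, the sketch below compares sets of character bigrams. It is not the method of the cited article: the use of character n-grams (rather than a word-level n-gram language model), the `#` boundary padding, the function names, and the toy lexicon are all assumptions.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a word, padded with '#' at both ends."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suggest(word, lexicon, n=2, top_k=3):
    """Rank lexicon entries by n-gram Jaccard similarity to the input word."""
    target = char_ngrams(word, n)
    scored = [(jaccard(target, char_ngrams(c, n)), c) for c in lexicon]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for _, candidate in scored[:top_k]]


# Hypothetical toy lexicon and misspelled input, for illustration only.
lexicon = ["nan", "roj", "av"]
print(suggest("nann", lexicon))
```

A real spell-corrector would typically combine such a similarity score with context from a language model, as the hybrid approach described above suggests.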


“…Central Kurdish ( Sorani ) is one of two main dialects of the Kurdish language, it is generally thought that Sorani is spoken by about 9 to 10 million people in Iraq and Iran [ 1 , 2 ]. It is mainly written using a modified Arabic/Persian alphabet containing 34 characters, including characters that have been replaced in recent years like (ك) that's no longer been used by the Kurdish language and replaced with (ک).…”

Section: Data Description (mentioning)

confidence: 99%

An extensive dataset of handwritten central Kurdish isolated characters

et al. 2021


To collect handwritten samples of isolated Kurdish characters, each character was printed on a 14 × 9 grid on A4 paper. Each sheet carries only one printed character, so that the volunteers know which character to write on that sheet. Each sheet was then scanned, spliced, and cropped with a macro in Photoshop to ensure the same process was applied to all characters. The grids were filled in mainly by student volunteers from multiple universities in Erbil.
