Part-of-speech (POS) tagging plays a crucial role in many areas of natural language processing (NLP), including speech recognition, natural language parsing, information retrieval, and multi-word term extraction. This paper proposes an efficient and accurate POS tagging technique for Arabic based on a hybrid approach. Because of ambiguity, the Arabic rule-based method suffers from misclassified and unanalyzed words. To overcome these two problems, we propose a Hidden Markov Model (HMM) integrated with the Arabic rule-based method. Our POS tagger assigns one of three POS tags: Noun, Verb, or Particle. The proposed technique uses the contextual information of words together with a variety of features that help predict the POS classes. To evaluate its accuracy, the proposed method was trained and tested on two corpora: the Holy Quran Corpus and the Kalimat Corpus for undiacritized Classical Arabic. The experimental results demonstrate the efficiency of our method for Arabic POS tagging. On the Holy Quran Corpus, the obtained accuracy rates are 97.6%, 96.8%, and 94.4% for our Hybrid Tagger, the HMM Tagger, and the Rule-Based Tagger, respectively. On the Kalimat Corpus, we obtained 94.60%, 97.40%, and 98% for the Rule-Based Tagger, the HMM Tagger, and our Hybrid Tagger, respectively.
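The abstract does not say how the HMM and the rule-based method are integrated. One plausible reading, sketched below purely as an illustration, is that the rule-based stage proposes candidate tags for each word and an HMM (decoded with the Viterbi algorithm) picks among them, falling back to all three tags when a word is unanalyzed. The function names, the transliterated toy lexicon, and every probability here are assumptions made for the example, not values from the paper.

```python
import math

TAGS = ("N", "V", "P")  # Noun, Verb, Particle


def rule_candidates(word, lexicon):
    """Hypothetical rule-based stage: candidate tags for a word.

    An unanalyzed word (absent from the lexicon) keeps all tags open,
    so the HMM decides from context alone.
    """
    return lexicon.get(word) or set(TAGS)


def viterbi(words, candidates, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely tag sequence, restricted to each word's candidate tags."""
    # best[i][t] = (log-probability of the best path ending in t, previous tag)
    best = [{}]
    for t in candidates[0]:
        emit = emit_p[t].get(words[0], floor)
        best[0][t] = (math.log(start_p[t]) + math.log(emit), None)
    for i in range(1, len(words)):
        best.append({})
        for t in candidates[i]:
            emit = math.log(emit_p[t].get(words[i], floor))
            best[i][t] = max(
                (best[i - 1][p][0] + math.log(trans_p[p][t]) + emit, p)
                for p in best[i - 1]
            )
    # Backtrack from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


def hybrid_tag(words, lexicon, start_p, trans_p, emit_p):
    """Rule-based candidates first, HMM disambiguation second."""
    if not words:
        return []
    candidates = [rule_candidates(w, lexicon) for w in words]
    return viterbi(words, candidates, start_p, trans_p, emit_p)


# Toy, made-up model over transliterated undiacritized words.
lexicon = {"ktb": {"N", "V"}, "alwld": {"N"}, "fy": {"P"}}  # "albyt" is unanalyzed
start_p = {"N": 0.4, "V": 0.5, "P": 0.1}
trans_p = {
    "N": {"N": 0.3, "V": 0.2, "P": 0.5},
    "V": {"N": 0.7, "V": 0.05, "P": 0.25},
    "P": {"N": 0.85, "V": 0.1, "P": 0.05},
}
emit_p = {
    "N": {"ktb": 0.02, "alwld": 0.05, "albyt": 0.05},
    "V": {"ktb": 0.05},
    "P": {"fy": 0.3},
}

sentence = ["ktb", "alwld", "fy", "albyt"]
print(hybrid_tag(sentence, lexicon, start_p, trans_p, emit_p))
```

In this sketch the ambiguous word "ktb" is resolved by context and the unanalyzed word "albyt" still receives a tag, which mirrors the two problems (misclassified and unanalyzed words) that the abstract attributes to the rule-based method on its own.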
Text pre-processing of Arabic is a challenging and crucial stage in Text Categorization (TC) in particular and Text Mining (TM) in general. Stemming algorithms can be employed in Arabic text pre-processing to reduce words to their stems or roots. Arabic stemming algorithms fall into three categories: root-based approaches (e.g., Khoja), stem-based approaches (e.g., Larkey), and statistical approaches (e.g., N-Gram). However, no stemmer for this language is perfect: the existing stemmers have low efficiency. In this paper, in order to improve the accuracy of stemming, and therefore the accuracy of our proposed TC system, an efficient hybrid method is proposed for stemming Arabic text. The effectiveness of the four aforementioned methods was evaluated and compared in terms of the F-measure of the Naïve Bayesian classifier and the Support Vector Machine classifier used in our TC system. The proposed stemming algorithm was found to outperform the other stemming algorithms: the obtained results show that using the proposed stemmer greatly enhances the performance of Arabic Text Categorization.
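For readers unfamiliar with the metric, the F-measure used to compare the stemmers can be computed from a classifier's predictions as the harmonic mean of precision and recall. The sketch below is a minimal illustration; the macro-averaging over categories and the category names are assumptions, since the abstract does not state how per-category scores were combined.

```python
def f_measure(gold, predicted, label):
    """F-measure (harmonic mean of precision and recall) for one category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f_measure(gold, predicted):
    """Unweighted mean of per-category F-measures (an assumed averaging scheme)."""
    labels = sorted(set(gold))
    return sum(f_measure(gold, predicted, label) for label in labels) / len(labels)


# Hypothetical categories and predictions, for illustration only.
gold = ["sport", "sport", "economy", "economy"]
predicted = ["sport", "economy", "economy", "economy"]
print(round(macro_f_measure(gold, predicted), 4))
```

Running the same classifier on text pre-processed by each stemmer and comparing the resulting scores is the kind of comparison the abstract describes.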