Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages Common Issues and Resources - Semitic '07 2007
DOI: 10.3115/1654576.1654588

Arabic tokenization system

Abstract: Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-rounded process with a preprocessing stage (white space normalizer), and a post-processing stage (token filter). We also show how it handles multiword expressions, and how ambiguity is resolved.
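The three-stage pipeline the abstract describes (preprocessing, tokenization, post-processing) can be sketched roughly as follows; the function names, regexes, and filtering rules here are illustrative assumptions, not the paper's actual implementation:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Preprocessing stage (white space normalizer): collapse runs of
    spaces, tabs, and newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list:
    """Core stage (simplified stand-in): split on whitespace while
    separating punctuation marks into their own tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def token_filter(tokens: list) -> list:
    """Post-processing stage (token filter): drop empty tokens; a real
    filter would apply the paper's own cleanup rules."""
    return [t for t in tokens if t]

def run_pipeline(text: str) -> list:
    return token_filter(tokenize(normalize_whitespace(text)))

# Arabic input with irregular spacing; the Arabic comma becomes its own token.
print(run_pipeline("ذهب   الولد،  ثم  عاد."))
```

Note that `\w` in Python 3 is Unicode-aware by default, so Arabic letters are matched without extra flags.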


Cited by 71 publications (32 citation statements)
References 12 publications
“…Tokenization splits sentences into tokens, which can then be passed to a POS tagger or a morphological analyzer for further processing (Attia, 2007).…”
Section: Corpus and Pre-processing
confidence: 99%
“…Punctuation marks are among the most useful features for detecting the boundaries of sentences and tokens. The number of punctuation marks and symbols used in the Arabic corpus is 134 [31]. There are several methods of implementing tokenization; the simplest, which we used, is to extract any alphanumeric string between two white spaces.…”
Section: Methods
confidence: 99%
“…An Arabic token may consist of several lexical items, each with its own meaning and POS. For example, according to Attia (2007), an Arabic verb can comprise up to four clitics: a conjunction, a tense particle, a stem with affixes, and an object pronoun, as shown in the following example.…”
Section: The Tokenizer Module
confidence: 99%
“…Compared to what has been done in English and other languages, only one approach has been investigated for Arabic shallow parsing (Diab et al., 2004; 2007). Diab et al. (2004) performed tokenization and POS tagging and used an SVM-based approach for Arabic text chunking.…”
Section: Related Work
confidence: 99%