Tokenization is a fundamental step in natural language processing and can be seen as a preparation stage for all other natural language processing tasks. In this paper we propose a hybrid unsupervised method for Arabic tokenization, treated as a stand-alone problem. After segmenting sentences into words, we used the author's analyzer to produce all possible tokenizations for each word. Written rules and statistical methods are then applied to resolve the ambiguities, so that the output is a single tokenization for each word. The statistical method was trained on 29k manually tokenized words (data available from http://www.mimuw.edu.pl\~aliwy) from the Al-Watan 2004 corpus (available from http://sites.google.com/site/mouradabbas9/corpora). The final accuracy was 98.83%.
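The generate-then-disambiguate idea described in this abstract can be illustrated with a minimal sketch. It is not the author's analyzer: the clitic lists, the transliterated toy words, the minimum-stem-length rule, and the function names `candidates` and `disambiguate` are all assumptions made for illustration. The sketch enumerates every prefix + stem + suffix split of a word, filters candidates with one simple rule, and then picks the candidate whose stem is most frequent in a (hypothetical) manually tokenized sample.

```python
from collections import Counter

# Hypothetical toy clitic inventories in Latin transliteration.
# A real Arabic analyzer would use far richer morphological knowledge.
PREFIXES = ("w", "b", "l")
SUFFIXES = ("h", "hm")


def candidates(word):
    """Enumerate all (prefix, stem, suffix) splits; prefix/suffix may be empty."""
    out = []
    for p in ("",) + PREFIXES:
        if not word.startswith(p):
            continue
        rest = word[len(p):]
        for s in ("",) + SUFFIXES:
            if not rest.endswith(s):
                continue
            stem = rest[:len(rest) - len(s)]
            if stem:  # a candidate must keep a non-empty stem
                out.append((p, stem, s))
    return out


def disambiguate(word, stem_counts, min_stem_len=2):
    """Pick one tokenization: a rule filter first, then a frequency score."""
    # Rule step (assumed): discard candidates with very short stems.
    cands = [c for c in candidates(word) if len(c[1]) >= min_stem_len]
    if not cands:
        return (word,)
    # Statistical step (assumed): prefer the most frequent stem,
    # breaking ties in favour of the longer stem (fewer splits).
    best = max(cands, key=lambda c: (stem_counts[c[1]], len(c[1])))
    return tuple(part for part in best if part)


# Hypothetical stem frequencies, standing in for counts from tokenized data.
stem_counts = Counter({"wld": 5, "ktAb": 9})
print(disambiguate("wktAbh", stem_counts))  # prefix + stem + suffix
print(disambiguate("wld", stem_counts))     # kept whole: "wld" is a known stem
```

Keeping candidate generation separate from scoring mirrors the two stages the abstract describes, so either stage can be replaced without touching the other.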
Text classification (TC) is an essential field in both text mining (TM) and natural language processing (NLP). Humans tend to organize and categorize everything in order to make things easier to understand, and text classification is an important step toward this goal. Arabic text classification (ATC) is a difficult process because the Arabic language has complications and limitations resulting from the nature of its morphology. In this paper, a proposed approach called the Master-Slaves technique (MST) was used to improve Arabic text classification. It consists of two main phases. In the first phase, a new Arabic corpus of 16757 text files was collected and manually classified into five categories. In the second phase, four different classifiers were implemented on the collected corpus: Naïve Bayes (NB), K-Nearest Neighbour (KNN), Multinomial Logistic Regression (MLR) and Maximum Weight (MW). The Naïve Bayes classifier was implemented as the Master and the others as Slaves. The results of the Slave classifiers were used to change the probability of the Naïve Bayes classifier (Master). To check the effectiveness and efficiency of the proposed technique, the four classifiers were also implemented individually on the collected corpus, and a simple voting technique was applied among them. All tests were applied after pre-processing the Arabic text documents (tokenization, stemming, and stop-word removal), and each document was represented as a vector of weights. For the reliability of the results, 10-fold cross-validation was used. The results showed that the Master-Slaves technique gives a good improvement in the accuracy of text document classification, with acceptable algorithmic complexity compared to the other techniques.
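The abstract does not give the rule by which the Slave results change the Master's probabilities, so the following is only a sketch under an assumed rule: each Slave vote for a class multiplies the Master's probability for that class by a fixed boost, and the result is renormalized. The function name `master_slaves_predict`, the `boost` parameter, and the category labels are hypothetical.

```python
def master_slaves_predict(master_probs, slave_votes, boost=0.5):
    """Adjust the master's class probabilities using slave predictions.

    master_probs: dict mapping class label -> probability from the master
                  classifier (e.g. Naive Bayes).
    slave_votes:  list of class labels, one per slave classifier.
    boost:        assumed strength of each slave vote (not from the paper).
    """
    adjusted = {}
    for label, prob in master_probs.items():
        votes = sum(1 for v in slave_votes if v == label)
        # Assumed update rule: scale the master's probability by the votes.
        adjusted[label] = prob * (1.0 + boost * votes)

    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("master probabilities must not all be zero")
    normalized = {label: value / total for label, value in adjusted.items()}
    best = max(normalized, key=normalized.get)
    return best, normalized


# Hypothetical example: the master narrowly prefers "sport",
# but two of the three slaves predict "economy".
master_probs = {"sport": 0.45, "economy": 0.40, "culture": 0.15}
label, probs = master_slaves_predict(master_probs, ["economy", "economy", "sport"])
print(label, probs)
```

Unlike simple voting, this keeps the Master's confidence in play: the Slaves can overturn a narrow Master decision but not a strongly confident one, depending on the chosen `boost`.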
Arabic Sign Language (ArSL) is the natural language of the deaf community in Arabic countries. ArSL suffers from a lack of resources such as unified dictionaries and corpora. In this work, a dictionary from the Arabic language to ArSL has been constructed as part of a translation system. The Arabic words are converted into the Hamburg Notation System (HamNoSys) using the eSign editor software. HamNoSys was used to create the manual parameters (handshape, hand orientation, hand location, and hand movement), while the non-manual parameters (facial expressions, shoulder raising, mouthing gesture, head tilting, and body movement) were added using the mouth, face, and limbs options in the eSign editor software. The sign is then converted to a sign gesture markup language (SiGML) file, and a 3D avatar later interprets the SiGML file scripts to produce the animated sign. The constructed dictionary has three thousand signs; it can therefore be adopted for a translation system in which written text is transformed into sign language, and it can be used for the education of deaf people. The dictionary will be available as a free resource for researchers. This is hard and time-consuming work, but it is an essential step toward machine translation of whole Arabic text to ArSL with 3D animations.
The process of identifying the correct sense of a given word in a particular sentence is called Word Sense Disambiguation (WSD). It is a complex problem because it involves drawing knowledge from various sources. A significant amount of effort has been put into resolving this problem in machine learning since its inception, but the work is still ongoing. Many techniques have been used in WSD and implemented on different corpora for almost all languages. In this paper, WSD algorithms are classified into three categories: knowledge-based, supervised, and unsupervised techniques. Each category is studied in detail, with an explanation of almost all the algorithms in it. Worked examples for each method are given, together with the language used, the corpora used, and other factors. The benefits and drawbacks of each method are recorded. Some of these techniques have limitations in some situations; this work will therefore help researchers in the field of natural language processing to select suitable algorithms for their particular WSD problem. The novelty of the work lies in the comparison of the surveyed works and algorithms. From this work, it was concluded that (i) some methods give high accuracy for one language but low accuracy for another, (ii) the size of the data set affects the performance of the algorithm, (iii) some of these approaches run quickly but with limited accuracy, and (iv) most of these approaches have been implemented successfully for many languages.
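As an illustration of the knowledge-based category, the sketch below implements a simplified version of the Lesk algorithm, a well-known gloss-overlap method. The abstract itself does not name this algorithm, so its use here is an assumption, and the two-sense inventory for "bank", the stop-word list, and the sense identifiers are toy data invented for the example.

```python
# Small assumed stop-word list; real systems use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
             "that", "is", "at"}

# Hypothetical sense inventory: sense id -> dictionary-style gloss.
SENSES = {
    "bank": {
        "bank.finance": "an institution that accepts deposits and lends money",
        "bank.river": "the sloping land beside a river or lake",
    }
}


def tokens(text):
    """Lower-case, whitespace-split, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, sentence, senses=SENSES):
    """Return the sense whose gloss shares the most words with the context."""
    context = tokens(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:  # on ties, the first listed sense wins
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "he sat on the bank of the river"))
print(simplified_lesk("bank", "she deposits money at the bank"))
```

The method needs only a sense inventory and no training data, which is what separates knowledge-based approaches from the supervised and unsupervised ones; its accuracy depends heavily on how well the glosses overlap with real contexts.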