Arabic natural language processing (ANLP) is concerned with developing techniques and tools that can use and analyze the Arabic language in both written and spoken contexts. ANLP makes an important contribution to many existing systems. It provides Arabic and non-Arabic speakers with helpful and convenient tools that can be used in different domains. Modern ANLP tools are developed using machine learning (ML) techniques. ML algorithms are widely used in NLP because of their high accuracy regardless of the robustness of the data used and because of the ease with which they can be implemented. However, the methodology of ML-based ANLP applications involves several distinct phases. It is, therefore, crucial to understand these phases in detail, as well as the most widely used ML algorithms. This survey discusses these phases in detail, shows how ML techniques are involved in developing such tools, and identifies well-known techniques used in ANLP. Moreover, this survey discusses the characteristics and complexity of the Arabic language, in addition to the importance of and need for ANLP.
INDEX TERMS Arabic natural language processing, classification, feature selection, machine learning.
In recent years, Arabic Natural Language Processing, including Text Summarization, Text Simplification, Text Categorization and other Natural Language-related disciplines, has been attracting more researchers. Appropriate resources for Arabic Text Categorization are becoming a necessity for the development of this research. The few existing corpora are not ready for use; they require preprocessing and filtering operations. In addition, most of them are not organized according to standard classification methods, which leads to unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for Text Categorization. The corpus is composed of two existing corpora, OSAC and DAA. The new corpus is preprocessed and filtered using recent state-of-the-art methods. It is also organized based on the Dewey decimal classification scheme and the Synthetic Minority Over-Sampling Technique. The experimental results show that NADA is an efficient dataset ready for use in Arabic Text Categorization.
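The abstract above names the Synthetic Minority Over-Sampling Technique (SMOTE) as its way of balancing classes but does not say how it was applied to NADA. As a rough illustration of the general idea only, the following minimal sketch generates synthetic minority-class points by interpolating between a sample and one of its nearest minority-class neighbours. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours.

    Hypothetical helper for illustration; not the NADA authors' code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy feature vectors (assumed; e.g. term weights of under-represented documents).
minority_docs = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
new_docs = smote(minority_docs, n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. In practice a maintained implementation would likely be preferable to this sketch.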