Arabic natural language processing: An overview

Guellil, Imane; Saadane, Houda; Azouaou, Faiçal; Gueni, Billel; Nouvel, Damien

doi:10.1016/j.jksuci.2019.02.006

Cited by 106 publications

(66 citation statements)

References 58 publications

“…In other words, standard normalization systems of document length have traditionally ignored the issue of language peculiarities which has negative impacts on the validity and thus reliability of such methods. In natural language processing (NLP) of Arabic, the specific linguistic properties play a significant role in the success of NLP applications [5][6][7][8]. It is essential for NLP systems thus to consider the peculiarities of Arabic for more reliable results.…”

Section: Statement of the Problem (mentioning)

confidence: 99%

Document Length Variation in the Vector Space Clustering of News in Arabic: A Comparison of Methods

Omar; Hamouda. 2020. IJACSA

This article is concerned with addressing the effect of document length variation on measuring the semantic similarity in the text clustering of news in Arabic. Despite the development of different approaches for addressing the issue, there is no one strong conclusion recommending one approach. Furthermore, many of these have not been tested for the clustering of news in Arabic. The problem is that different length normalization methods can yield different analyses of the same data set, and that there is no obvious way of selecting the best one. The choice of an inappropriate method, however, has negative impacts on the accuracy and thus the reliability of clustering performance. Given the lack of agreement and disparity of opinions, we set out to comprehensively evaluate the existing normalization techniques to prove empirically which one is the best for the normalization of text length to improve the text clustering performance of news in Arabic. For this purpose, a corpus of 693 stories representing different categories and of different lengths is designed. Data is analyzed using different document length normalization methods along with vector space clustering (VSC), and then the analysis on which the clustering structure agrees most closely with the bibliographic information of the news stories is selected. The analysis of the data indicates that the clustering structure based on the byte length normalization method is the most accurate one. One main problem, however, with this method is that the lexical variables within the data set are not ranked which makes it difficult for retaining only the most distinctive lexical features for generating clustering structures based on semantic similarity. As thus, the study proposes the integration of TF-IDF for ranking the words within all the documents so that only those with the highest TF-IDF values are retained. 
It can be finally concluded that the proposed model proved effective in improving the function of the byte normalization method and thus on the performance and reliability of news clustering in Arabic. The findings of the study can also be extended to IR applications in Arabic. The proposed model can be usefully used in supporting the performance of the retrieval systems of Arabic in finding the most relevant documents for a given query based on semantic similarity, not document length.
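The two steps described in this abstract, length normalization followed by TF-IDF ranking, can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption: `byte_length_normalize` simply divides raw term counts by the document's UTF-8 byte length, `top_tfidf_terms` uses a plain `tf * log(N / df)` weighting, and both function names are hypothetical.

```python
import math
from collections import Counter


def byte_length_normalize(tokens):
    """Divide each term count by the document's UTF-8 byte length.

    Assumption: a simple stand-in for "byte length normalization";
    the paper's exact formula is not given in the abstract.
    """
    n_bytes = len(" ".join(tokens).encode("utf-8"))
    counts = Counter(tokens)
    return {term: count / n_bytes for term, count in counts.items()}


def top_tfidf_terms(docs, k):
    """Return the k terms with the highest TF-IDF score in any document.

    docs is a list of token lists. The score is (count / doc length)
    times log(N / document frequency), an assumed textbook variant.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    best = {}
    for doc in docs:
        for term, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / doc_freq[term])
            best[term] = max(best.get(term, 0.0), score)

    # Sort by descending score, breaking ties alphabetically.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["a", "c"]]
    print(top_tfidf_terms(docs, 2))
    print(byte_length_normalize(docs[0]))
```

In this sketch a term that occurs in every document scores zero and is dropped first, which mirrors the abstract's goal of retaining only the most distinctive lexical features.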

“…Many scientific papers have tackled the classification of Arabic morphology, and several reviews exist of the most cited Arabic morphological analyzers [4][5][6][7][8][9]. These studies have many shortcomings, including the following: 1) They are very general in their classification process, and most existing analyzers are classified under one category, "linguistic".…”

Section: Classification of Arabic Morphological Analysis Techniques (mentioning)

confidence: 99%

Arabic Morphological Analysis Techniques

Alothman; Al-Salman. 2020. IJACSA

Recently, activity surrounding Arabic natural language processing has increased significantly. Morphological analysis is the basis of most tasks related to Arabic natural language processing. There are many scientific studies on Arabic morphological analysis, yet most of them lack an accurate classification of Arabic morphology and fail to cover both recent and traditional techniques. This paper aims to survey Arabic morphological analysis techniques from 2005 to 2019 and to organize them into a reasonable and expandable classification system. To facilitate and support new research, this paper compares the currently available Arabic morphological analyzers, reaches certain conclusions, and proposes some promising directions for future research in Arabic morphological analysis.

“…Furthermore, here, words are often constructed in a complex manner. Moreover, it may contain affixes, agglutinative, and drop features [3, 4, 5, 7].…”

Section: Introduction (mentioning)

confidence: 99%

Preprocessing Arabic text on social media

Hegazi; Al-Dossari; Al-Yahy et al. 2021. Heliyon

Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most of the processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution for the challenges in preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resultant dataset presents the Arabic text in three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
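As an illustration of what the cleaning stage named in this abstract might involve, here is a minimal Python sketch. The abstract does not list the paper's cleaning rules, so the steps below are assumptions drawn from common Arabic preprocessing practice: dropping URLs and @-mentions, removing diacritics and tatweel, and folding alef variants. The function name `clean_arabic` is hypothetical.

```python
import re

# Short vowels, tanween, shadda, sukun (U+064B..U+0652) and superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Tatweel (kashida), the elongation character.
TATWEEL = "\u0640"
# Alef with madda, hamza above, or hamza below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# URLs and @-mentions typical of social media posts.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")


def clean_arabic(text):
    """Apply assumed cleaning steps; not the paper's actual pipeline."""
    text = URL_OR_MENTION.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)  # fold to bare alef
    return " ".join(text.split())  # collapse whitespace


if __name__ == "__main__":
    print(clean_arabic("مَرْحَبًا  @user http://x.co"))
```

Folding alef variants trades some orthographic detail for more consistent matching; whether that trade is appropriate depends on the downstream task.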
