Topic segmentation is important for many natural language processing applications, such as information retrieval and text summarization. In this work, we address the topic segmentation of textual documents. We survey related work, particularly C99 and TextTiling. We then propose adaptations of these topic segmenters for documents written in Arabic, named ArabC99 and ArabTextTiling. For the experiments, we construct an Arabic corpus from newspapers of different Arab countries. Finally, we evaluate the new segmenters by comparing them with each other and with related work using the WindowDiff and F-measure metrics.
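To make the WindowDiff metric mentioned above concrete, here is a minimal sketch assuming one common formulation: a window of k positions slides over the reference and hypothesis segmentations, and a window counts as an error when the two contain a different number of boundaries. The function name `window_diff` and the 0/1 boundary encoding are illustrative assumptions, not the implementation used in the paper.

```python
def window_diff(reference, hypothesis, k):
    """Fraction of sliding windows whose boundary counts disagree.

    reference, hypothesis: equal-length lists of 0/1 flags, where 1 marks
    a topic boundary after that position (an assumed encoding).
    k: window size, commonly set to about half the mean segment length.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    n = len(reference)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the sequence length")

    windows = n - k + 1
    errors = sum(
        1
        for i in range(windows)
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    )
    return errors / windows


reference = [0, 0, 1, 0, 0, 1, 0, 0]
hypothesis = [0, 0, 0, 1, 0, 1, 0, 0]  # first boundary placed one position late
score = window_diff(reference, hypothesis, k=3)
```

In this example two of the six windows disagree, so a near-miss boundary is penalized less than a missing one would be. Lower scores are better, and identical segmentations score 0.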
This paper focuses on topic identification for the Arabic language based on topic models. We study Latent Dirichlet Allocation (LDA) as an unsupervised method for Arabic topic identification. An in-depth study of LDA is carried out at two levels: the stemming process and the choice of LDA hyper-parameters. At the first level, we study the effect of different Arabic stemmers on LDA. At the second level, we focus on the LDA hyper-parameters α and β and their impact on topic identification. This study shows that LDA is an efficient method for Arabic topic identification, especially with the right choice of hyper-parameters. Another important result is the strong impact of the stemming algorithm on topic identification.
This article focuses on topic identification for the Arabic language based on topic models. We study Latent Dirichlet Allocation (LDA) as an unsupervised method for topic identification. An in-depth study of LDA was carried out at two levels: the stemming process and the choice of hyper-parameters. At the first level, we study the effect of different stemmers on LDA. At the second level, we focus on the LDA hyper-parameters α and β and their impact on identification. This study shows that LDA is an effective method for Arabic topic identification, especially with the right choice of hyper-parameters. Another important result is the strong impact of the stemming algorithm on topic identification.
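To illustrate why the choice of hyper-parameters matters, the sketch below assumes the common convention in which α is the concentration of a symmetric Dirichlet prior over per-document topic mixtures (β plays the same role for per-topic word distributions). It draws topic mixtures for several concentrations and reports how dominant the largest topic is on average. The helper names and the chosen values are illustrative assumptions and do not come from the study.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def mean_largest_share(alpha, k=5, n=2000, seed=0):
    """Average weight of the dominant topic over n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: mean largest topic share = {mean_largest_share(alpha):.2f}")
```

Small concentrations yield mixtures dominated by one topic, while large ones yield nearly uniform mixtures, which is one reason the identified topics can be sensitive to α and β.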
Topic segmentation is one of the pillars of Natural Language Processing. Yet there is a notable research gap in this field as far as the Arabic language is concerned. The purpose of this article is to improve Arabic Topic Segmentation (ATS) by investigating two segmenters: ArabC99 and ArabTextTiling. The study is carried out at two independent levels, which correspond to the basic steps of topic segmentation: the pre-processing level and the segmentation level. At the pre-processing level, we examine the effect of different Arabic stemming algorithms on ATS and find that Light10 is more appropriate for the pre-processing step. Based on this conclusion, we proceed to the second level by proposing two Arabic segmenters called ArabC99-LS-LSA and ArabTextTiling-LS-LSA. Both use external semantic knowledge based on Latent Semantic Analysis (LSA). The evaluation results show that LSA improves segmentation. Hence, the main outcome of this article is a multilevel improvement of ATS based on Light10 and LSA.
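As a rough illustration of the segmentation level, the following is a simplified sketch of the lexical-cohesion idea behind TextTiling-style segmenters: adjacent blocks of sentences are compared by cosine similarity, and a boundary is proposed where the similarity curve forms a deep valley. It is not ArabTextTiling or its LSA variant; it assumes sentences are already tokenized and stemmed, and the function names, block size, and threshold are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, block_size=2):
    """Similarity of the blocks to the left and right of each sentence gap."""
    scores = []
    for gap in range(len(sentences) - 1):
        left = sentences[max(0, gap - block_size + 1):gap + 1]
        right = sentences[gap + 1:gap + 1 + block_size]
        scores.append(cosine(Counter(t for s in left for t in s),
                             Counter(t for s in right for t in s)))
    return scores


def depth_scores(scores):
    """Depth of each valley relative to the nearest peaks on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for j in range(i - 1, -1, -1):  # climb left until the curve drops
            if scores[j] < left:
                break
            left = scores[j]
        right = s
        for j in range(i + 1, len(scores)):  # climb right until the curve drops
            if scores[j] < right:
                break
            right = scores[j]
        depths.append((left - s) + (right - s))
    return depths


def segment(sentences, block_size=2, threshold=0.5):
    """Indices i such that a boundary is proposed after sentence i."""
    depths = depth_scores(gap_scores(sentences, block_size))
    return [i for i, d in enumerate(depths) if d > threshold]


# Toy input: each sentence is a list of already-stemmed tokens.
sentences = [["cat", "pet"], ["cat", "food"], ["pet", "cat"],
             ["stock", "market"], ["market", "price"], ["stock", "price"]]
boundaries = segment(sentences)
```

On this toy input the only deep valley falls between the third and fourth sentences, where the vocabulary changes. Swapping the raw term counts for LSA-derived vectors would be one way to inject the external semantic knowledge the abstract describes.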