Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage 2019
DOI: 10.1145/3322905.3322927

Arabic-SOS

Abstract: While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard…

Cited by 3 publications (2 citation statements)
References 7 publications

“…The topic models are then built through stems produced by the Arabic-SOS tools, where the Gradient Boosting Machine Learning algorithm is used for word segmentation, resulting in an accuracy rate of 98.8%. Next, the Mallet topic modeling toolkit (Mohamed and Sayyed 2019) within these models, topical words, having maximally higher frequency rates than others, are taken to be thematically correlating into clusters (themes). Through 50 and 100-topic modeling, we get a thematic distribution map, where the high-probability list co-concurrences (or near collocates) guide our close investigation of the distribution of such top words and reading of the selected texts to see how the surfacing topics cohere into meaningful patterns.…”
Section: Methods (mentioning)
confidence: 99%
“…For the purposes of morphological segmentation, we use the Arabic-SOS package [17], which is specialized in Classical and pre-Modern Arabic. Arabic-SOS reports an accuracy of 99.5%, and we can confirm this very high accuracy on the corpus used in this article.…”
Section: A Note On the Arabic Language (mentioning)
confidence: 99%