Proceedings of the Fourth Arabic Natural Language Processing Workshop 2019
DOI: 10.18653/v1/w19-4615

Morphologically Annotated Corpora for Seven Arabic Dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan

Abstract: We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis…

Cited by 13 publications (8 citation statements)
References 9 publications
“…NLP has several difficulties when working with dialectal Arabic. Arabic has a rich temporal affixal and inflectional morphology with several classes of attachable clitics (Al-Shargi et al, 2016). Arabic is the official language and is widely spoken in Yemeni official institutions (Al-Hamzi, 2021).…”
Section: The Yemeni Linguistic Situation (mentioning)
confidence: 99%
“…A corpus of 200K tokens was morphologically annotated covering seven different Arabic dialects including Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi, and Moroccan (Alshargi et al, 2019). The GUMAR Emirati corpus consists of 200K tokens collected from novels.…”
Section: Dialectal Arabic Resources (mentioning)
confidence: 99%
“…Dialectal Arabic (DA) content dominates informal writings in emails, social media, blogs, and social messaging. Interest in building computational resources for Arabic dialects has been in the rise to provide both (i) annotated corpora (Jarrar et al, 2022b; Alshargi et al, 2019; Bouamor et al, 2018; Jarrar et al, 2017; Al-Shargi et al, 2016; Zribi et al, 2015; Jarrar et al, 2014) and (ii) morphological dialect analyzers (Obeid et al, 2020; Pasha et al, 2014; Zribi et al, 2017; Abdul-Mageed et al, 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…Al-Shargi et al. (2016) presented morphologically annotated corpora for Moroccan and Sanaani Yemeni Arabic. The corpora data were collected from both online and print materials such as internet comments, forums, oral interviews, folktales, sermons, textbooks, blogs and Facebook posts.…”
Section: NLP Resources For Arabic Dialects (mentioning)
confidence: 99%