Proceedings of the Third Arabic Natural Language Processing Workshop 2017
DOI: 10.18653/v1/w17-1306

A Neural Architecture for Dialectal Arabic Segmentation

Abstract: The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmentation of words into their constituent tokens is an important processing step for natural language processing. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks without any normalization or reliance on lexical features or linguistic resources. We deal with segmentation as a sequence labeling pro…
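To make the "segmentation as sequence labeling" framing concrete, here is a minimal sketch of one common way to cast word segmentation as per-character labeling. The abstract is truncated before it describes the labeling scheme, so the two-tag B/I scheme, the `+` separator, the function names, and the Latin-transliterated example word are all assumptions for illustration, not the authors' tag set or code.

```python
# Illustrative only: a generic B/I character-labeling scheme for segmentation.
# The tag set, separator, and function names are hypothetical, not from the paper.

def segmentation_to_labels(segmented, sep="+"):
    """Turn a segmented word such as 'w+y+ktb+w' into characters and B/I labels.

    'B' marks the first character of a segment, 'I' marks any other character.
    """
    chars, labels = [], []
    for segment in segmented.split(sep):
        for i, ch in enumerate(segment):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def labels_to_segmentation(chars, labels, sep="+"):
    """Rebuild the segmented string from characters and their B/I labels."""
    out = []
    for ch, label in zip(chars, labels):
        # A 'B' label on any character after the first starts a new segment.
        if label == "B" and out:
            out.append(sep)
        out.append(ch)
    return "".join(out)


chars, labels = segmentation_to_labels("w+y+ktb+w")
restored = labels_to_segmentation(chars, labels)
```

Under this framing, a sequence model only has to predict one label per character, and the segmented form is recovered deterministically from the predicted labels.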

Cited by 53 publications (17 citation statements)
References 19 publications
“…It has been noted that stemming using Farasa improved the supervised sentiment classification performance in TEC, TAC datasets (Table 2, Table 4) where it achieved the second best F1-score (85.9%) in TAC outperforming the baseline by 18.6%. Although Farasa was trained with MSA corpora, it succeeded in identifying the affixes to be cut in Tunisian words because of the lexical overlap between MSA and Arabic dialects in general [17]. In order to retain the variety of words having same root and different meanings, we have also used light stemming.…”
Section: Results (mentioning)
confidence: 99%
“…It does not generalize well to other domains [8]. It has been shown by [12] that sequence to sequence mapping models work well in segmenting dialectical Arabic. In this work, we use a modified version of this second method.…”
Section: Segmentation Methods (mentioning)
confidence: 94%
“…The corpus contains about 200,000 words with POS categories, lemmatisation, English glosses and dialect identification annotations. Samih et al (2017) used a corpus, containing 350 tweets with more than 8000 words including 3000 words written in the Egyptian dialect. They manually annotated each word to provide: stem, lemma, CODA-compliant writing (Habash et al 2012a), segmentation and POS tags, as well as the corresponding MSA word, MSA segmentation and MSA POS tags.…”
Section: Morphological Analysis and POS Tagging (mentioning)
confidence: 99%