2021
DOI: 10.1145/3434235

Arabic Diacritic Recovery Using a Feature-rich biLSTM Model

Abstract: Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: The first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often di…

Citations: cited by 10 publications (5 citation statements)
References: 36 publications
“…Recurrent long short-term memory (LSTM) networks [12] have been proven to be suitable tools for learning the task entirely from data without using manually designed features [13,14,15]. Their combination with conditional random field (CRF) and the extension to sequence-to-sequence modeling [16] help the model performance [17,18].…”
Section: Related Work (mentioning)
confidence: 99%
“…The accuracy of the model is measured as the percentage of correct predictions of the Arabic syntactic diacritics (i.e., the last character of the stem of each word). We evaluated the trained models using the complement of the accuracy which is called the case-ending error rate (CEER) [3].…”
Section: Methods (mentioning)
confidence: 99%
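
The metric described in the statement above can be sketched as follows. This is a minimal illustration, assuming gold and predicted case endings are given as parallel lists with one label per word. The function name `case_ending_error_rate` and the label strings are hypothetical and do not come from the cited papers.

```python
from typing import Sequence


def case_ending_error_rate(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return CEER = 1 - accuracy over per-word case-ending labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one word is required")
    # Count the words whose predicted case ending matches the gold label.
    correct = sum(g == p for g, p in zip(gold, predicted))
    accuracy = correct / len(gold)
    return 1.0 - accuracy


# Hypothetical labels for a four-word sentence (illustrative only).
gold = ["fatha", "damma", "kasra", "sukun"]
predicted = ["fatha", "damma", "fatha", "sukun"]
print(f"CEER: {case_ending_error_rate(gold, predicted):.2%}")  # prints "CEER: 25.00%"
```

One of the four case endings is wrong here, so accuracy is 75% and CEER is 25%.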
“…Darwish et al [3] combined a set of character-level features such as stem and part-of-speech (POS) tags with an embedding layer and a bidirectional LSTM. The proposed system restores core word and case ending diacritization.…”
Section: Related Work (mentioning)
confidence: 99%