Investigating Input and Output Units in Diacritic Restoration

Alqahtani, Sawsan; Diab, Mona

doi:10.1109/icmla.2019.00142

“…Zalmout and Habash (2019a) obtained an additional boost in performance (0.3% improvement over ours) when they add a dialect variant of Arabic in the learning process, sharing information between both languages. Alqahtani and Diab (2019a) provides comparable performance to ALL and better performance on some task combinations in terms of WER on all and OOV words. The difference between their model and our BASE model is the addition of a CRF (Conditional Random Fields) layer, which incorporates dependencies in the output space at the cost of the model's computational efficiency (memory and speed).…”

Section: Input Representation (mentioning)

confidence: 97%

“…Maximum Entropy and Support Vector Machine) (Zitouni and Sarikaya, 2009; Pasha et al., 2014) or neural-based approaches for different languages that include diacritics such as Arabic, Vietnamese, and Yoruba. Neural-based approaches yield state-of-the-art performance for diacritic restoration by using Bidirectional LSTM or temporal convolutional networks (Zalmout and Habash, 2017; Orife, 2018; Alqahtani and Diab, 2019a).…”

Section: Related Work (mentioning)

confidence: 99%

“…As a preprocessing step, the words are converted to their constituents (e.g. morphemes, lemmas, or n-grams) and then diacritic restoration models are built on top of that (Ananthakrishnan et al., 2005; Alqahtani and Diab, 2019b). Ananthakrishnan et al. (2005) use POS tags to improve diacritic restoration at the syntax level, assuming that POS tags are known at inference time.…”

Section: Related Work (mentioning)

confidence: 99%

A Multitask Learning Approach for Diacritic Restoration

Alqahtani¹, Mishra², Diab³

2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Self Cite

In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in a more ambiguous text, making computational processing on such text more difficult. Diacritic restoration is the task of restoring missing diacritics in the written text. Most state-of-the-art diacritic restoration models are built on character-level information, which helps generalize the model to unseen data, but presumably lose useful information at the word level. Thus, to compensate for this loss, we investigate the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to the state-of-the-art models that are more complex, relying on morphological analyzers and/or a lot more data (e.g. dialectal data).
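To make the task setup concrete, here is a minimal Python sketch of how undiacritized input can be derived from diacritized text, and how a word error rate (WER) of the kind mentioned in the citation statements might be computed. The function names `strip_diacritics` and `word_error_rate` are hypothetical, and the one-to-one word alignment is an assumption; the exact evaluation protocol used in the cited papers is not specified here.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop nonspacing combining marks (Unicode category Mn).

    The Arabic short-vowel marks, shadda, and sukun (U+064B..U+0652)
    all fall in this category, so this yields the undiacritized form.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def word_error_rate(reference: list[str], predicted: list[str]) -> float:
    """Fraction of words whose predicted diacritized form differs from the reference.

    Assumes restoration only adds marks, so both sequences have the same
    number of words and can be compared position by position.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        return 0.0
    errors = sum(ref != pred for ref, pred in zip(reference, predicted))
    return errors / len(reference)


# The word "kataba" written with three fatha marks, using escapes for clarity.
diacritized = "\u0643\u064e\u062a\u064e\u0628\u064e"
undiacritized = strip_diacritics(diacritized)

# Toy sequences standing in for diacritized words: one of four is wrong.
toy_wer = word_error_rate(["a", "b", "c", "d"], ["a", "x", "c", "d"])
```

A restoration model would take `undiacritized` as input and be scored against `diacritized`; under this sketch a word counts as an error if any of its marks is wrong.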

A Multitask Learning Approach for Diacritic Restoration

Alqahtani¹, Mishra², Diab³

2020

Preprint

Self Cite

In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in a more ambiguous text, making computational processing on such text more difficult. Diacritic restoration is the task of restoring missing diacritics in the written text. Most state-of-the-art diacritic restoration models are built on character-level information, which helps generalize the model to unseen data, but presumably lose useful information at the word level. Thus, to compensate for this loss, we investigate the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to the state-of-the-art models that are more complex, relying on morphological analyzers and/or a lot more data (e.g. dialectal data).

* The work was conducted while the author was with AWS, Amazon AI.

1 Diacritics are marks that are added above, below, or in between the letters to compose a new letter or characterize the letter with a different sound (Wells, 2000).
