Findings of the Association for Computational Linguistics: ACL 2022
DOI: 10.18653/v1/2022.findings-acl.135

Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects

Abstract: We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models. Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study, achieving 2.6% absolute improvement over the previous state-of-the-art in Modern Standard Arabic, 2.8% in Gulf, 1.6% in Egyptian, and 8.3% in Levantine. We explore different training setups for fine-tuning pre-trained transformer language mo…
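The abstract frames morphosyntactic tagging as fine-tuning a pre-trained transformer for token-level classification. As a rough illustration of that setup, here is a minimal sketch using the Hugging Face transformers API with the public CAMeLBERT-Mix checkpoint; the toy label set, the subword label-alignment policy, and everything else below are my assumptions, not the paper's actual configuration.

```python
# Minimal sketch: morphosyntactic tagging as token classification by
# fine-tuning a pre-trained Arabic transformer (assumed setup, not the
# paper's exact recipe).
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "CAMeL-Lab/bert-base-arabic-camelbert-mix"  # public checkpoint
LABELS = ["noun", "verb", "adj", "prep", "punc"]         # toy tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={l: i for i, l in enumerate(LABELS)},
)

def encode(words, word_tags):
    """Tokenize a pre-split sentence and project word-level tags onto
    subword tokens; non-first subwords get -100 so the loss skips them."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        if wid is None or wid == prev:
            labels.append(-100)
        else:
            labels.append(LABELS.index(word_tags[wid]))
        prev = wid
    enc["labels"] = labels
    return enc

# Example: one (words, tags) pair ready for a standard Trainer loop.
features = encode(["الطقس", "جميل", "."], ["noun", "adj", "punc"])
```

The encoded features can be fed to any standard fine-tuning loop (e.g., the transformers Trainer); the -100 sentinel is the conventional way to mask padding and continuation subwords out of the cross-entropy loss.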

Cited by 15 publications (21 citation statements)
References 21 publications
“…These thoughts are valid to a reasonable extent. However, recent studies demonstrated that training a language understanding model on larger corpora/datasets might not necessarily improve its performance [62], [63]. Besides, our intention here is not to use ATSC to train a language understanding model but rather to use it as a benchmarking corpus for testing the generalization of a pre-trained text-to-text generation model.…”
Section: Discussion and Potential Threats to Validity
confidence: 99%
“…The syntactic component of the AraDiaWER assigns morphological and lexical tags to the reference and predicted transcripts. This is achieved by using a BERT-based disambiguation model (Inoue et al., 2021) out of the box, which uses a pre-trained CAMeLBERT-Mix language model to classify the morphological and lexical features of an input sequence. First, we use the CAMeLBERT-Mix LM to determine the syntactic tag.…”
Section: Syntactic Component
confidence: 99%
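For context, here is a minimal sketch of how such a disambiguator can be queried for morphological tags via the camel_tools toolkit that accompanies Inoue et al. (2021). The class name and return structure follow my reading of the camel_tools documentation; they are assumptions, not the AraDiaWER implementation.

```python
# Sketch (assumed API): top-1 morphological analysis per word via the
# CAMeLBERT-based disambiguator in camel_tools (Inoue et al., 2021).
from camel_tools.disambig.bert import BERTUnfactoredDisambiguator

disambiguator = BERTUnfactoredDisambiguator.pretrained()

# Whitespace tokenization for illustration only.
sentence = "الطقس جميل اليوم".split()

# Each word receives a ranked list of scored analyses; the top analysis
# holds the predicted features, including the part-of-speech tag.
disambiguated = disambiguator.disambiguate(sentence)
pos_tags = [d.analyses[0].analysis["pos"] for d in disambiguated]
print(list(zip(sentence, pos_tags)))
```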
“…Given the inconsistencies in some of the ARZATB entries, we used a version of ARZATB that was automatically synchronized with a combination of EGY and MSA analyzers as our reference. This version was reported on in previous publications (Pasha et al., 2014; Zalmout and Habash, 2019; Inoue et al., 2022). For recall evaluation, we also use CAMELMORPH EGY and MSA together in a similar manner, with preference towards EGY if an imperfect tie (i.e., not all analysis features match) is reached.…”
Section: EGY Recall Evaluation and Error Analysis
confidence: 99%
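The tie-breaking rule in this excerpt can be made concrete with a small hypothetical sketch: score each analyzer's candidate by how many gold features it matches, and on an imperfect tie prefer the EGY analysis. The function, feature dictionaries, and scoring scheme below are illustrative assumptions, not CAMELMORPH's actual evaluation code.

```python
# Hypothetical illustration of the EGY-preference tie-break described
# above; all names and the matching scheme are assumptions.
def pick_analysis(gold, egy_analysis, msa_analysis):
    """Each argument is a dict of morphological features,
    e.g. {"pos": "noun", "gen": "m", "num": "s"}."""
    def n_matched(analysis):
        return sum(analysis.get(feat) == val for feat, val in gold.items())

    egy_score = n_matched(egy_analysis)
    msa_score = n_matched(msa_analysis)
    if egy_score == msa_score:
        # Tied scores (including the imperfect case where not all
        # features match): prefer the EGY analysis.
        return "EGY", egy_analysis
    return ("EGY", egy_analysis) if egy_score > msa_score else ("MSA", msa_analysis)

gold = {"pos": "noun", "gen": "m", "num": "s"}
print(pick_analysis(gold, {"pos": "noun", "gen": "f"},
                    {"pos": "verb", "gen": "m"}))  # imperfect tie -> EGY
```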
“…The work on Arabic computational morphology has led to the development of many resources that directly model morphology (e.g., analyzers, generators) as well as resources and tools that build on them (Maamouri et al., 2004; Pasha et al., 2014). Morphological analyzers have consistently shown that they are still valuable components in the NLP toolbox, even as the field increasingly shifts toward neural modeling, especially in low-resource and dialectal settings (Zalmout and Habash, 2017; Baly et al., 2017; Inoue et al., 2022).…”
Section: Introduction
confidence: 99%