ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD

Al-Hajj, Moustafa; Jarrar, Mustafa

doi:10.26615/978-954-452-072-4_005

Cited by 16 publications

(19 citation statements)

References 18 publications


“…We plan to increase the size of our corpus to cover additional Levantine sub-dialects, especially those of other Levantine areas, most notably some of Syria's dialectal varieties. We also plan to use this corpus to develop morphological analyzers and word-sense disambiguation system for Levantine Arabic as we did for MSA (see (Al-Hajj and Jarrar, 2021a; Al-Hajj and Jarrar, 2021b)). Additionally, we plan to build on the Palestinian and Lebanese dialect lemmas to develop a Levantine-MSA-English Lexicon and extend it with synonyms (Jarrar et al, 2021).…”

Section: Discussion (mentioning)

confidence: 99%

Curras + Baladi: Towards a Levantine Corpus

Haff, Jarrar, Hammouda et al. 2022

Preprint

Self Cite


This paper presents two-fold contributions: a full revision of the Palestinian morphologically annotated corpus (Curras), and a newly annotated Lebanese corpus (Baladi). Both corpora can be used as a more general Levantine corpus. Baladi consists of around 9.6K morphologically annotated tokens. Each token was manually annotated with several morphological features and using LDC's SAMA lemmas and tags. The inter-annotator evaluation on most features illustrates 78.5% Kappa and 90.1% F1-Score. Curras was revised by refining all annotations for accuracy, normalization and unification of POS tags, and linking with SAMA lemmas. This revision was also important to ensure that both corpora are compatible and can help to bridge the nuanced linguistic gaps that exist between the two highly mutually intelligible dialects. Both corpora are publicly available through a web portal.


“…WSD is the most common task, which aims to disambiguate word's semantics. Given a context (i.e., sentence), a target word in the context, and a set of candidate senses (i.e., glosses, meaning definitions (Jarrar, 2006)) for the target word, the goal of the WSD task is to determine which of these senses is the intended meaning for the target word (Al-Hajj and Jarrar, 2022). For example, the word ( ǧdāwl ) has two senses in Arabic: tables ( ) and creek (…”

Section: Introduction (mentioning)

confidence: 99%

“…Such semantic understanding tasks have been challenging for many years, but recently gained attention due to the advances in contextualized word embedding models Jarrar, 2022, 2021). Language models, specially BERT (Kenton and Toutanova, 2019), have made significant advancements in down-streaming NLP tasks.…”

Section: Introduction (mentioning)

confidence: 99%

“…It can be fine-tuned on domain/task-specific data (e.g., POS tagging, WSD, TSV, and WiC) to update its contextualized embeddings. The TSV task has been addressed by fine-tuning BERT on context-gloss pairs as a sentence pair binary classification problem (Huang et al, 2019; Yap et al, 2020; Patel et al, 2021; Ranjbar and Zeinali, 2021; Lin and Giambi, 2021; El-Razzaz et al, 2021; Al-Hajj and Jarrar, 2022). However, the TSV task, similar to most NLP tasks, suffers from the knowledge-gain bottleneck, i.e., the lack of available quality datasets to train machine learning models.…”

Section: Introduction (mentioning)

confidence: 99%
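
The statement above frames target sense verification as binary classification over (context, gloss) pairs. As a minimal sketch of that framing only, and not the authors' pipeline, the Python below builds labeled pairs from a toy sense inventory and then picks a sense from per-pair scores. The inventory, the English example, and the function names `build_pairs` and `pick_sense` are all hypothetical; the scores stand in for the output of a fine-tuned classifier, which is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical toy sense inventory (lemma -> glosses); not taken from ArabGlossBERT.
SENSES: Dict[str, List[str]] = {
    "bank": [
        "a financial institution that accepts deposits",
        "the sloping land beside a river",
    ],
}


def build_pairs(context: str, lemma: str, correct_gloss: str,
                senses: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Pair one context with every candidate gloss; label 1 only for the correct gloss."""
    return [(context, gloss, int(gloss == correct_gloss)) for gloss in senses[lemma]]


def pick_sense(scores: List[float], glosses: List[str]) -> str:
    """Reduce WSD to pair scoring: return the gloss whose pair scored highest."""
    best = max(range(len(glosses)), key=lambda i: scores[i])
    return glosses[best]


# One positive pair and one negative pair for a single context.
pairs = build_pairs(
    "She sat on the bank of the river.", "bank", SENSES["bank"][1], SENSES
)
```

Each context yields one positive pair and as many negative pairs as there are other candidate glosses, which is one reason such datasets tend to contain more negative than positive examples.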

“…Arabic is a low-resourced language (Darwish et al, 2021; and the only available context-gloss pairs dataset is ArabGlossBERT (Al-Hajj and Jarrar, 2022). It consists of 167K context-gloss pairs, a relatively small dataset for fine-tuning BERT on a TSV task.…”

Section: Introduction (mentioning)

confidence: 99%


Context-Gloss Augmentation for Improving Arabic Target Sense Verification

Malaysha, Jarrar, Khalilia 2023

Preprint


Arabic language lacks semantic datasets and sense inventories. The most common semantically-labeled dataset for Arabic is the ArabGlossBERT, a relatively small dataset that consists of 167K context-gloss pairs (about 60K positive and 107K negative pairs), collected from Arabic dictionaries. This paper presents an enrichment to the ArabGlossBERT dataset, by augmenting it using (Arabic-English-Arabic) machine back-translation. Augmentation increased the dataset size to 352K pairs (149K positive and 203K negative pairs). We measure the impact of augmentation using different data configurations to fine-tune BERT on target sense verification (TSV) task. Overall, the accuracy ranges between 78% to 84% for different data configurations. Although our approach performed at par with the baseline, we did observe some improvements for some POS tags in some experiments. Furthermore, our fine-tuned models are trained on a larger dataset covering larger vocabulary and contexts. We provide an in-depth analysis of the accuracy for each part-of-speech (POS).
