Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.223
On Romanization for Model Transfer Between Scripts in Neural Machine Translation

Abstract: Transfer learning is a popular strategy to improve the quality of low-resource machine translation. For an optimal transfer of the embedding layer, the child and parent model should share a substantial part of the vocabulary. This is not the case when transferring to languages with a different script. We explore the benefit of romanization in this scenario. Our results show that romanization entails information loss and is thus not always superior to simpler vocabulary transfer methods, but can improve the tra…
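The abstract's core idea can be illustrated with a toy sketch (not the paper's implementation): when parent and child languages use different scripts, their vocabularies share no surface forms, so the embedding layer cannot transfer directly; romanizing the child vocabulary can recover overlap. The character table below is a deliberately tiny, hypothetical Cyrillic-to-Latin mapping chosen only for illustration.

```python
# Toy illustration of romanization for vocabulary overlap.
# The mapping is a hypothetical subset of a Cyrillic-to-Latin scheme,
# not any standard transliteration table.
CYRILLIC_TO_LATIN = {
    "к": "k", "о": "o", "т": "t", "д": "d", "м": "m", "а": "a",
}

def romanize(token):
    """Map each character through the toy table; pass unknowns through."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in token)

parent_vocab = {"kot", "dom", "dot"}   # parent model (Latin script)
child_vocab = {"кот", "дом", "мат"}    # child language (Cyrillic script)

overlap_raw = parent_vocab & child_vocab
overlap_rom = parent_vocab & {romanize(t) for t in child_vocab}

print(len(overlap_raw))  # 0 — no shared vocabulary across scripts
print(len(overlap_rom))  # 2 — "кот"→"kot" and "дом"→"dom" now match
```

The sketch also hints at the information loss the abstract mentions: a many-to-one romanization table can merge distinct source characters into the same Latin string, so the mapping is not always invertible.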

Cited by 15 publications (16 citation statements). References 22 publications.
“…Our findings are generally in line with previous work. Transliteration to English specifically (Lin et al., 2016; Durrani et al., 2014) and named-entity transliteration (Kundu et al., 2018; Grundkiewicz and Heafield, 2018) have proven useful for cross-lingual transfer in tasks like NER, entity linking (Rijhwani et al., 2019), morphological inflection (Murikinati et al., 2020), and machine translation (Amrhein and Sennrich, 2020).…”
Section: Transfer Via Transliteration
confidence: 99%
“…Additionally, when comparing our findings and anticipated AMT results to other language pairs such as Czech-English and vice versa, we observe the following:
- Although both languages, Arabic and Czech, are morphologically rich [68], have free word order, and share most of the same technical challenges, the latest WMT20 findings [128] show constant improvement in the baseline systems for Czech MT, in contrast to AMT.
- Romanization-based models [106], [129] reduce vocabulary sizes and UNK rates and achieve better or comparable translation results compared with their Arabic-script counterparts under various segmentation scenarios. The advantage of romanization at the subword level is that Latin encoding provides great flexibility in extracting proper BPE rules during segmentation, further reducing rare words and improving translation quality.…”
Section: A. Observations
confidence: 98%
“…The main focus of these libraries is script conversion and romanization. In this capacity they were successfully employed in diverse downstream multilingual NLP tasks such as neural machine translation (Zhang et al., 2020; Amrhein and Sennrich, 2020), morphological analysis (Hauer et al., 2019; Murikinati et al., 2020), named entity recognition (Huang et al., 2019) and part-of-speech tagging (Cardenas et al., 2019).…”
Section: Related Work
confidence: 99%