Proceedings of the Third Conference on Machine Translation: Research Papers 2018
DOI: 10.18653/v1/w18-6303

Neural Machine Translation of Logographic Language Using Sub-character Level Information

Abstract: Recent neural machine translation (NMT) systems have been greatly improved by encoder-decoder models with attention mechanisms and sub-word units. However, important differences between languages with logographic and alphabetic writing systems have long been overlooked. This study focuses on these differences and uses a simple approach to improve the performance of NMT systems utilizing decomposed sub-character level information for logographic languages. Our results indicate that our approach not only improves…
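The core preprocessing idea in the abstract can be illustrated with a short sketch: expand each logographic character into its sub-character (ideograph) components before ordinary segmentation and NMT training. The tiny decomposition table and separator token below are illustrative placeholders, not the database or conventions used in the paper.

```python
# Illustrative sketch only: expand logographic characters into
# sub-character (ideograph) sequences before normal sub-word
# segmentation. The tiny DECOMPOSITION table is a placeholder,
# not the ideograph database the paper draws on.

DECOMPOSITION = {
    "好": ["女", "子"],   # "good" = woman + child
    "明": ["日", "月"],   # "bright" = sun + moon
}

def decompose(sentence: str, sep: str = "##") -> str:
    """Replace each character with its components when a decomposition
    is known; a separator token marks where one source character ends."""
    tokens = []
    for ch in sentence:
        parts = DECOMPOSITION.get(ch)
        if parts:
            tokens.extend(parts)
            tokens.append(sep)        # boundary of one decomposed character
        else:
            tokens.append(ch)         # keep characters we cannot decompose
    return " ".join(tokens)

if __name__ == "__main__":
    print(decompose("明日好"))        # 日 月 ## 日 女 子 ##
```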

Cited by 41 publications (15 citation statements). References 15 publications.
“…On general test sets, we see BLEU degradation compared to the baseline, especially for Japanese-English. We note that our Japanese-English ASPEC decomposed-training score is similar to the result for the same set achieved by Zhang and Komachi (2018) with ideograph decomposition. However, our non-decomposed baseline is much stronger, and so we are not able to replicate their finding that training with subcharacter decomposition is beneficial to NMT from logographic languages to English.…”
Section: Training With Decomposition (supporting)
confidence: 77%
“…We first explore the impact of two variations on ideograph-based sub-character decomposition applied to all characters in the source language. Following Zhang and Komachi (2019), we use decomposition information from the CHISE project, which provides ideograph sequences for CJK (Chinese-Japanese-Korean) characters.…”
Section: Training With Sub-character Decomposition (mentioning)
confidence: 99%
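As a companion to the statement above, the ideograph sequences it refers to can be read from an IDS (Ideographic Description Sequence) table. The sketch below assumes a tab-separated layout of code point, character, and IDS string, which is how CHISE IDS dumps are commonly distributed; the comment-line convention and recursion depth are assumptions for illustration.

```python
# Sketch of reading ideographic description sequences (IDS) and expanding
# characters into components. Assumes tab-separated lines of the form
# <code point>\t<character>\t<IDS>, as in common CHISE IDS dumps; adjust
# the parsing if your copy of the data differs.

IDC_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")  # layout operators, dropped here

def load_ids(path: str) -> dict:
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(";;"):              # skip comment lines, if any
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                _, char, ids = fields[:3]
                table[char] = [c for c in ids if c not in IDC_OPERATORS]
    return table

def expand(char: str, table: dict, depth: int = 1) -> list:
    """Expand one character into components, recursing up to `depth` levels."""
    parts = table.get(char, [char])
    if depth <= 1 or parts == [char]:
        return parts
    expanded = []
    for part in parts:
        expanded.extend(expand(part, table, depth - 1))
    return expanded
```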
“…Finally, translation accuracy was improved by preprocessing the bilingual data. Zhang and Komachi (2018) demonstrated that higher translation accuracy can be obtained by decomposing Kanji into ideographic characters and strokes in Japanese-Chinese NMT. Stratos (2017) proposed a part-of-speech tagging model for Korean with character-level tokenization and decomposition into phonemes, demonstrating an improvement in tagging accuracy.…”
Section: Related Work (mentioning)
confidence: 99%
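For the Korean phoneme decomposition mentioned in the statement above, the standard Unicode arithmetic for precomposed Hangul syllables yields the jamo directly. This is a minimal sketch of that decomposition, not the preprocessing code from the cited work.

```python
# Decompose precomposed Hangul syllables into jamo using the standard
# Unicode arithmetic (syllables occupy U+AC00..U+D7A3). Illustrative
# sketch of phoneme-level decomposition only.

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT, S_COUNT = 21, 28, 11172

def to_jamo(text: str) -> str:
    out = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:                    # precomposed Hangul syllable
            lead = idx // (V_COUNT * T_COUNT)
            vowel = (idx % (V_COUNT * T_COUNT)) // T_COUNT
            tail = idx % T_COUNT
            out.append(chr(L_BASE + lead))
            out.append(chr(V_BASE + vowel))
            if tail:                              # trailing consonant, if any
                out.append(chr(T_BASE + tail))
        else:
            out.append(ch)                        # pass other characters through
    return " ".join(out)

if __name__ == "__main__":
    print(to_jamo("한국"))                        # ᄒ ᅡ ᆫ ᄀ ᅮ ᆨ
```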
“…They proved that neural machine translation can be done directly on a sequence of characters without any explicit word segmentation. Zhang and Komachi (2018) proposed sub-character-level translation for Japanese and Chinese, in which Kanji in Japanese and characters in Chinese are decomposed into ideographs or strokes. However, this approach increases sequence length considerably and requires an extra dictionary to decompose Kanji and Chinese characters into strokes or ideographs. Costa-jussà and Fonollosa (2016) used convolution layers followed by multiple highway layers to generate character-based word embeddings.…”
Section: Introduction (mentioning)
confidence: 99%
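The last sentence of the statement above describes character-based word embeddings built from convolutions and highway layers. Below is a minimal PyTorch sketch of that general architecture; the module names, kernel widths, and dimensions are assumed for illustration and are not the cited paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """One highway layer: a learned gate mixes a transform with the identity."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * F.relu(self.transform(x)) + (1 - t) * x

class CharWordEmbedding(nn.Module):
    """Character-based word embeddings: character embeddings -> 1D convolutions
    with max-over-time pooling -> stacked highway layers."""
    def __init__(self, n_chars: int, char_dim: int = 16,
                 kernel_widths=(2, 3, 4), filters_per_width: int = 64,
                 n_highway: int = 2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, filters_per_width, w) for w in kernel_widths
        )
        out_dim = filters_per_width * len(kernel_widths)
        self.highways = nn.ModuleList(Highway(out_dim) for _ in range(n_highway))

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) integer character ids per word
        x = self.char_emb(char_ids).transpose(1, 2)       # (n_words, char_dim, len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)                      # one vector per word
        for layer in self.highways:
            h = layer(h)
        return h

# Example: embed a batch of 8 words, each padded to 10 characters.
emb = CharWordEmbedding(n_chars=500)
words = torch.randint(1, 500, (8, 10))
print(emb(words).shape)                                   # torch.Size([8, 192])
```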