Although deep neural networks have recently led to great achievements in machine translation (MT), the development of Korean-Vietnamese MT systems still faces several challenges. Korean is a morphologically rich language and Vietnamese is an analytic language, and neither has clear word boundaries. The high rate of homographs in Korean causes word ambiguities, which in turn cause problems in neural MT (NMT). In addition, because Korean-Vietnamese is a low-resource language pair, no freely available, adequate parallel corpus exists for training translation models. In this paper, we manually built a lexical semantic network tailored to the special characteristics of Korean and used it as the knowledge base for our Korean morphological analysis and word-sense disambiguation system, UTagger. We also constructed a large Korean-Vietnamese parallel corpus, in which we applied the state-of-the-art Vietnamese word segmentation method RDRsegmenter to the Vietnamese texts and UTagger to the Korean texts. Finally, we built a bi-directional Korean-Vietnamese NMT system based on the attention-based encoder-decoder architecture. The experimental results indicated that UTagger and RDRsegmenter could significantly improve the performance of the Korean-Vietnamese NMT system, which reached 27.79 BLEU points and 58.77 TER points in the Korean-to-Vietnamese direction and 25.44 BLEU points and 58.72 TER points in the reverse direction.

INDEX TERMS Korean-Vietnamese machine translation, Korean-Vietnamese parallel corpus, lexical semantic network, morphological analysis, neural machine translation, word sense disambiguation.

I. INTRODUCTION

Neural machine translation based on the attention-based encoder-decoder model [1], [2] has emerged as the dominant paradigm in MT. It has achieved state-of-the-art performance on language pairs with large parallel training corpora, such as English-French [3] and English-German [4].
However, it has shown poor translation quality in low-resource language pairs where training parallel corpora are scarce [5], [6]. Korean-Vietnamese is a low-resource language pair, and Korean-Vietnamese MT systems need to be built to serve

The associate editor coordinating the review of this manuscript and approving it for publication was Yang Zhen.