Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1117
Dense Information Flow for Neural Machine Translation

Abstract: Recently, neural machine translation has achieved remarkable progress by introducing well-designed deep neural networks into its encoder-decoder framework. From the optimization perspective, residual connections are adopted to improve learning performance for both encoder and decoder in most of these deep architectures, and advanced attention connections are applied as well. Inspired by the success of the DenseNet model in computer vision problems, in this paper, we propose a densely connected NMT architecture…
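The dense connectivity the abstract refers to can be illustrated with a minimal sketch (this is an assumption-laden toy, not the paper's exact model): in a DenseNet-style stack, each layer receives the concatenation of the original input and the outputs of all preceding layers, rather than only the previous layer's output.

```python
# Toy sketch of DenseNet-style dense connections.
# Layer names and shapes are hypothetical; real NMT layers operate on
# hidden-state tensors, not flat Python lists.

def dense_forward(x, layers):
    """Run a stack of layers with dense connections.

    Each layer sees the concatenation of the input and every earlier
    layer's output, instead of only the immediately preceding output.
    """
    features = [x]  # all representations produced so far
    for layer in layers:
        concatenated = [v for feat in features for v in feat]
        features.append(layer(concatenated))
    return features[-1]

# Toy layers: each simply sums its (growing) input into a length-1 vector.
layers = [lambda v: [sum(v)] for _ in range(3)]
out = dense_forward([1.0, 2.0], layers)  # each layer's input keeps growing
```

Note how the input to each successive layer grows: this growing, concatenated feature set is what distinguishes dense connections from residual connections, which add rather than concatenate.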

Cited by 32 publications (23 citation statements); references 12 publications.
“…Neural Machine Translation Given the bilingual translation pair (x, y), an NMT model learns the parameter θ by maximizing the log-likelihood log P(y|x, θ). The encoder-decoder framework (Bahdanau et al., 2015; Luong et al., 2015b; Sutskever et al., 2014; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017; Shen et al., 2018) is adopted to model the conditional probability P(y|x, θ), where the encoder maps the input to a set of hidden representations h and the decoder generates each target token.¹ Many-to-many translation can be bridged through many-to-one and one-to-many translations. Our methods can also be extended to the many-to-many setting with some modifications.…”
Section: Introduction (mentioning)
confidence: 99%
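The training objective quoted above, maximizing log P(y|x, θ), decomposes over target tokens as a sum of per-token conditional log-probabilities. A minimal sketch of that decomposition (the "model" here is a hypothetical fixed distribution, purely to show the arithmetic):

```python
import math

# Sketch of the NMT maximum-likelihood objective:
#   log P(y | x, theta) = sum_t log P(y_t | y_<t, x, theta)
# token_prob stands in for a real conditional model P(y_t | y_<t, x).

def sequence_log_likelihood(target_tokens, token_prob):
    """Sum of per-token log-probabilities, the quantity NMT training maximizes."""
    return sum(math.log(token_prob(tok)) for tok in target_tokens)

# Toy conditional distribution: every token gets probability 0.5.
ll = sequence_log_likelihood(["a", "b", "c"], lambda tok: 0.5)
```

In practice the gradient of this sum with respect to θ is what the optimizer follows; the toy above only shows how the sequence-level likelihood factors into token-level terms.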
“…Recent studies show that different encoder layers capture linguistic properties of different levels (Peters et al., 2018), and aggregating layers is of profound value for better fusing semantic information (Shen et al., 2018; Dou et al., 2018; Dou et al., 2019). We assume that different decoder layers may value different levels of information, i.e.…”
Section: Input (mentioning)
confidence: 99%
“…Concerning natural language processing, Peters et al. (2018) have found that combining different layers is helpful, and their model significantly improves state-of-the-art models on various tasks. Researchers have also explored fusing information for NMT models and demonstrated that aggregating layers is also useful for NMT (Shen et al., 2018; Wang et al., 2018; Dou et al., 2018). However, all of these works mainly focus on static aggregation, in that their aggregation strategy is independent of specific hidden states.…”
Section: Related Work (mentioning)
confidence: 99%
“…Fusing information across layers for deep NMT models, however, has received substantially less attention. A few recent studies reveal that simultaneously exposing all layer representations outperforms methods that utilize just the top layer for natural language processing tasks (Peters et al., 2018; Shen et al., 2018; Wang et al., 2018; Dou et al., 2018). However, their methods mainly focus on static aggregation, in that the aggregation mechanisms are the same across different positions in the sequence.…”
Section: Introduction (mentioning)
confidence: 99%
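The "static aggregation" these citing papers contrast with dynamic approaches can be sketched as a fixed weighted combination of per-layer hidden states, with the same weights at every position in the sequence (names and shapes below are illustrative, not from any of the cited papers):

```python
# Sketch of static layer aggregation: per-layer hidden states for one
# position are fused with one fixed, position-independent weight vector.

def static_aggregate(layer_states, weights):
    """Combine per-layer hidden states with position-independent weights."""
    dim = len(layer_states[0])
    return [sum(w * state[i] for w, state in zip(weights, layer_states))
            for i in range(dim)]

# Three layers' outputs (2-dim states) at one position, fused with
# fixed weights that would be reused unchanged at every other position.
fused = static_aggregate([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
                         [0.5, 0.25, 0.25])
```

A dynamic scheme would instead compute the weights from the hidden states themselves (e.g. via attention), so that different positions could emphasize different layers; the fixed `weights` argument is exactly what makes the version above "static".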