2019
DOI: 10.48550/arxiv.1906.02041
Preprint

Imitation Learning for Non-Autoregressive Neural Machine Translation

Cited by 20 publications (13 citation statements)
References 16 publications
“…We also compare to other parallel generation methods. These include a latent variable approach: FlowSeq [28]; refinement-based approaches: CMLM [11], Levenshtein transformer [15] and SMART [12]; a mixed approach: Imputer [41]; reinforcement learning: Imitate-NAT [55]; and another sequence-based approach: NART-DCRF [49] which combines a non-autoregressive model with a 1st-order CRF. Several of these methods use fully autoregressive reranking [13], which generally gives further improvements but requires a separate test-time model.…”
Section: Methods
confidence: 99%
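The autoregressive reranking mentioned in this citation can be illustrated with a short sketch: the parallel model proposes several candidate translations (for example, one per predicted target length), and a separate autoregressive model scores them at test time to pick the best. The functions `nat_generate` and `ar_log_prob` below are hypothetical stand-ins, not APIs from any of the cited systems.

```python
# Hedged sketch of autoregressive reranking for a non-autoregressive translator.
from typing import Callable, List, Sequence


def rerank_candidates(
    source: Sequence[str],
    candidate_lengths: List[int],
    nat_generate: Callable[[Sequence[str], int], List[str]],
    ar_log_prob: Callable[[Sequence[str], List[str]], float],
) -> List[str]:
    """Decode one hypothesis per candidate target length with the parallel model,
    then keep the hypothesis that the autoregressive model scores highest."""
    candidates = [nat_generate(source, n) for n in candidate_lengths]
    scores = [ar_log_prob(source, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```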
“…There has been extensive interest in non-autoregressive/parallel generation approaches, aiming at producing a sequence in parallel sub-linear time w.r.t. sequence length [13,52,26,65,53,14,11,12,48,15,28,16,49,55,30,41,64,62]. Existing approaches can be broadly classified as latent variable based [13,26,65,28,41], refinement-based [25,48,14,15,11,30,12,62] or a combination of both [41].…”
Section: Related Work
confidence: 99%
“…This also alleviates the multimodality problem (Gu et al 2017) in non-autoregressive generation. Many non-autoregressive translation methods have been proposed for better alignment, such as fertility (Gu et al 2017), SoftCopy (Wei et al 2019), or adding a reordering module (Ran et al 2019). However, the source and target words in monolingual generation tasks cannot be aligned directly like translation.…”
Section: Motivation
confidence: 99%
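A minimal sketch of a SoftCopy-style alignment follows, assuming decoder inputs are built as a softmax-weighted mixture of source embeddings based on position distance; the temperature `tau` and the exact weighting are assumptions, and the formulation in Wei et al. (2019) may differ in detail.

```python
# Sketch of a SoftCopy-style decoder-input initialization: each target position
# receives a position-distance-weighted mixture of source token embeddings.
import numpy as np


def soft_copy(src_emb: np.ndarray, tgt_len: int, tau: float = 0.3) -> np.ndarray:
    """src_emb: (src_len, d) source embeddings -> (tgt_len, d) decoder inputs."""
    src_len = src_emb.shape[0]
    src_pos = np.arange(src_len)
    tgt_pos = np.arange(tgt_len)
    # Negative absolute distance between target and source positions, scaled by tau.
    logits = -np.abs(tgt_pos[:, None] - src_pos[None, :]) / tau
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over source positions
    return weights @ src_emb                        # (tgt_len, d)
```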
“…We use the BERT-base model (n_layers = 12, n_heads = 12, d_hidden = 768) and the BERT-small model (n_layers = 12, n_heads = 12, d_hidden = 384) as our backbones, based on huggingface transformers (Wolf et al 2020). The model weights are initialized with unilm1.2base-uncased (Bao et al 2020), bert-base-uncased (Devlin et al 2018), and MiniLMv1-L12-H384-uncased (Wang et al 2020), respectively.…”
Section: Experimental Setting
confidence: 99%
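A sketch of how such backbones could be loaded and inspected with huggingface transformers; only bert-base-uncased is loaded by name here, since the exact hub identifiers for the UniLM and MiniLM checkpoints quoted above would need to be checked.

```python
from transformers import AutoConfig, AutoModel

# BERT-base backbone: 12 layers, 12 heads, hidden size 768.
base = AutoModel.from_pretrained("bert-base-uncased")
print(base.config.num_hidden_layers, base.config.num_attention_heads, base.config.hidden_size)

# A BERT-small-style config (hidden size 384, 12 layers, 12 heads). In the quoted
# setting the weights would come from a MiniLM checkpoint rather than random init.
small_cfg = AutoConfig.from_pretrained(
    "bert-base-uncased",
    hidden_size=384,
    intermediate_size=1536,
)
small = AutoModel.from_config(small_cfg)  # randomly initialized here
```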
“…The early pioneering work on parallel generation is [42], which generated the words of selected objects first and filled in the rest of the sentence with a two-pass process. Several recent works attempt to accelerate generation by introducing a NAIC framework [14,37], which produces the entire sentence simultaneously. Although they accelerate decoding significantly, NAIC models suffer from word repetition and omission problems.…”
Section: Related Work
confidence: 99%
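The repetition and omission problems mentioned above stem from predicting every position independently in a single parallel pass; the toy sketch below, with random logits standing in for a NAIC-style decoder's outputs, makes that independence explicit.

```python
# Toy illustration of single-pass parallel decoding: every position is predicted
# simultaneously by an independent argmax, with no conditioning on the other
# predicted words, which is what permits repeated or missing words.
import numpy as np


def parallel_decode(logits: np.ndarray) -> list:
    """logits: (tgt_len, vocab) per-position scores -> list of word ids."""
    return logits.argmax(axis=-1).tolist()


rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
print(parallel_decode(logits))  # may contain the same word id at several positions
```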