“…Currently, the dominant approaches are auto-regressive models, such as Recurrent Neural Networks (Mikolov et al., 2011), the Transformer (Vaswani et al., 2017), and Convolutional Seq2Seq (Gehring et al., 2017), which have achieved impressive performance on language generation when trained with Maximum Likelihood Estimation (MLE). Nevertheless, some studies reveal that this setting has three main drawbacks. First, MLE makes the generative model extremely sensitive to rare samples, which results in the learned distribution being too conservative (Feng and McCulloch, 1992; Ahmad and Ahmad, 2019). Second, autoregressive generation models suffer from exposure bias: during inference each step conditions on the model's own previously sampled outputs, whereas during training it conditions on ground-truth prefixes.…”
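
The exposure-bias point can be made concrete with a minimal sketch (not from the excerpt itself): under MLE each training step scores the next token given the ground-truth history (teacher forcing), while at inference the model must condition on its own samples, so early errors are fed back into later steps. The `next_token_probs` toy model and `VOCAB` below are hypothetical stand-ins for a trained RNN/Transformer decoder.

```python
import math
import random

# Toy vocabulary and next-token model standing in for an autoregressive decoder.
VOCAB = ["<bos>", "the", "cat", "sat", "<eos>"]

def next_token_probs(prefix):
    # Uniform toy distribution; a trained model would actually condition on `prefix`.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def teacher_forced_logprob(target):
    """MLE training: every step conditions on the GROUND-TRUTH prefix."""
    logprob = 0.0
    for t in range(1, len(target)):
        probs = next_token_probs(target[:t])   # ground-truth history
        logprob += math.log(probs[target[t]])
    return logprob

def free_running_sample(max_len=10):
    """Inference: every step conditions on the model's OWN previous samples,
    so early mistakes compound -- the source of exposure bias."""
    out = ["<bos>"]
    while len(out) < max_len and out[-1] != "<eos>":
        probs = next_token_probs(out)          # model-generated history
        out.append(random.choices(list(probs), weights=list(probs.values()))[0])
    return out

# Training-time objective vs. inference-time generation:
reference = ["<bos>", "the", "cat", "sat", "<eos>"]
print(teacher_forced_logprob(reference))  # log-likelihood under teacher forcing
print(free_running_sample())              # sequence produced by free-running decoding
```

The mismatch between the two loops above (ground-truth history during training vs. sampled history during inference) is what the passage refers to as exposure bias.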