“…Interestingly, the best performing model turned out to be nearly equivalent to the base model (described in Section 3.3), differing only in that it used 512-dimensional additive attention. While not the focus of this work, we were able to achieve further improvements by combining all of our insights into a single model described in Table 7.

Table 7: Comparison to RNNSearch (Jean et al., 2015), RNNSearch-LV (Jean et al., 2015), BPE (Sennrich et al., 2016b), BPE-Char (Chung et al., 2016), Deep-Att, Luong (Luong et al., 2015a), Deep-Conv (Gehring et al., 2016), GNMT (Wu et al., 2016), and OpenNMT (Klein et al., 2017). Systems with an * do not have a public implementation.…”
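As a reminder of what additive (Bahdanau-style) attention computes, the following is a minimal pure-Python sketch: scores are v · tanh(W_q q + W_k k_t), softmax-normalized into weights, which then mix the encoder states into a context vector. All variable names and the random initialization here are our own illustration (the model in the text uses learned parameters with dim = 512), not the paper's implementation.

```python
import math
import random

def additive_attention(query, keys, dim=4, seed=0):
    """Sketch of additive attention: score_t = v . tanh(W_q q + W_k k_t),
    weights = softmax(scores), context = sum_t weights_t * k_t.
    The best model in the text uses dim=512; we use a tiny dim for clarity."""
    rnd = random.Random(seed)
    d_q, d_k = len(query), len(keys[0])
    # Hypothetical randomly-initialised parameters; learned in practice.
    W_q = [[rnd.gauss(0, 0.1) for _ in range(d_q)] for _ in range(dim)]
    W_k = [[rnd.gauss(0, 0.1) for _ in range(d_k)] for _ in range(dim)]
    v = [rnd.gauss(0, 0.1) for _ in range(dim)]

    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    q_proj = matvec(W_q, query)
    scores = []
    for k in keys:
        k_proj = matvec(W_k, k)
        # v^T tanh(W_q q + W_k k_t): one scalar score per encoder state
        scores.append(sum(vi * math.tanh(a + b)
                          for vi, a, b in zip(v, q_proj, k_proj)))
    # Numerically stable softmax over the time dimension
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Context vector: attention-weighted sum of encoder states
    context = [sum(w * k[i] for w, k in zip(weights, keys))
               for i in range(d_k)]
    return weights, context
```

The attention weights always sum to one, so the context vector is a convex combination of the encoder states regardless of the attention dimensionality.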