“…To alleviate this problem, some researchers modify the transformer architecture by adding alignment modules that predict the to-be-aligned target token (Zenkel et al., 2019, 2020) or modify the training loss by designing an alignment loss computed with the full target sentence (Garg et al., 2019; Zenkel et al., 2020). Others argue that attention weights alone are insufficient for generating clean word alignments and propose to induce alignments with feature importance measures, such as leave-one-out measures (Li et al., 2019) and gradient-based measures (Ding et al., 2019). However, all previous work induces the alignment for target word y_i at step i, when y_i is the decoder output.…”
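For context, the simplest attention-based induction that the cited work builds on (and criticizes) takes, for each decoding step i, the source position with the highest cross-attention weight as the alignment link for y_i. The following is a minimal sketch of that baseline, assuming a precomputed cross-attention matrix of shape (target length, source length); the function name and the toy matrix are illustrative, not part of any cited method.

```python
import torch

def align_from_attention(attn):
    """Induce alignment links from a cross-attention matrix.

    attn: tensor of shape (tgt_len, src_len) holding attention weights
          of each decoder step over the source tokens (how these weights
          are obtained is model-specific and assumed here).
    Returns a set of (target_index, source_index) links, taking the
    argmax source position for every target position.
    """
    links = set()
    for i, row in enumerate(attn):
        j = int(torch.argmax(row))  # most-attended source token for step i
        links.add((i, j))
    return links

# Toy example: 3 target tokens attending over 4 source tokens.
attn = torch.tensor([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])
print(align_from_attention(attn))  # {(0, 0), (1, 1), (2, 3)}
```

The leave-one-out and gradient-based measures mentioned above replace the raw attention weight in this argmax with a feature-importance score, but the per-step argmax selection of a source token follows the same pattern.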