“…To address the first problem of redundant self-attention in the decoder, we swap the order of the self-attention and cross-attention layers. Thus, before any relations are inferred within the unknown prediction sequence, the prediction sequence first receives auto-regressive information from the deepest encoder feature map. This cross-attention step initializes the prediction sequence ahead of the first decoder self-attention more effectively than simple zero-initialization with a start token [15], randomly generated parameters [8], or the trend decomposition of the raw input sequence [25]. These alternative initialization schemes used by other TSFTs are either overly simplistic or inefficient.…”
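
To make the reordering concrete, below is a minimal PyTorch sketch of a decoder layer in which cross-attention precedes self-attention. The class and argument names (`CrossFirstDecoderLayer`, `pred_seq`, `enc_feat`, `d_model`, `n_heads`) are illustrative assumptions, not the paper's actual implementation; it only shows the layer ordering described above.

```python
import torch
import torch.nn as nn


class CrossFirstDecoderLayer(nn.Module):
    """Sketch of a decoder layer where cross-attention runs before
    self-attention, so the prediction sequence is initialized from the
    deepest encoder feature map before relating to itself (assumed names)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.dropout = nn.Dropout(dropout)

    def forward(self, pred_seq: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # 1) Cross-attention first: the placeholder prediction sequence queries
        #    the deepest encoder feature map, receiving auto-regressive context
        #    before any relation within itself is modeled.
        x = self.norm1(pred_seq + self.dropout(
            self.cross_attn(pred_seq, enc_feat, enc_feat, need_weights=False)[0]))
        # 2) Self-attention second: relations inside the now-initialized
        #    prediction sequence are inferred.
        x = self.norm2(x + self.dropout(
            self.self_attn(x, x, x, need_weights=False)[0]))
        # 3) Position-wise feed-forward network.
        return self.norm3(x + self.dropout(self.ff(x)))
```

In a standard Transformer decoder layer the two sublayers would appear in the opposite order; the only change sketched here is that the first sublayer consuming the prediction sequence is the one attending to the encoder output.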