“…In sequential NLP problems (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017; Lin et al. 2017b; Xu et al. 2015), attention mechanisms are widely adopted in recurrent neural networks (RNNs) (Pang et al. 2019), Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997), SCS (Pang et al. 2020b), and the Transformer (Vaswani et al. 2017) to capture the relationships between words or sentences. In computer vision, many tasks such as fine-grained recognition (Fu, Zheng, and Mei 2017; Wang et al. 2015; Fang et al. 2018; Pang et al. 2020c), image captioning (Anderson et al. 2018; Anne Hendricks et al. 2016; Xu et al. 2015), classification (Mnih et al. 2014; Hu, Shen, and Sun 2018; Woo et al. 2018; Wang et al. 2017; Tang et al. 2020), and segmentation (Ren and Zemel 2017; Chen et al. 2016; Cao et al. 2020) also employ attention mechanisms based on soft attention maps or bounding boxes to locate salient regions. Moreover, self-attention structures (Wang et al. 2018; Zhu et al. 2019; Huang et al. 2018; Dai et al. 2019), which focus on the combination weights of elements (pixels in vision), are another family of attention methods that use an adjacency-like matrix to represent attention.…”
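To make the combination-weight view concrete, the following is a minimal sketch of scaled dot-product self-attention in the spirit of Vaswani et al. (2017), where the N×N attention matrix plays the role of the adjacency-like matrix over elements; the function and variable names are illustrative, not taken from any of the cited works.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over N elements (rows of x).

    x:             (N, d) element features (e.g., word embeddings or pixel features)
    w_q, w_k, w_v: (d, d_k) learned projection matrices (random here for illustration)
    Returns the attended features and the (N, N) attention matrix,
    which acts as a soft adjacency matrix of combination weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise similarities, shape (N, N)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over elements (rows sum to 1)
    return attn @ v, attn                          # weighted combination of values

# Illustrative usage with random features
rng = np.random.default_rng(0)
N, d, d_k = 6, 16, 8
x = rng.standard_normal((N, d))
w_q, w_k, w_v = (rng.standard_normal((d, d_k)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (6, 8) (6, 6)
```

Each row of `attn` gives the combination weights with which one element attends to all others, which is the sense in which self-attention can be read as a learned, input-dependent adjacency matrix.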