“…cess long sequences (Rae et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Roy et al., 2021). Sparse attention, relative position encoding (Shaw et al., 2018; Raffel et al., 2020; Guo et al., 2021), recurrence mechanisms and memory (Dai et al., 2019; Weston et al., 2015; Hutchins et al., 2022), and other tricks (Shen et al., 2020; Katharopoulos et al., 2020; Gupta and Berant, 2020; Stock et al., 2021; Yogatama et al., 2021; Borgeaud et al., 2021; Hawthorne et al., 2022) are commonly adopted by recent Transformer variants to make operations on long sequences more time- and memory-efficient.…”
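To make the sparse-attention idea mentioned above concrete, the sketch below illustrates one common flavor, sliding-window (local) attention, in which each query only attends to keys inside a fixed-size window so cost scales with the window width rather than quadratically in sequence length. This is an illustrative assumption-laden example, not the implementation from any of the cited papers; the function name `local_attention`, the window size, and the toy shapes are all made up for demonstration.

```python
# Minimal sketch of sliding-window sparse attention (single head).
# Each position i attends only to positions within [i - window, i + window],
# so the effective cost is O(n * window) instead of O(n^2).
# For clarity this toy version still materializes the dense (n, n) score matrix;
# real sparse-attention implementations avoid that.
import numpy as np

def local_attention(q, k, v, window):
    """Attention where position i attends only to a local window of keys."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                          # (n, n) scaled dot-product scores
    idx = np.arange(n)
    banned = np.abs(idx[:, None] - idx[None, :]) > window  # True outside the local band
    scores = np.where(banned, -np.inf, scores)             # mask out long-range pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax over the band
    return weights @ v                                     # (n, d) context vectors

# Toy usage: 8 positions, head dimension 4, window of 2 on each side.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
out = local_attention(q, k, v, window=2)
print(out.shape)  # (8, 4)
```

Methods such as Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) combine local windows of this kind with a few global or random attention positions; the sketch omits those additions for brevity.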