“…Recently, Transformers have demonstrated great success in computer vision tasks such as image classification [10,16,20,21,23], object detection [3,4,24,27], semantic segmentation [1,19], and action recognition [7,13,15]. Thanks to their self-attention-based architecture, Vision Transformers (ViT) [6] outperform classical Convolutional Neural Networks (CNNs) [17], which had achieved state-of-the-art results.…”