Touvron et al. [13] improve the training strategy of ViT and propose a knowledge distillation method, which helps ViT achieve performance comparable to that of CNNs when trained only on ImageNet. Subsequently, various works explore efficient vision Transformer architecture designs, e.g., PVT [16,35], Swin [14,60], Twins [15], MViT [17,37], and others [61,36,62,38,39,63,26,64]. Transformers also demonstrate their superiority on various tasks, e.g., object detection [11,23], segmentation [21,24,20,65], pose estimation [22], tracking [19,66], and GANs [25,67].