“…DeiT [42] strengthens ViT by introducing a powerful training recipe and adopting knowledge distillation. Building on the success of ViT, many efforts have been devoted to improving it and adapting it to various vision tasks, including image classification [42,43,15,53,26,32,20], object localization/detection [18,51,32,17], and image segmentation [51,32,40,7].…”