“…Following ViT, many transformer-based architectures such as PCT [27], IPT [79], T2T-ViT [44], DeepViT [167], SETR [81], PVT [45], CaiT [168], TNT [82], Swin-transformer [46], Query2Label [83], MoCoV3 [84], BEiT [85], SegFormer [86], FuseFormer [169], and MAE [170] have appeared, achieving excellent results on many kinds of visual tasks, including image classification, object detection, semantic segmentation, point cloud processing, action recognition, and self-supervised learning.…”