“…Many other ViT variants [8,13,21,22,25,37,54,60,70] have since been proposed, achieving promising performance relative to their CNN counterparts on image analysis tasks [6,23,74]. Recently, some works have introduced vision transformers for video understanding tasks such as action recognition [1,3,4,15,20,38,42], action detection [36,58,62,73], video super-resolution [5], video inpainting [32,71], and 3D animation [9]. Some works [20,42] perform temporal contextual modeling with transformers on top of single-frame features from pretrained 2D networks, while other works [1,3,4,15,38] mine spatio-temporal attention directly with video transformers.…”