“…Venn diagram displaying our proposed taxonomy of efficient VT designs (best viewed in color). TS (TimeSformer) and TSx [9], STVG-BERT [117], SCT [90], AVT [80], ViViT [11], SAVM [60], LVT [61], HERO [14], VideoSwin [12], VATT [49], MViT [48], COOT [71], SMT [134], Perceiver [130], STTran [131], VATNet [10], MART [58], HISAN [81], Dyadformer [142], PE [64], VTN [99], VMTN [129], PCSA [119], MDAM [55], TrDIMP [82], PMPNet [42], and Transfuser [132]. We describe Local, Axial, and Sparse approaches in Sec. 3.2.1, Hierarchical and Small-Q(ueries) in Sec.…”