“…To demonstrate the empirical efficiency of SUMA, we train models on MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Caba Heilbron et al., 2015). For a fair comparison, we compare our method only with methods based on CLIP (Radford et al., 2021), i.e., CLIP4Clip (Luo et al., 2022), CLIP2TV (Gao et al., 2021), X-CLIP, DiscreteCodebook (Liu et al., 2022a), TS2-Net (Liu et al., 2022b), CLIP2Video (Park et al., 2022), VCM, HiSE (Wang et al., 2022a), Align&Tell (Wang et al., 2022b), CenterCLIP (Zhao et al., 2022), and X-Pool (Gorti et al., 2022). Implementation details and evaluation protocols are deferred to the Appendix.…”