To show the empirical effectiveness of our S3MA, we train it on MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). We compare with VLM (Xu et al., 2021a), HERO (Li et al., 2020a), VideoCLIP (Xu et al., 2021b), EAO (Shvetsova et al., 2022), OA-Trans (Wang et al., 2022a), RaP (Wu et al., 2022), LiteVL, NCL (Park et al., 2022b), TABLE (Chen et al., 2023), VOP (Huang et al., 2023), CLIP4Clip (Luo et al., 2022), X-CLIP (Ma et al., 2022a), DiscreteCodebook (Liu et al., 2022a), TS2-Net (Liu et al., 2022b), VCM, HiSE (Wang et al., 2022b), Align&Tell (Wang et al., 2022c), CenterCLIP (Zhao et al., 2022), and X-Pool (Gorti et al., 2022). Implementation details and evaluation protocols are deferred to the Appendix.
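Although the exact protocol is deferred to the Appendix, text-video retrieval on MSR-VTT, MSVD, and ActivityNet is conventionally reported with Recall@K (R@1/R@5/R@10), median rank (MdR), and mean rank (MnR) computed from a text-video similarity matrix. The sketch below illustrates that standard scoring only; it is not the paper's own evaluation code, and the function name `retrieval_metrics` and the diagonal ground-truth convention (text i matches video i) are assumptions.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Standard text-to-video retrieval metrics from a
    (num_texts, num_videos) similarity matrix, assuming the
    ground-truth video for text i is video i."""
    # Sort candidate videos by descending similarity for each text query.
    order = np.argsort(-sim, axis=1)
    # 0-based rank at which each query's ground-truth video appears.
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)
    return {
        "R@1": float(np.mean(ranks < 1) * 100),
        "R@5": float(np.mean(ranks < 5) * 100),
        "R@10": float(np.mean(ranks < 10) * 100),
        "MdR": float(np.median(ranks) + 1),  # median rank, 1-based
        "MnR": float(np.mean(ranks) + 1),    # mean rank, 1-based
    }

# Example: a perfectly diagonal similarity matrix gives R@1 = 100, MdR = 1.
print(retrieval_metrics(np.eye(3)))
```

Video-to-text results, when reported, are obtained the same way from the transposed similarity matrix.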