“…As an emerging and challenging cross-modal task, video moment retrieval using language (VMR) (Anne Hendricks et al 2017; Gao et al 2017) has drawn increasing attention in recent years due to its various applications, such as video understanding (Liu et al 2023h, 2020, 2021b, 2023b, 2022a, 2021a, 2023g,a, 2022c, 2023c,d, 2022b; Fang et al 2021a) and temporal action localization (Zhang et al 2020b; Fang et al 2022, 2023a; Ji et al 2023e, 2018, 2023g,f,d,c, 2021, 2020, 2019). As shown in Figure 1(a), the VMR task targets locating a video…”