“…For example, in text-to-video retrieval, the objective is to rank gallery videos based on the features of the query text. Recently, inspired by the success of self-supervised learning (Radford et al, 2021), significant progress has been made in CMR, including image-text retrieval (Radford et al, 2021; Li et al, 2020; Wang et al, 2020a), video-text retrieval (Chen et al, 2020; Cheng et al, 2021; Gao et al, 2021; Lei et al, 2021; Ma et al, 2022; Park et al, 2022; Wang et al, 2022a,b; Zhao et al, 2022; Wang and Shi, 2023), and audio-text retrieval (Oncescu et al, 2021), with satisfactory retrieval performance.…”
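
The ranking objective described in the excerpt can be illustrated with a minimal sketch. It assumes that the query text and each gallery video have already been encoded as feature vectors in a shared space, and that cosine similarity is the scoring function; the excerpt specifies neither, so both are assumptions made for illustration. The names `cosine` and `rank_gallery` and the toy feature values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices ordered from most to least similar to the query."""
    scores = [cosine(query_feat, g) for g in gallery_feats]
    return sorted(range(len(gallery_feats)), key=lambda i: scores[i], reverse=True)


# Toy, hand-made features (hypothetical); in practice these would come
# from learned text and video encoders.
text_query = [1.0, 0.0, 1.0]
video_gallery = [
    [0.0, 1.0, 0.0],  # video 0: orthogonal to the query
    [1.0, 0.1, 0.9],  # video 1: closely aligned with the query
    [0.5, 0.5, 0.5],  # video 2: partially aligned with the query
]

ranking = rank_gallery(text_query, video_gallery)
print(ranking)
```

Here the gallery is scored against a single query and sorted by descending similarity; standard retrieval metrics such as recall at rank k are then computed from the position of the ground-truth video in this ordering.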