To show the empirical effectiveness of our S3MA, we train it on MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). We compare with VLM (Xu et al., 2021a), HERO (Li et al., 2020a), VideoCLIP (Xu et al., 2021b), EAO (Shvetsova et al., 2022), OA-Trans (Wang et al., 2022a), RaP (Wu et al., 2022), LiteVL, NCL (Park et al., 2022b), TABLE (Chen et al., 2023), VOP (Huang et al., 2023), CLIP4Clip (Luo et al., 2022), X-CLIP (Ma et al., 2022a), DiscreteCodebook (Liu et al., 2022a), TS2-Net (Liu et al., 2022b), VCM, HiSE (Wang et al., 2022b), Align&Tell (Wang et al., 2022c), CenterCLIP (Zhao et al., 2022), and X-Pool (Gorti et al., 2022). Implementation details and evaluation protocols are deferred to the Appendix.
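Although the exact protocol is deferred to the Appendix, text-video retrieval on MSR-VTT, MSVD, and ActivityNet is conventionally reported with Recall@K (R@1/R@5/R@10), median rank (MdR), and mean rank (MnR) computed from a text-video similarity matrix. The sketch below illustrates that standard scoring only; it is not the paper's own evaluation code, and the function name `retrieval_metrics` and the diagonal ground-truth convention (text i matches video i) are assumptions.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Standard text-to-video retrieval metrics from a
    (num_texts, num_videos) similarity matrix, assuming the
    ground-truth video for text i is video i."""
    # Sort candidate videos by descending similarity for each text query.
    order = np.argsort(-sim, axis=1)
    # 0-based rank at which each query's ground-truth video appears.
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)
    return {
        "R@1": float(np.mean(ranks < 1) * 100),
        "R@5": float(np.mean(ranks < 5) * 100),
        "R@10": float(np.mean(ranks < 10) * 100),
        "MdR": float(np.median(ranks) + 1),  # median rank, 1-based
        "MnR": float(np.mean(ranks) + 1),    # mean rank, 1-based
    }

# Example: a perfectly diagonal similarity matrix gives R@1 = 100, MdR = 1.
print(retrieval_metrics(np.eye(3)))
```

Video-to-text results, when reported, are obtained the same way from the transposed similarity matrix.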