2022
DOI: 10.48550/arxiv.2204.03382
Preprint

HunYuan_tvr for Text-Video Retrieval

Abstract: Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., between short clips and phrases or between single frames and words. In this paper, we propose a novel method, named HunYuan_tvr, to explore hierarchical cross-modal interactions by simultaneously exploring video-sentence, clip-…
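The hierarchy sketched in the abstract lends itself to a simple similarity computation. The snippet below is a hypothetical illustration, not the authors' implementation: it mean-pools frame and word embeddings into clip/phrase and video/sentence representations, scores each level with cosine similarity (using a max-over-tokens alignment at the finer levels), averages the three levels with equal weights, and trains with a symmetric InfoNCE contrastive loss. All function names, pooling windows, the equal weighting, and the temperature are illustrative assumptions.

```python
# Hypothetical sketch of hierarchical text-video matching in the spirit of
# the abstract: similarities at three granularities (video-sentence,
# clip-phrase, frame-word) are combined. NOT the authors' implementation;
# pooling choices and equal level weighting are assumptions made here.
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between rows of a (B, D) and b (B, D)."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()

def hierarchical_similarity(frame_feats, word_feats, clip_len=4, phrase_len=3):
    """
    frame_feats: (B, Nf, D) per-frame embeddings for B videos
    word_feats:  (B, Nw, D) per-word embeddings for B captions
    Returns a (B, B) video-to-text similarity matrix averaged over 3 levels.
    """
    # Frame-word level: best-matching word per frame, then mean over frames
    # (a simple max-over-words alignment; the paper may use something richer).
    fw = torch.einsum('bfd,cwd->bcfw',
                      F.normalize(frame_feats, dim=-1),
                      F.normalize(word_feats, dim=-1))
    frame_word = fw.max(dim=-1).values.mean(dim=-1)                  # (B, B)

    # Clip-phrase level: mean-pool frames into clips, words into phrases.
    clips = frame_feats.unfold(1, clip_len, clip_len).mean(-1)       # (B, Nc, D)
    phrases = word_feats.unfold(1, phrase_len, phrase_len).mean(-1)  # (B, Np, D)
    cp = torch.einsum('bcd,xpd->bxcp',
                      F.normalize(clips, dim=-1),
                      F.normalize(phrases, dim=-1))
    clip_phrase = cp.max(dim=-1).values.mean(dim=-1)                 # (B, B)

    # Video-sentence level: global mean pooling on both sides.
    video_sentence = cosine_sim(frame_feats.mean(1), word_feats.mean(1))

    # Equal weighting of the three levels is an assumption, not the paper's.
    return (frame_word + clip_phrase + video_sentence) / 3.0

def contrastive_loss(sim: torch.Tensor, temperature: float = 0.05):
    """Symmetric InfoNCE: matched (video, caption) pairs sit on the diagonal."""
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    B, Nf, Nw, D = 8, 12, 9, 64
    sim = hierarchical_similarity(torch.randn(B, Nf, D), torch.randn(B, Nw, D))
    print("similarity:", tuple(sim.shape), "loss:", contrastive_loss(sim).item())
```

The key design point the abstract argues for is that coarse video-sentence scores alone discard alignment signal; scoring clips against phrases and frames against words, then aggregating, is one straightforward way to recover it.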

Cited by 2 publications (2 citation statements)
References 23 publications
“…Most existing approaches (Cao et al. 2022; Chen, Liu, and Albanie 2021; Wang et al. 2022b; Liu et al. 2021; Yang et al. 2021; Suin and Rajagopalan 2020; Hou et al. 2020; Ryu et al. 2021; Lin, Gan, and Wang 2021; Xu et al. 2019; Chen et al. 2019; Zhang, Song, and Jin 2022) explore task-specific modules for different tasks. For example, for the video retrieval task, HiT (Liu et al. 2021) and HunYuan_tvr (Min et al. 2022) use a hierarchical matching strategy for cross-modal interaction. For the video captioning task, Open-book … Apart from task-specific modules, we believe a powerful video encoder can bring performance gains in any video-language task.…”
Section: Video-language Task (mentioning)
Confidence: 99%
“…For video captioning, we compare with the ORG-TRL (Zhang et al. 2020) method (… vs. 64 in SwinBERT (Lin et al. 2022)). For video retrieval, we compare with TS2-Net (Liu et al. 2022), HunYuan_tvr (Min et al. 2022), and the video-language pre-trained model OmniVL (Wang et al. 2022a). All methods use the same frame resolution. We observe that with minor adaptations and only a small part of the parameters updated, our method still achieves comparable performance.…”
Section: Comparison With SOTA (mentioning)
Confidence: 99%