Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence 2021
DOI: 10.24963/ijcai.2021/154
Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Abstract: Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods exploit multi-modal cues by aggregating them into holistic high-level semantics and matching these against text representations in a global view. In contrast to this global alignment, the local alignment between the detailed semantics encoded within multi-modal cues and distinct phrases remains under-explored. Thus, in this paper, we leverage the h…
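To make the global-versus-local distinction above concrete, here is a minimal, hypothetical sketch (not the paper's actual hierarchical alignment): global alignment compares one pooled video vector with one pooled sentence vector, while local alignment scores each phrase against each multi-modal cue and aggregates. The 512-d features and pooling choices are illustrative assumptions.

```python
# Hedged illustration of global vs. local alignment, not the cited method.
import torch
import torch.nn.functional as F

def global_score(cue_feats: torch.Tensor, phrase_feats: torch.Tensor) -> torch.Tensor:
    # cue_feats: (M, d) multi-modal cue features; phrase_feats: (P, d) phrase features
    v = F.normalize(cue_feats.mean(dim=0), dim=-1)     # pool cues into one holistic video vector
    t = F.normalize(phrase_feats.mean(dim=0), dim=-1)  # pool phrases into one sentence vector
    return v @ t                                       # single global similarity

def local_score(cue_feats: torch.Tensor, phrase_feats: torch.Tensor) -> torch.Tensor:
    # phrase-to-cue similarity matrix, then best-matching cue per phrase, averaged
    sims = F.normalize(phrase_feats, dim=-1) @ F.normalize(cue_feats, dim=-1).T  # (P, M)
    return sims.max(dim=1).values.mean()

cues, phrases = torch.randn(4, 512), torch.randn(3, 512)
score = global_score(cues, phrases) + local_score(cues, phrases)
```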

Cited by 17 publications (8 citation statements); References: 17 publications.
“…Text-Video Retrieval aims to find the most semantically relevant video given a text query (text → video). Early research is devoted to distilling knowledge from "expert" models based on offline-extracted single-modality features [8,13,26,44]. However, the performance is far from satisfactory due to the significant domain gap.…”
Section: Related Work
confidence: 99%
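As a minimal illustration of the retrieval formulation in the excerpt above, the sketch below ranks candidate videos by cosine similarity to a text query embedding. The encoders are omitted and the 512-d features are placeholder assumptions, not part of any cited method.

```python
# Hypothetical text -> video retrieval by similarity ranking over cached embeddings.
import torch
import torch.nn.functional as F

def retrieve(text_emb: torch.Tensor, video_embs: torch.Tensor, top_k: int = 5):
    """Rank videos by cosine similarity to a single query embedding.

    text_emb:   (d,)    embedding of the query sentence
    video_embs: (N, d)  embeddings of the N candidate videos
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_embs = F.normalize(video_embs, dim=-1)
    sims = video_embs @ text_emb          # (N,) cosine similarities
    return sims.topk(top_k)               # top-k scores and video indices

# Random features stand in for real encoder outputs.
scores, indices = retrieve(torch.randn(512), torch.randn(1000, 512))
```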
“…With the release of the large-scale instructional dataset HowTo100M, VTP has spurred significant interest in the community. Overall, the mainstream methods can be broadly grouped into two classes: 1) Generative methods: several methods [28,34,50,11,20,56,31,55] try to extend BERT [53] to the cross-modal domain, i.e., they accept both visual and textual tokens as input and perform the masked-token prediction task. 2) Discriminative methods.…”
Section: Related Work
confidence: 99%
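The "generative" recipe described in the excerpt, feeding visual and textual tokens jointly to a BERT-style encoder with a masked-token prediction head, might be sketched as follows. The dimensions, vocabulary size, and layer counts are illustrative assumptions, not values from any cited method.

```python
# Hedged sketch of joint visual-textual masked-token prediction.
import torch
import torch.nn as nn

class JointMaskedLM(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_layers=4, n_heads=12):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(2048, d_model)   # project frame features into the token space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, text_ids):
        # frame_feats: (B, T, 2048) pre-extracted visual features
        # text_ids:    (B, L) token ids, some positions replaced by a [MASK] id
        tokens = torch.cat([self.visual_proj(frame_feats),
                            self.text_embed(text_ids)], dim=1)
        hidden = self.encoder(tokens)
        text_hidden = hidden[:, frame_feats.size(1):]  # keep only the textual positions
        return self.mlm_head(text_hidden)              # (B, L, vocab) logits for the masked-token loss

logits = JointMaskedLM()(torch.randn(2, 8, 2048), torch.randint(0, 30522, (2, 16)))
```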
“…Video-Language Pre-training. Early VLP methods [30,56,45,53,17,48] introduce models pretrained on other tasks to pre-extract video representations. Some of them [30,56,45] utilize action recognition backbones [15,19] to pre-extract video representations.…”
Section: Related Work
confidence: 99%
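A hedged sketch of the "pre-extract video representations" step: a frozen action-recognition backbone produces pooled clip features that would be cached offline. torchvision's R3D-18 is used here purely as a stand-in for the backbones cited above.

```python
# Offline feature extraction with a frozen action-recognition backbone (illustrative).
import torch
from torchvision.models.video import r3d_18

backbone = r3d_18(pretrained=True)        # downloads Kinetics-400 weights
backbone.fc = torch.nn.Identity()         # drop the classification head, keep 512-d features
backbone.eval()

@torch.no_grad()
def extract_features(clip: torch.Tensor) -> torch.Tensor:
    """clip: (B, 3, T, H, W) normalized RGB frames -> (B, 512) pooled clip features."""
    return backbone(clip)

feats = extract_features(torch.randn(2, 3, 16, 112, 112))   # cached to disk in practice
```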
“…These backbones are designed with 2D [20] and 3D [19] CNNs to capture spatial and temporal information in videos. Others [53,35,17,48] fuse multiple "Experts" trained on different modalities, such as audio classification [21], OCR [17], and image classification [22], to fully exploit cross-modal high-level semantics in videos. Recently, end-to-end models [37,29,4,54,16] have been proposed.…”
Section: Related Work
confidence: 99%
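The multi-"Expert" fusion idea in the excerpt could look roughly like the gated-fusion sketch below: per-modality features are projected into a shared space and combined with learned gating weights. The expert names, feature dimensions, and the gating scheme are assumptions for illustration, not the cited architectures.

```python
# Hedged sketch of gated fusion over modality-specific "expert" features.
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    def __init__(self, expert_dims: dict, d_model: int = 512):
        super().__init__()
        self.proj = nn.ModuleDict({k: nn.Linear(d, d_model) for k, d in expert_dims.items()})
        self.gate = nn.ModuleDict({k: nn.Linear(d, 1) for k, d in expert_dims.items()})

    def forward(self, experts: dict) -> torch.Tensor:
        # experts: {name: (B, d_k)} one pooled feature per expert per video
        gates = torch.softmax(
            torch.cat([self.gate[k](v) for k, v in experts.items()], dim=1), dim=1)   # (B, K)
        projected = torch.stack([self.proj[k](v) for k, v in experts.items()], dim=1)  # (B, K, d)
        return (gates.unsqueeze(-1) * projected).sum(dim=1)                            # (B, d) fused embedding

fusion = ExpertFusion({"audio": 128, "ocr": 300, "appearance": 2048})
fused = fusion({"audio": torch.randn(4, 128),
                "ocr": torch.randn(4, 300),
                "appearance": torch.randn(4, 2048)})
```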