Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval
DOI: 10.1145/3206025.3206064

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

Abstract: Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While there are a number of recent successes in developing effective image-text retrieval methods by learning joint representations, the video-text retrieval task, however, has not been explored to its fullest extent. In this paper, we study how to effectively utilize available multimodal cues from videos for the cross-modal video-text retrieval task. Based on …
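The abstract (truncated above) describes projecting video and text into a common embedding space and ranking by similarity. A minimal sketch of that general idea in PyTorch, not the authors' exact model: the mean-pooled video encoder, the GRU text encoder, and all dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project video and text features into one shared space (illustrative sketch)."""
    def __init__(self, video_dim=2048, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)               # frame features -> shared space
        self.text_rnn = nn.GRU(text_dim, embed_dim, batch_first=True)  # word embeddings -> shared space

    def encode_video(self, frame_feats):        # frame_feats: (B, T, video_dim)
        pooled = frame_feats.mean(dim=1)        # simple temporal mean pooling
        return F.normalize(self.video_proj(pooled), dim=-1)

    def encode_text(self, word_embs):           # word_embs: (B, L, text_dim)
        _, h = self.text_rnn(word_embs)         # final GRU hidden state
        return F.normalize(h.squeeze(0), dim=-1)

model = JointEmbedding()
v = model.encode_video(torch.randn(4, 20, 2048))   # 4 videos, 20 frames each
t = model.encode_text(torch.randn(4, 12, 300))     # 4 captions, 12 words each
sim = v @ t.t()   # (4, 4) cosine similarities; retrieval ranks rows (v2t) or columns (t2v)
```

Once both encoders map into the same normalized space, cross-modal retrieval reduces to nearest-neighbor search over this similarity matrix.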

Cited by 210 publications (109 citation statements)
References 29 publications
“…With big advances of deep learning in natural language processing and computer vision research, we observe an increased use of such techniques for video retrieval [7,24,34,36,37]. By directly encoding videos and text into a common space, these methods are concept free.…”
Section: Related Work
confidence: 99%
“…Though our goal is zero-example video retrieval, which corresponds to text-to-video retrieval in the table, video-to-text retrieval is also included for completeness. While [7] is less effective than [24], letting the former use the same loss function as the latter brings in a considerable performance gain, with the sum of recalls increased from 90.3 to 132.1. The result suggests the importance of assessing different video/text encoding strategies within the same common space learning framework.…”
Section: Experiments on MSR-VTT
confidence: 99%
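The gain attributed above to the choice of loss function is, in this literature, typically obtained with a bidirectional max-margin ranking loss with hard-negative mining. A hedged sketch of that common formulation follows; the margin value and the hardest-negative reduction are illustrative choices, not values confirmed by this paper or by [24].

```python
import torch

def ranking_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss over a similarity matrix.

    sim: (N, N) video-text similarities; diagonal entries are the matched pairs.
    """
    n = sim.size(0)
    pos = sim.diag().view(n, 1)                         # matched-pair scores
    cost_v2t = (margin + sim - pos).clamp(min=0)        # video -> text margin violations
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)    # text -> video margin violations
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    cost_v2t = cost_v2t.masked_fill(mask, 0)            # don't penalize the positives
    cost_t2v = cost_t2v.masked_fill(mask, 0)
    # hardest-negative variant: penalize only the worst violator per query
    return cost_v2t.max(dim=1).values.mean() + cost_t2v.max(dim=0).values.mean()

loss = ranking_loss(torch.randn(4, 4))   # e.g., sim = v @ t.t() from the encoders above
```

Because the two encoders only interact through this matrix, the same loss can be reused to compare different video/text encoding strategies within one common-space framework, which is exactly the point the quoted experiment makes.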
“…Training for each epoch is performed on a single GPU and takes no more than 10 minutes. [30], MEE [27], MMEN [43], and JPoSE [43], and (3) other methods: JSFusion [49], CCA (FV HGLMM) [16], and Miech et al. [26]. The experimental results on MSR-VTT and LSMDC are summarized, respectively, in Table 1 and Table 2.…”
Section: Methods
confidence: 99%
“…Numerous publications in recent years deal with multimodal information in retrieval tasks. The general problem of reducing or bridging the semantic gap [44] between images and text is the main issue in cross-media retrieval [3,34,35,39,50]. Fan et al. [8] tackle this problem by modeling humans' visual and descriptive senses with a multi-sensory fusion network.…”
Section: Multimedia Information Retrieval
confidence: 99%