Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

Mithun, Niluthpol Chowdhury; Li, Juncheng; Metze, Florian; Roy-Chowdhury, Amit K.

doi:10.1145/3206025.3206064

Cited by 210 publications

(109 citation statements)

References 29 publications


“…With big advances of deep learning in natural language processing and computer vision research, we observe an increased use of such techniques for video retrieval [7,24,34,36,37]. By directly encoding videos and text into a common space, these methods are concept free.…”

Section: Related Work (mentioning)

confidence: 99%

“…Though our goal is zero-example video retrieval, which corresponds to text-to-video retrieval in the table, video-to-text retrieval is also included for completeness. While [7] is less effective than [24], letting the former use the same loss function as the latter brings in a considerable performance gain, with the sum of recalls increased from 90.3 to 132.1. The result suggests the importance of assessing different video / text encoding strategies within the same common space learning framework.…”

Section: Experiments on MSR-VTT (mentioning)

confidence: 99%


Dual Encoding for Zero-Example Video Retrieval

Dong et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)



This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and end-to-end. As experiments on three benchmarks, i.e. MSR-VTT, TRECVID 2016 and 2017 Ad-hoc Video Search show, the proposed solution establishes a new state-of-the-art for zero-example video retrieval.


“…Each epoch training is just performed using a single GPU and takes no more than 10 minutes. … [30], MEE [27], MMEN [43], and JPoSE [43], and (3) other methods: JSFusion [49], CCA (FV HGLMM) [16], and Miech et al. [26]. The experimental results on MSR-VTT and LSMDC are summarized, respectively, in Table 1 and Table 2.…”

Section: Methods (mentioning)

confidence: 99%

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Yang, Dong, Cao, et al. 2020

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval



The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries, which are usually ineffective for complex queries that carry far more complex semantics. Recently, the embedding-based paradigm has emerged as a popular approach. It aims to map the queries and videos into a shared embedding space where semantically similar texts and videos are much closer to each other. Despite its simplicity, it forgoes the exploitation of the syntactic structure of text queries, making it suboptimal for modeling complex queries. To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method by jointly learning the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree to structurally describe the text query. We then design a tree-augmented query encoder to derive a structure-aware query representation and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and videos are mapped into a joint embedding space for matching and ranking. In this approach, we have a better understanding and modeling of the complex queries, thereby achieving a better video retrieval performance. Extensive experiments on large-scale video retrieval benchmark datasets demonstrate the effectiveness of our approach. CCS Concepts: Information systems → Multimedia and multimodal retrieval; Video search.


“…Numerous publications in recent years deal with multimodal information in retrieval tasks. The general problem of reducing or bridging the semantic gap [44] between images and text is the main issue in cross-media retrieval [3,34,35,39,50]. Fan et al. [8] tackle this problem by modeling humans' visual and descriptive senses with a multi-sensory fusion network.…”

Section: Multimedia Information Retrieval (mentioning)

confidence: 99%

Characterization and classification of semantic image-text relations

Otto, Springstein, Anand, et al. 2020

Int J Multimed Info Retr


The beneficial, complementary nature of visual and textual information to convey information is widely known, for example, in entertainment, news, advertisements, science, or education. While the complex interplay of image and text to form semantic meaning has been thoroughly studied in linguistics and communication sciences for several decades, computer vision and multimedia research remained on the surface of the problem more or less. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation in order to model complex image-text relations. In this paper, we motivate the necessity of an additional metric called Status in order to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes based on three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Further, we present a deep learning system to automatically predict either of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, whereby the predict-all approach outperforms the cascaded approach of the metric classifiers.

