2020
DOI: 10.1007/978-3-030-58542-6_9
Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval

Cited by 46 publications (17 citation statements)
References 34 publications
“…At present, CNN-based image hashing networks have also achieved state-of-the-art results [16,25,53,58], but video hashing has received little attention. The main categories of image retrieval are image-image [5,8,13,14,26,42,43], image-text-image [11,12,54], and hashing algorithms [53,58]. Most researchers focus on image retrieval, and less attention has been paid to video hashing.…”
Section: Related Work
confidence: 99%
“…where x_av, x_iv are the compositional embeddings. This operation is related to prior works that compose multi-modal features [61,14,13], but ours aims at shifting the teacher embedding with a learnable residual. More importantly, to constrain the class assignment of the compositional embeddings, F(•, •) is optimised by the video classification loss (i.e.…”
Section: Compositional Multi-modal Representations
confidence: 99%
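The composition described in the statement above, a teacher embedding shifted by a learnable residual, can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the residual function, its parameterisation, and all variable names (`W`, `compose`, `x_v`, `x_a`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Learnable parameters of a hypothetical residual function F (random init here;
# in practice these would be trained with the video classification loss).
W = rng.normal(scale=0.1, size=(2 * dim, dim))

def compose(x_teacher, x_other, W):
    """Compositional embedding: teacher embedding plus a learned residual
    predicted from the concatenation of both modality embeddings."""
    residual = np.tanh(np.concatenate([x_teacher, x_other]) @ W)
    return x_teacher + residual

x_v = rng.normal(size=dim)   # visual (teacher) embedding
x_a = rng.normal(size=dim)   # audio embedding
x_av = compose(x_v, x_a, W)  # compositional audio-visual embedding, shape (dim,)
```

The point of the residual form is that the compositional embedding stays close to the teacher embedding when the residual is small, which keeps its class assignment anchored.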
“…Based on our observation that image samples are more robustly represented in the joint space, and the task's tendency to incorporate modification sentences into an image representation, we compose text embeddings with image embeddings from the joint space instead, and observe that this outperforms composing with image embeddings from the pretrained embedding space. Parallel to our work, [7] has incorporated side information into the text-based retrieval task. Though the approach is similar, we see the improvement as a proof of concept.…”
Section: Prior Work, Fusion of Vision and Language
confidence: 99%
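The retrieval setup in the statement above, composing a modification-text embedding with an image embedding taken from the joint space and ranking gallery images by similarity, can be sketched like this. All embeddings here are random stand-ins and the additive composition is an assumption; the cited works use learned composition modules.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16

def l2norm(x):
    """Normalise vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_joint = l2norm(rng.normal(size=dim))       # reference image, joint space
text_mod  = l2norm(rng.normal(size=dim))       # modification-sentence embedding
query = l2norm(img_joint + text_mod)           # simple additive composition

gallery = l2norm(rng.normal(size=(100, dim)))  # candidate image embeddings
scores = gallery @ query                       # cosine similarity (unit norms)
best = int(np.argmax(scores))                  # index of the retrieved image
```

Because every vector is unit-normalised, the dot products in `scores` are cosine similarities, so retrieval reduces to an argmax over the gallery.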