2020
DOI: 10.1007/978-3-030-58542-6_9
Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval

Cited by 46 publications (17 citation statements)
References 34 publications
“…At present, CNN-based image hashing networks have also achieved state-of-the-art results [16,25,53,58], but video hashing has received little attention. The main categories of image retrieval are image-image [5,8,13,14,26,42,43], image-text-image [11,12,54], and hashing algorithms [53,58]. Most researchers focus on image retrieval, and less attention has been paid to video hashing.…”
Section: Related Work
confidence: 99%
“…where x_av, x_iv are the compositional embeddings. This operation is related to prior works that compose multi-modal features [61,14,13], but ours aims at shifting the teacher embedding with a learnable residual. More importantly, to constrain the class assignment of the compositional embeddings, F(•, •) is optimised by the video classification loss (i.e.…”
Section: Compositional Multi-modal Representations
confidence: 99%
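The composition described in the statement above, a teacher embedding shifted by a learnable residual, can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the residual function, its parameterisation, and all variable names (`W`, `compose`, `x_v`, `x_a`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Learnable parameters of a hypothetical residual function F (random init here;
# in practice these would be trained with the video classification loss).
W = rng.normal(scale=0.1, size=(2 * dim, dim))

def compose(x_teacher, x_other, W):
    """Compositional embedding: teacher embedding plus a learned residual
    predicted from the concatenation of both modality embeddings."""
    residual = np.tanh(np.concatenate([x_teacher, x_other]) @ W)
    return x_teacher + residual

x_v = rng.normal(size=dim)   # visual (teacher) embedding
x_a = rng.normal(size=dim)   # audio embedding
x_av = compose(x_v, x_a, W)  # compositional audio-visual embedding, shape (dim,)
```

The point of the residual form is that the compositional embedding stays close to the teacher embedding when the residual is small, which keeps its class assignment anchored.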
“…Based on our observation that image samples are more robustly represented in the joint space, and the task's tendency to incorporate modification sentences into an image representation, we compose text embeddings with image embeddings from the joint space instead, and observe that this outperforms composing with image embeddings from the pretrained embedding space. Parallel to our work, [7] has incorporated side information into the text-based retrieval task. Though the approach is similar, we see the improvement as a proof of concept.…”
Section: Prior Work, Fusion of Vision and Language
confidence: 99%
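The retrieval setup in the statement above, composing a modification-text embedding with an image embedding taken from the joint space and ranking gallery images by similarity, can be sketched like this. All embeddings here are random stand-ins and the additive composition is an assumption; the cited works use learned composition modules.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16

def l2norm(x):
    """Normalise vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_joint = l2norm(rng.normal(size=dim))       # reference image, joint space
text_mod  = l2norm(rng.normal(size=dim))       # modification-sentence embedding
query = l2norm(img_joint + text_mod)           # simple additive composition

gallery = l2norm(rng.normal(size=(100, dim)))  # candidate image embeddings
scores = gallery @ query                       # cosine similarity (unit norms)
best = int(np.argmax(scores))                  # index of the retrieved image
```

Because every vector is unit-normalised, the dot products in `scores` are cosine similarities, so retrieval reduces to an argmax over the gallery.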