2021
DOI: 10.48550/arXiv.2111.05814
Preprint

SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval

Abstract: We tackle the cross-modal retrieval problem, where training is supervised only by the relevant multi-modal pairs in the data. Contrastive learning is the most popular approach for this task; however, its sampling complexity is quadratic in the number of training data points. Moreover, it makes the potentially wrong assumption that instances in different pairs are automatically irrelevant. To address these issues, we propose a novel loss function that is based on self-labeling of the unknown…
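The abstract contrasts standard contrastive training with the paper's self-labeling alternative. The sketch below is a minimal illustration of that distinction, not the authors' implementation: the function names, the `prototypes` tensor, and the temperature values are assumptions, and the balanced-assignment step such methods typically add (see the Sinkhorn sketch further down) is omitted here.

```python
# Minimal sketch (not the SwAMP authors' code) of the two losses the
# abstract contrasts. All names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """InfoNCE over a batch: every non-matching pair is treated as a
    negative -- the 'automatically irrelevant' assumption the abstract
    criticizes -- and the pair comparisons grow quadratically."""
    logits = img_emb @ txt_emb.t() / temperature              # (B, B)
    targets = torch.arange(img_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def swapped_assignment_loss(img_emb, txt_emb, prototypes, temperature=0.1):
    """Self-labeling alternative: score each modality against K class
    prototypes, then train each modality to predict the *other*
    modality's (detached) soft assignment -- the 'swap'."""
    img_scores = img_emb @ prototypes.t() / temperature       # (B, K)
    txt_scores = txt_emb @ prototypes.t() / temperature
    img_codes = F.softmax(img_scores, dim=1).detach()         # pseudo-labels
    txt_codes = F.softmax(txt_scores, dim=1).detach()
    loss_img = -(txt_codes * F.log_softmax(img_scores, dim=1)).sum(1).mean()
    loss_txt = -(img_codes * F.log_softmax(txt_scores, dim=1)).sum(1).mean()
    return 0.5 * (loss_img + loss_txt)
```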

Cited by 2 publications (2 citation statements)
References 50 publications
“…Caron et al [2020] propose SwAV, an image representation technique that instead of comparing representations of different image views, follows a swapped prediction strategy to predict the code of one view from another view of the same image by clustering image representations with a computationally efficient online clustering approach. Similar swapping strategies have been utilized in cross-modal retrieval Kim [2021]. SEER Goyal et al [2021] is trained on large-scale unconstrained and uncurated image collections by improving the scalability of SwAV in terms of GPU memory consumption and training speed.…”
Section: Related Work
confidence: 99%
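The "computationally efficient online clustering" this statement attributes to SwAV is typically a few Sinkhorn normalization iterations that turn prototype scores into balanced soft codes. A hedged sketch of that step follows; `eps` and `n_iters` are illustrative values, not taken from any particular codebase.

```python
# Sketch of the Sinkhorn-style balancing step used by SwAV-type methods
# to compute cluster "codes" online; parameter values are illustrative.
import torch

def sinkhorn_codes(scores, n_iters=3, eps=0.05):
    """scores: (B, K) similarities between B samples and K prototypes.
    Returns (B, K) soft codes whose rows sum to 1 and whose columns are
    approximately balanced, so no prototype collapses."""
    Q = torch.exp(scores / eps).t()          # (K, B) transport plan
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)      # rows: equal mass per prototype
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)      # cols: unit mass per sample
        Q /= B
    return (Q * B).t()                       # (B, K)
```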
“…In addition to this semantic gap, Luo et al [18] use only self-attentions among all modalities, and their method is highly dependent on pretraining. Kim et al [19] use only soft-attentions and ignore interaction between visual features. While Yang et al [20] utilize a fine-grained alignment with an extra loss and exploit special arrangement for hard negatives, they still use only self-attentions among all modalities by ignoring the interaction between visual features.…”
Section: Related Work
confidence: 99%