2023
DOI: 10.1109/tcsvt.2023.3248873
Debiased Video-Text Retrieval via Soft Positive Sample Calibration

Abstract: With the emergence of enormous numbers of videos on various video apps, semantic video-text retrieval has become a critical task for improving the user experience. The primary paradigm for video-text retrieval learns semantic video-text representations in a common space by pulling positive samples close to the query and pushing negative samples away. However, in practice, video-text datasets contain only the annotations of positive samples; the negative samples are randomly drawn from the entire dataset. T…

Cited by 5 publications (1 citation statement)
References 47 publications
“…To enhance the cross-modal interaction capabilities of adapters, CALIP [21] leverages attention maps to fuse text and image features and inserts two fine-tunable linear layers before and after fusion. In addition, Cross-Modal Adapter (CMA) [32] and Multimodal Video Adapter (MV-Adapter) [94] achieve cross-modal interaction by sharing adapter weights between two modalities. These methods consider both single-modal and multi-modal scenarios but do not fully integrate the advantages of each modality.…”
Section: Multi-modal Adapter-based Fine-tuning Adaptation