Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
DOI: 10.1145/3404835.3462927

Improving Video Retrieval by Adaptive Margin

Abstract: Video retrieval is becoming increasingly important owing to the rapid emergence of videos on the Internet. The dominant paradigm for video retrieval learns video-text representations by pushing the distance between the similarity of positive pairs and that of negative pairs apart from a fixed margin. However, negative pairs used for training are sampled randomly, which indicates that the semantics between negative pairs may be related or even equivalent, while most methods still enforce dissimilar representati…
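For readers unfamiliar with the objective the abstract refers to, the following is a minimal sketch of the standard fixed-margin triplet ranking loss and the adaptive-margin variant implied above; the notation (s for similarity, v and t for video and text embeddings, t^- for a sampled negative text) is ours and only illustrates the idea, not the paper's exact formulation.

```latex
% Fixed margin: every negative is pushed away by the same constant m.
\mathcal{L}_{\mathrm{fixed}} = \big[\, m - s(v, t) + s(v, t^{-}) \,\big]_{+}
% Adaptive margin: m is replaced by a pair-dependent value that shrinks
% when the sampled ``negative'' text t^{-} is semantically close to t.
\mathcal{L}_{\mathrm{adaptive}} = \big[\, m(v, t^{-}) - s(v, t) + s(v, t^{-}) \,\big]_{+}
```

A symmetric term with a negative video v^- is typically added. With a randomly sampled but semantically related t^-, the fixed-margin loss would still enforce dissimilarity, which is exactly the issue the abstract raises.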

Cited by 15 publications (10 citation statements). References 40 publications.

Citation statements, ordered by relevance:
“…[3,5,27,41,45,50] are the most widely used datasets in video-text retrieval (VTR). Early works [10,15,20,28,47] used offline features extracted by expert models for modal fusion. Since the emergence of the CLIP [38] model, [31,37] transfer CLIP to the VTR task.…”
Section: Parameter-efficient Transfer Learning (mentioning; confidence: 99%)
“…Video text retrieval (VTR) [10,12,15,20,28,29,31,32,34,37,43,47], aiming to obtain the rankings of videos/texts in a repository given text/video queries (i.e., T2V and V2T, respectively), is a critical multimodal research topic with a wide range of practical applications.…”
Section: Introduction (mentioning; confidence: 99%)
“…Some approaches [4], [26] focus on the bias of the strict assumption in video-text retrieval, i.e., that only a single text is relevant to a query video and vice versa [25]. Hence, some approaches [4], [25]-[27] are proposed to model the one-to-many or many-to-many correspondences in the retrieval task. For example, Patrick et al. [4] introduce a multi-modal cross-instance text generation task as an auxiliary objective to extract the inner one-to-many correspondences of instances for video-text retrieval.…”
Section: B. The Bias in Video-Text Retrieval (mentioning; confidence: 99%)
“…For text-image retrieval, [49] propose a scheduled adaptive margin which starts from a fixed value and gradually changes during training, both to integrate inter-category similarity-based correlations and to preserve the category clusters formed during the initial phases of training. Recently, for cross-modal video retrieval, [25] proposed an adaptive margin proportional to the similarity of the representations computed for the negative pair, both in terms of 'static' (pretrained, frozen) models, which provide initial supervision, and 'dynamic' (trained with the task) models, which provide supervision in later stages of training. Differently from all these works, we propose a margin which is proportional to the relevance value of the queries involved in the triplet, effectively using the semantic knowledge during training.…”
Section: Related Work (mentioning; confidence: 99%)
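As a rough illustration of the adaptive-margin idea described in the statement above, the sketch below computes a per-triplet margin that shrinks as auxiliary models judge the sampled "negative" pair to be more related. The function name, the blending weight alpha, and the use of cosine similarity are assumptions made here for illustration, not the exact formulation of [25].

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(video_emb, text_emb, neg_text_emb,
                                 static_sim, dynamic_sim,
                                 base_margin=0.2, alpha=0.5):
    """Triplet ranking loss with a margin that adapts to how related the
    sampled "negative" text actually is to the video.

    static_sim / dynamic_sim: similarity scores for the (video, negative text)
    pair produced by a frozen pretrained model and by the model being trained,
    respectively, assumed here to lie in [0, 1]. These names and the simple
    linear blend are illustrative assumptions, not the method of [25].
    """
    # Blend the two supervision signals; early in training one would weight
    # the static model more heavily, later the dynamic one (alpha schedule).
    neg_relatedness = alpha * static_sim + (1.0 - alpha) * dynamic_sim

    # The more related the negative pair is, the smaller the margin enforced.
    margin = base_margin * (1.0 - neg_relatedness)

    pos_sim = F.cosine_similarity(video_emb, text_emb, dim=-1)
    neg_sim = F.cosine_similarity(video_emb, neg_text_emb, dim=-1)

    # Hinge: push positive similarity above negative similarity by `margin`.
    return torch.clamp(margin - pos_sim + neg_sim, min=0.0).mean()
```

With the margin scaled this way, a truly unrelated negative is pushed away by close to the full margin, while a semantically equivalent one is barely penalized.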
“…In particular, [49] implemented a schedule for the margin value which gradually incorporates inter-category correlations and information about the structure of the embedding space. Recently, for video retrieval, [25] proposed an adaptive margin proportional to the similarity of item and query as computed by multiple models. Differently from them, we propose to inject semantic knowledge into the training process by means of a relevance-based margin.…”
Section: Introduction (mentioning; confidence: 99%)