Improving Video Retrieval by Adaptive Margin

He, Feng; Wang, Qi; Feng, Zhifan; Jiang, Wenbin; Lu, Ye; Zhu, Yong; Tan, Xiao

doi:10.1145/3404835.3462927

Cited by 15 publications

(10 citation statements)

References 40 publications

“…[3,5,27,41,45,50] are most widely used datasets in video-text retrieval (VTR). Early works [10,15,20,28,47] used offline features extracted by expert models for modal fusion. Since the emergence of the CLIP [38] model, [31,37] transfer CLIP to the VTR task.…”

Section: Parameter-efficient Transfer Learning (mentioning)

confidence: 99%

Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

Zhang, Jin, Gong et al. 2023

Preprint

State-of-the-art video-text retrieval (VTR) methods usually fully fine-tune the pre-trained model (e.g. CLIP) on specific datasets, which may suffer from substantial storage costs in practical applications since a separate model per task needs to be stored. To overcome this issue, we present the premier work on performing parameter-efficient VTR from the pre-trained model, i.e., only a small number of parameters are tunable while freezing the backbone. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter adopts bottleneck structures in both video and text branches and introduces two novel components. The first is a Temporal Adaptation Module employed in the video branch to inject global and local temporal contexts. We also learn weights calibrations to adapt to the dynamic variations across frames. The second is a Cross-Modal Interaction Module that generates weights for video/text branches through a shared parameter space, for better aligning between modalities. Thanks to above innovations, MV-Adapter can achieve on-par or better performance than standard fine-tuning with negligible parameters overhead. Notably, on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet), MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks with large margins. Codes will be released.

“…Video text retrieval (VTR) [10,12,15,20,28,29,31,32,34,37,43,47], aiming to obtain the rankings of videos/texts in a repository given text/video queries (i.e. T2V and V2T respectively) is a critical multimodal research topic with a wide range of practical applications.…”

Section: Introduction (mentioning)

confidence: 99%

Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

Zhang, Jin, Gong et al. 2023

Preprint

“…Some approaches [4], [26] focus on the bias of strict assumption of video-text retrieval, i.e., only a single text is relevant to a query video and vice versa [25]. Hence, some approaches [4], [25]-[27] are proposed to model the one-to-many or many-to-many correspondences in the retrieval task. For example, Patrick et al. [4] introduce a multi-modal cross-instance text generation task as the auxiliary to extract the inner one-to-many correspondences of instances for video-text retrieval.…”

Section: B. The Bias in Video-Text Retrieval (mentioning)

confidence: 99%

Debiased Video-Text Retrieval via Soft Positive Sample Calibration

Zhang, Yang et al. 2023

IEEE Trans. Circuits Syst. Video Technol.

With the emergence of enormous videos on various video apps, semantic video-text retrieval has become a critical task for improving the user experience. The primary paradigm for video-text retrieval learns the semantic video-text representations in a common space by pulling the positive samples close to the query and pushing the negative samples away. However, in practice, the video-text datasets contain only the annotations of positive samples. The negative samples are randomly drawn from the entire dataset. There may exist soft positive samples, which are sampled as negatives but share the same semantics as positive samples. Indiscriminately enforcing the model to push all the negative samples away from the query leads to inaccurate supervision and then misleads the video-text feature representation learning. In this paper, we introduce debiased video-text retrieval objectives that calibrate the punishment of soft positive samples. In particular, we propose a novel uncertainty measure framework to estimate the credibility of negative samples for each instance. Then, the reliability of negative samples is used to find the soft positive samples and rescale their contribution within video-text retrieval losses, including triplet loss and contrastive loss. Experimental results on five widely used datasets demonstrate that our debiased video-text retrieval objectives achieve significant performance improvements and establish a new state-of-the-art.

“…For text-image retrieval, [49] propose a scheduled adaptive margin which starts from a fixed value and gradually changes during the training process both to integrate inter-category similarity-based correlations and to preserve the category clusters formed during the initial phases of the training. Recently, for cross-modal video retrieval [25] proposed an adaptive margin proportional to the similarity of the representations computed for the negative pair, both in terms of 'static' (pretrained, frozen) models, which provide initial supervision, and 'dynamic' (trained with the task) models, which provide supervision in later stages of the training. Differently from all these works, we propose a margin which is proportional to the relevance value of the queries involved in the triplet, effectively using the semantic knowledge during training.…”

Section: Related Work (mentioning)

confidence: 99%
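
The difference between a fixed margin and a per-triplet margin, as discussed in the statement above, can be illustrated with a small sketch. This is not the implementation of any cited paper: the function names, the `scale` parameter, and the linear mapping `scale * (1 - relevance)` are assumptions made only for illustration, and the cited works may define their margins differently.

```python
def triplet_loss(sim_pos, sim_neg, margin):
    """Hinge-style triplet loss on similarity scores (higher = more similar).

    The loss is zero once the positive beats the negative by at least `margin`.
    """
    return max(0.0, margin + sim_neg - sim_pos)


def relevance_margin(relevance, scale=1.0):
    """Hypothetical mapping from relevance in [0, 1] to a margin.

    Assumption: a negative that is highly relevant to the query gets a
    small margin, so it is pushed away less than an irrelevant negative.
    """
    return scale * (1.0 - relevance)


# Same triplet, three different margins.
sim_pos, sim_neg = 0.75, 0.5
fixed = triplet_loss(sim_pos, sim_neg, margin=0.25)
semi_relevant = triplet_loss(sim_pos, sim_neg, relevance_margin(0.5))
irrelevant = triplet_loss(sim_pos, sim_neg, relevance_margin(0.0))
print(fixed, semi_relevant, irrelevant)
```

With a fixed margin every negative is treated as equally irrelevant; with a per-triplet margin the penalty depends on how relevant the negative is to the query.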

“…In particular, [49] implemented a schedule for the margin value which gradually incorporates inter-category correlations and information about the structure of the embedding space. Recently, for video retrieval [25] proposed an adaptive margin proportional to the similarity of item and query as computed by multiple models. Differently from them, we propose to inject semantic knowledge into the training process by means of a relevance-based margin.…”

Section: Introduction (mentioning)

confidence: 99%

Relevance-based Margin for Contrastively-trained Video Retrieval Models

Falcon, Sudhakaran, Serra et al. 2022

Proceedings of the 2022 International Conference on Multimedia Retrieval

Video retrieval using natural language queries has attracted increasing interest due to its relevance in real-world applications, from intelligent access in private media galleries to web-scale video search. Learning the cross-similarity of video and text in a joint embedding space is the dominant approach. To do so, a contrastive loss is usually employed because it organizes the embedding space by putting similar items close and dissimilar items far. This framework leads to competitive recall rates, as they solely focus on the rank of the groundtruth items. Yet, assessing the quality of the ranking list is of utmost importance when considering intelligent retrieval systems, since multiple items may share similar semantics, hence a high relevance. Moreover, the aforementioned framework uses a fixed margin to separate similar and dissimilar items, treating all non-groundtruth items as equally irrelevant. In this paper we propose to use a variable margin: we argue that varying the margin used during training based on how much relevant an item is to a given query, i.e. a relevance-based margin, easily improves the quality of the ranking lists measured through nDCG and mAP. We demonstrate the advantages of our technique using different models on EPIC-Kitchens-100 and YouCook2. We show that even if we carefully tuned the fixed margin, our technique (which does not have the margin as a hyper-parameter) would still achieve better performance. Finally, extensive ablation studies and qualitative analysis support the robustness of our approach. Code will be released at https://github.com/aranciokov/RelevanceMargin-ICMR22.
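
The abstract above evaluates ranking quality with nDCG, which rewards placing highly relevant items near the top of the list rather than only checking the rank of the ground-truth item. A minimal sketch follows, assuming the common linear-gain, log2-discount definition and an ideal ranking built from the same list of items; the exact variant used in the paper may differ, and the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance values given in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG normalized by the DCG of the best possible ordering of the same items.

    Returns 0.0 when no item has positive relevance.
    """
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# A ranking that puts the most relevant item last scores below 1.
print(ndcg([1.0, 0.5, 0.0]))
print(ndcg([0.0, 0.5, 1.0]))
```

Because relevance is graded rather than binary, two rankings with the same recall can receive different nDCG scores.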
