With the explosive growth of videos on video apps, semantic video-text retrieval has become a critical task for improving the user experience. The dominant paradigm for video-text retrieval learns semantic video-text representations in a common space by pulling positive samples close to the query and pushing negative samples away. In practice, however, video-text datasets contain annotations only for positive samples; negatives are drawn at random from the entire dataset. These random draws may include soft positive samples: instances sampled as negatives that nonetheless share the same semantics as the positives. Indiscriminately forcing the model to push all negative samples away from the query yields inaccurate supervision and misleads video-text representation learning. In this paper, we introduce debiased video-text retrieval objectives that calibrate the penalty applied to soft positive samples. In particular, we propose a novel uncertainty-measure framework to estimate the credibility of the negative samples for each instance. This credibility is then used to identify soft positive samples and rescale their contribution within video-text retrieval losses, including the triplet loss and the contrastive loss. Experimental results on five widely used datasets demonstrate that our debiased video-text retrieval objectives achieve significant performance improvements and establish a new state of the art.
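To make the debiasing idea concrete, the sketch below shows one way such rescaling could look inside a contrastive (InfoNCE-style) video-text loss: negatives whose similarity to the query approaches that of the annotated positive are treated as likely soft positives and down-weighted in the denominator. This is a minimal illustrative sketch, not the paper's exact formulation; the function name, the `trust_threshold` heuristic, and the binary credibility mask are our assumptions.

```python
import torch
import torch.nn.functional as F

def debiased_contrastive_loss(video_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              temperature: float = 0.05,
                              trust_threshold: float = 0.8) -> torch.Tensor:
    """Contrastive loss that down-weights suspected soft positives.

    video_emb, text_emb: (B, D) embeddings where row i of each forms a
    positive pair. The credibility heuristic here is illustrative: any
    negative whose similarity exceeds trust_threshold times the positive's
    is treated as a soft positive and excluded from the denominator.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T / temperature                    # (B, B) similarity logits
    pos = sim.diag().unsqueeze(1)                  # (B, 1) positive logits

    # Binary credibility mask: 1 for trusted negatives, 0 for likely
    # soft positives (a continuous rescaling would also be possible).
    credibility = (sim < trust_threshold * pos).float()
    credibility.fill_diagonal_(1.0)                # always keep the true positive

    # Re-weighted softmax denominator: suspected soft positives are removed.
    exp_sim = torch.exp(sim) * credibility
    loss = -(sim.diag() - torch.log(exp_sim.sum(dim=1))).mean()
    return loss
```

The same masking idea carries over to a triplet loss by zeroing (or shrinking) the hinge term for anchor-negative pairs whose credibility falls below the threshold.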