“…To demonstrate the empirical efficiency of SUMA, we train models on MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Caba Heilbron et al., 2015). For a fair comparison, we compare our method only with methods based on CLIP (Radford et al., 2021), i.e., CLIP4Clip (Luo et al., 2022), CLIP2TV (Gao et al., 2021), X-CLIP, DiscreteCodebook (Liu et al., 2022a), TS2-Net (Liu et al., 2022b), CLIP2Video (Park et al., 2022), VCM, HiSE (Wang et al., 2022a), Align&Tell (Wang et al., 2022b), CenterCLIP (Zhao et al., 2022), and X-Pool (Gorti et al., 2022). Implementation details and evaluation protocols are deferred to the Appendix.…”