2022
DOI: 10.1007/978-3-031-19781-9_19

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Cited by 74 publications (36 citation statements)
References 33 publications
“…Representative works such as CLIP [43] project images and natural language descriptions to a common feature space through two separate encoders for contrastive learning, and achieve significant "zero-shot" transferability by pre-training on hundreds of millions of image-text pairs. Subsequently, these pre-trained models have been extended to various downstream tasks and have shown excellent performance, including image classification [81,80], object detection [48,15], semantic segmentation [63,45], and video understanding [34,22,35]. Inspired by these successes, in this work we present the first simple but efficient framework to leverage the rich semantic knowledge of CLIP for few-shot action recognition.…”
Section: Related Work (mentioning)
confidence: 99%
“…CLIP4Clip (Luo et al., 2022) fine-tunes models and investigates three similarity calculation approaches for video-sentence contrastive learning on CLIP (Radford et al., 2021). Further, TS2-Net (Liu et al., 2022b) proposes a novel token shift and selection transformer architecture that adjusts the token sequence and selects informative tokens in both temporal and spatial dimensions from input video samples. Later, DiscreteCodebook (Liu et al., 2022a) proposes to align modalities in a space filled with concepts, which are randomly initialized and updated without supervision, while VCM proposes to construct a space with visual concepts clustered without supervision.…”
Section: Related Work (mentioning)
confidence: 99%
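The token shift and selection ideas attributed to TS2-Net above can likewise be illustrated. In this rough sketch, a slice of token channels is rolled across adjacent frames (the shift), and a lightweight linear scorer keeps the top-k spatial tokens per frame (the selection). The shapes, the scoring head, and the hard top-k are simplifying assumptions, not the paper's exact architecture.

```python
# Rough sketch of token shift across frames and top-k token selection (assumptions noted above).
import torch
import torch.nn as nn

def token_shift(x, shift_ratio=0.25):
    # x: (batch, frames, tokens, dim); roll two channel slices along the frame axis.
    d = x.size(-1)
    k = int(d * shift_ratio)
    out = x.clone()
    out[..., :k // 2] = torch.roll(x[..., :k // 2], shifts=1, dims=1)     # channels from the previous frame
    out[..., k // 2:k] = torch.roll(x[..., k // 2:k], shifts=-1, dims=1)  # channels from the next frame
    return out

class TokenSelect(nn.Module):
    def __init__(self, dim, top_k):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # lightweight saliency scorer (assumption)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, frames, tokens, dim) -> keep the top_k highest-scoring tokens per frame.
        scores = self.score(x).squeeze(-1)             # (b, t, n)
        idx = scores.topk(self.top_k, dim=-1).indices  # (b, t, top_k)
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))
        return torch.gather(x, 2, idx)

x = torch.randn(2, 12, 50, 512)                        # 2 clips, 12 frames, 50 tokens per frame
selected = TokenSelect(512, top_k=16)(token_shift(x))
print(selected.shape)                                  # torch.Size([2, 12, 16, 512])
```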
“…To show the empirical efficiency of our SUMA, we train models on MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). For a fair comparison, we only compare our methods with methods that are based on CLIP (Radford et al., 2021), i.e., CLIP4Clip (Luo et al., 2022), CLIP2TV (Gao et al., 2021), X-CLIP, DiscreteCodebook (Liu et al., 2022a), TS2-Net (Liu et al., 2022b), CLIP2Video (Park et al., 2022), VCM, HiSE (Wang et al., 2022a), Align&Tell (Wang et al., 2022b), CenterCLIP (Zhao et al., 2022), and X-Pool (Gorti et al., 2022). Implementation details and evaluation protocols are deferred to the Appendix.…”
Section: Datasets and Baselines (mentioning)
confidence: 99%
“…This process involves searching for a matching video or caption with a given cross-modal query and has gained increasing attention from researchers [11,19,22,53,87]. In the past years, several video-text benchmarks [1,8,10,71,90] have been proposed to measure performance, which has advanced the development of video-text retrieval [31,46,50,94].…”
Section: Introduction (mentioning)
confidence: 99%