2022
DOI: 10.1109/tip.2022.3147032
TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning

Abstract: Attempting to fully explore the fine-grained temporal structure and global-local chronological characteristics for self-supervised video representation learning, this work takes a closer look at exploiting the temporal structure of videos and proposes a novel self-supervised method named Temporal Contrastive Graph (TCG). In contrast to existing methods that randomly shuffle the video frames or video snippets within a video, our proposed TCG is rooted in a hybrid graph contrastive learning strategy to regar…


Cited by 117 publications (30 citation statements)
References 93 publications
“…The task of action detection and recognition includes two aspects: one is to identify all action instances in a video, and the other is to localize actions spatially and temporally. Current spatial-temporal action detection and recognition models can be divided into two categories: the first [6,[155][156][157][158][159][160][161][162][163] models spatial-temporal relationships with Convolutional Neural Networks (CNNs), and the second [164][165][166][167][168] is based on video transformer structures. Besides, skeleton-based models [169][170][171][172] have recently attracted great attention.…”
Section: Spatial-temporalmentioning
confidence: 99%
“…With the emergence of huge amounts of heterogeneous multi-modal data, including images [1][2][3], videos [4][5][6][7], texts/language [8][9][10], audio [11][12][13][14], and multi-sensor data [15][16][17][18], deep-learning-based methods have shown promising performance on various computer vision and machine learning tasks, for example visual comprehension [19][20][21], video understanding [22][23][24], visual-linguistic analysis [25][26][27], and multi-modal fusion [28][29][30]. However, existing methods rely heavily on fitting data distributions and tend to capture spurious correlations across modalities, and thus fail to learn the essential causal relations behind multi-modal knowledge that would confer good generalization and cognitive abilities.…”
Section: Introductionmentioning
confidence: 99%
“…Xinlei et al [51] proposed the simple Siamese (SimSiam) network, which achieved the best results without negative samples, large batches, or momentum encoders. In addition, contrastive learning has been applied to the field of video processing, achieving excellent performance at the time it was proposed [52].…”
Section: Self-supervised Learningmentioning
confidence: 99%
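The SimSiam objective mentioned in the statement above can be summarized in a few lines: each view's predictor output is pulled toward the other view's projection, with the projection treated as a constant (stop-gradient), and no negative pairs are used. The sketch below is a minimal numpy illustration of the symmetrized loss only, not code from SimSiam or TCGL; all function names and shapes are illustrative.

```python
import numpy as np

def neg_cosine(p, z):
    """One SimSiam loss term: negative cosine similarity between the predictor
    output p of one view and the projection z of the other view.
    In a real training loop, z would be wrapped in a stop-gradient."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(p @ z)

def simsiam_loss(p1, z1, p2, z2):
    # Symmetrized: each view's prediction is matched to the other's projection.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

v = np.ones(4)
print(simsiam_loss(v, v, v, v))  # identical views give the minimum, -1.0
```

The loss is bounded in [-1, 1] and is minimized when the two views' representations align, which is why no negatives or momentum encoder are needed to avoid collapse (the stop-gradient and predictor do that work).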
“…Video pretext tasks explore natural video properties or statistics as supervision signals on unlabeled data, e.g., frame prediction (Vondrick, Pirsiavash, and Torralba 2016; Behrmann, Gall, and Noroozi 2021; Luo et al 2017), spatio-temporal puzzling (Kim, Cho, and Kweon 2019), video statistics (Wang et al 2021b), temporal ordering (Misra, Zitnick, and Hebert 2016; Yao et al 2021), video playback rate prediction (Benaim et al 2020; Jenni, Meishvili, and Favaro 2020), and temporal consistency (Wang, Jabri, and Efros 2019; Jabri, Owens, and Efros 2020). Recently, inspired by the success of contrastive learning on static images, contrastive learning was extended to video self-supervised learning (Alwassel et al 2019; Sermanet et al 2018; Liu et al 2021). Despite the success of contrastive learning and playback rate prediction, contrastive learning approaches focus only on discriminating instances by feature similarity and ignore intermediate states of the learned representation, such as the degree of feature similarity, which limits overall performance.…”
Section: Self-supervised Video Representation Learningmentioning
confidence: 99%
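The instance-discrimination objective that the statement above critiques is typically the InfoNCE loss: an anchor clip is scored against one positive (another view of the same instance) and a set of negatives, and a softmax cross-entropy pushes the positive's similarity above the negatives'. The sketch below is a minimal numpy illustration of that idea, not code from any of the cited works; the names and temperature value are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor: cross-entropy over cosine
    similarities, with the positive pair as the correct class."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Logits: positive similarity first, then one logit per negative.
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits /= temperature
    # Softmax cross-entropy with the positive at index 0.
    return -logits[0] + np.log(np.sum(np.exp(logits)))

rng = np.random.default_rng(0)
a = rng.normal(size=128)
negs = rng.normal(size=(8, 128))
loss_aligned = info_nce(a, a + 0.01 * rng.normal(size=128), negs)  # near-identical positive
loss_random = info_nce(a, rng.normal(size=128), negs)              # unrelated "positive"
print(loss_aligned < loss_random)  # aligned positive yields the lower loss
```

Note how the loss depends only on which similarity is largest, not on how similar the positive actually is once it dominates the negatives: this is the "ignores the degree of feature similarity" limitation the quoted passage raises.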