2022
DOI: 10.48550/arxiv.2207.00419
Preprint
Self-Supervised Learning for Videos: A Survey

Abstract: The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, the use of human-generated annotations leads to models with biased learning, poor domain generalization, and poor robustness. Obtaining annotations is also expensive and requires great effort, which is especially challenging for videos. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in b…

Cited by 4 publications (7 citation statements) · References 94 publications
“…Semi-supervised learning Semi-supervised learning utilizes both labeled and unlabelled samples for training (Sohn et al 2020;Berthelot et al 2019b,a;Tarvainen and Valpola 2017;Oliver et al 2018;Miyato et al 2018;Yang et al 2021;Schiappa, Rawat, and Shah 2022), generally using regularization (Rasmus et al 2015;Tarvainen and Valpola 2017;Sajjadi, Javanmardi, and Tasdizen 2016;Laine and Aila 2017) or pseudo-labeling (Li et al 2021;Lee 2013;Rizve et al 2021b) methods for classification (Berthelot et al 2019b,a;Rizve et al 2021b) and detection (Kumar and Rawat 2022;Rosenberg, Hebert, and Schneiderman 2005).…”
Section: Related Work (mentioning)
confidence: 99%
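The pseudo-labeling methods this statement cites (e.g., Lee 2013) share one core step: a model's confident predictions on unlabeled data are converted into hard training labels. None of the cited works' exact implementations appear here; the following is a minimal sketch of that confidence-thresholding step, with the threshold value and toy probabilities chosen purely for illustration.

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """Keep only predictions whose max class probability clears the
    confidence threshold; return (kept indices, hard labels)."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

# Toy model outputs for 4 unlabeled samples over 3 classes.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident -> kept, label 0
    [0.40, 0.35, 0.25],   # uncertain -> discarded
    [0.02, 0.96, 0.02],   # confident -> kept, label 1
    [0.50, 0.49, 0.01],   # uncertain -> discarded
])
idx, labels = pseudo_label(probs)
print(idx.tolist(), labels.tolist())  # → [0, 2] [0, 1]
```

The retained (sample, pseudo-label) pairs are then mixed into the labeled set for further supervised training; the threshold trades label noise against coverage.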
“…For video action detection, using a pseudo-labeling approach for semi-supervised learning becomes costly and difficult with limited labels (Zhang, Zhao, and Wang 2022;Schiappa, Rawat, and Shah 2022). The pseudo-labeling approach also assumes that a pre-trained object detector or region proposal is available (Ren et al 2020;Zhang, Zhao, and Wang 2022).…”
Section: Related Work (mentioning)
confidence: 99%
“…Video self-supervised learning. Self-supervised learning has been widely used in a variety of different areas, including the challenging task of video representation learning (Schiappa, Rawat, and Shah 2022). Several prior works have attempted to learn video representations through both uni-modal (Feichtenhofer et al 2021;Qian et al 2021;Jing et al 2018) as well as multimodal (Recasens et al 2021;Xiao, Tighe, and Modolo 2022;Han, Xie, and Zisserman 2020) pretraining.…”
Section: Related Work (mentioning)
confidence: 99%
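Several of the uni-modal pretraining works cited above (e.g., Feichtenhofer et al 2021; Qian et al 2021) rely on a contrastive objective: two augmented views of the same video clip should embed close together, and away from other clips in the batch. As a rough illustration only (the batch size, temperature, and random embeddings below are arbitrary, not taken from any cited paper), an InfoNCE-style loss can be sketched as:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss for paired clip embeddings: z1[i] and z2[i]
    are two augmented views of the same video clip."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))                      # 8 clips, 32-dim embeddings
loss_aligned = info_nce(z, z)                     # views agree perfectly
loss_random = info_nce(z, rng.normal(size=(8, 32)))
print(loss_aligned < loss_random)                 # aligned views score lower loss
```

Minimizing this loss pulls matching views together on the diagonal of the similarity matrix while the softmax denominator pushes apart the other clips in the batch.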
“…However, the success of supervised feature learning relies on massive amounts of manually labeled data, which is expensive and requires great effort. As an alternative, a prominent paradigm is the so-called self-supervised learning (SSL) which aims to empower machines without explicit annotations and has shown potential in both image and video domains [4]. SSL can be separated into two main categories: exploiting context and videos.…”
Section: Introduction (mentioning)
confidence: 99%
“…Compared with static images, video-specific pretext tasks (e.g., predicting clip orders, time arrows, and paces) provide richer sources of "supervision" [5]. Therefore, it is not trivial to expand the image-based approaches directly to the video domain due to dynamic and fine-grained movements or actions [4]. In order to avoid human labeling, Walker et al [6] proposed to estimate the motion information based on dense optical flow.…”
Section: Introduction (mentioning)
confidence: 99%
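One of the video-specific pretext tasks this last statement names, predicting clip order, can be made concrete with a small sketch. This is not any cited paper's pipeline; the segment count, dummy frames, and seed below are illustrative assumptions. The idea: shuffle a video's clips by a random permutation and treat the permutation's index as the classification target.

```python
import itertools
import random

def clip_order_sample(video, n_clips=3, seed=None):
    """Split a video (list of frames) into n_clips segments, shuffle them
    by a random permutation, and return (shuffled_clips, permutation_class)
    as one pretext-task training sample."""
    rng = random.Random(seed)
    seg = len(video) // n_clips
    clips = [video[i * seg:(i + 1) * seg] for i in range(n_clips)]
    perms = list(itertools.permutations(range(n_clips)))
    label = rng.randrange(len(perms))   # class = which permutation was applied
    order = perms[label]
    return [clips[i] for i in order], label

video = list(range(12))                 # 12 dummy "frames"
shuffled, label = clip_order_sample(video, n_clips=3, seed=42)
# The model sees `shuffled` and must predict `label`, one of 3! = 6 orderings.
```

Because solving this task requires reasoning about temporal structure rather than static appearance, it supplies the "richer sources of supervision" the quoted passage attributes to video-specific pretext tasks.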